Re: [R] [Fwd: Re: Organisation of medium/large projects with multiple analyses]

From: David Farrar <>
Date: Tue 31 Oct 2006 - 02:58:15 GMT


  It's good to see this sort of thing discussed.    

  For my current approach, I keep a fairly static directory for function libraries, another for large data sets, and others for projects. I try to define "tasks" (probably like your analyses) within "projects": a project folder contains a number of task sub-folders. For data sets I usually write a special pre-processing program that generates a workspace to be read in by the other programs. I embed dates in the names of my task folders.
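The pre-processing idea above can be sketched in a few lines of R: one script prepares the data and saves a workspace, and every task script just loads it. The file and object names here are invented for illustration.

```r
## Pre-processing script: clean the raw data once, save a workspace.
raw <- data.frame(id = 1:5, value = c(2.1, NA, 3.7, 4.0, 1.9))

## Drop incomplete rows and derive any columns the tasks need
clean <- raw[complete.cases(raw), ]
clean$log_value <- log(clean$value)

prepped <- file.path(tempdir(), "prepped-data.RData")
save(clean, file = prepped)        # run once, by the pre-processor

## Each task script then starts with:
rm(clean)
load(prepped)                      # restores `clean` exactly as saved
```

The point is that expensive cleaning runs once, and every downstream task reads the same prepared objects.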

  For Monte Carlo experiments, I rely on automated naming of workspaces: when the simulation parameters change, the file (a workspace) where simulation results are stored is renamed automatically. Otherwise I really can't keep track of things accurately. Additional programs are used for post-processing the simulation output.
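One way to get this kind of automatic naming, sketched here with invented function and parameter names, is to derive the workspace file name from the simulation parameters themselves:

```r
## Hypothetical sketch: build the results file name from the parameters,
## so changing any parameter automatically changes the storage location.
results_file <- function(n_reps, sigma, seed) {
  sprintf("sim_n%d_sigma%.2f_seed%d.RData", n_reps, sigma, seed)
}

params <- list(n_reps = 1000, sigma = 0.25, seed = 42)
fname  <- do.call(results_file, params)
# fname is "sim_n1000_sigma0.25_seed42.RData"
```

Because the name encodes the parameters, results from different parameter settings can never silently overwrite each other.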

  My modifications of functions have mostly involved adding new arguments with defaults, so this has not caused me many problems in practice, although my functions do tend to evolve.
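The reason this evolution is safe is that a new argument with a default leaves every existing call site unchanged. A small made-up example (`summarise_runs` is not a function from the thread):

```r
## Original version was: summarise_runs <- function(x) mean(x)
## Evolved version adds a defaulted `trim` argument, so old calls still work.
summarise_runs <- function(x, trim = 0) mean(x, trim = trim)

old_style <- summarise_runs(c(1, 2, 100))              # old call, unchanged result
new_style <- summarise_runs(c(1, 2, 100), trim = 0.34) # new behaviour is opt-in
```

Old scripts keep running exactly as before, while new scripts can opt into the extra behaviour.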

  One thing I find helpful: after I write a report, I revise the file names of my R programs so that the figure numbers from the report are embedded in them. That way it is easy to start from the report and find the code that generates a particular figure.


Mark Wardle <> wrote:   

Daniel Elliott wrote:
> Mark,
> It sounds like your data/experiment storage and organization needs are
> more complicated than mine, but I'll share my methodology...

Many thanks for this, and for the other replies received off-list. It is much appreciated, and confirms that for something as generically applicable as R, with such widespread and heterogeneous uses, there is no universal solution.

>> I'm still new to R, but have a fair amount of experience with general
>> programming. All of my data is stored in PostgreSQL, and I have a number
>> of R files that generate tables, results, graphs etc. These are then
>> available to be imported into PowerPoint/LaTeX etc.
>> I'm using version control (Subversion), and as with most small projects,
>> now have an ever-increasing number of R scripts, each with fairly
>> specific features.
> I only use version control for generic code. For me, generic code is
> not at the experiment level but at the "algorithm" level. It is only
> code that others would find useful - code that I hope to release to the
> R community. I use object-oriented programming to simplify the more
> specific, experiment-level scripts that I will describe later. These
> objects include plotting and data import/export among other things.
> Like you, many of my experiments are variations on the same theme. I
> have attempted to write general functions that can run many different
> experiments with changes only to parameters, but I have found this far too cumbersome.
> I am now resigned to storing all code and input and generated output
> data and graphs together in a single directory for each experiment with
> the exception of my general libraries. This typically consists of me
> copying the scripts that ran other experiments into a new directory
> where they are (hopefully only slightly) modified to fit the new
> experiment. I wish I had a cooler way to handle all of this, but this
> does make it very easy to rerun stuff. I even create new files, but not
> necessarily new directories, for scripts that differ only in the
> parameters they used when calling functions from my libraries.

I suppose these can either be factored out into more generic functions (time-consuming, and maybe not useful in the longer term), or you could use version control to create branches; then if you improve the copy of a function in one experiment, you have the option of merging your changes back into the other branches automatically.
>> Do you go to the effort of creating a library that solves your
>> particular problem, or only reserve that for more generic functionality?
> I only use libraries and classes for code that is generic enough to be
> usable by the rest of the R community.
>> Do people keep all of their R scripts for a specific project separate,
>> or in one big file?
> Files for a particular project are kept in many different directories
> with little structure. Experiment logs (like informal lab reports) are
> used if I need to revisit or rerun an experiment. By the way, I back
> all of this up onto tape drive or DVD.
>> I can see advantages (knowing it all works) and
>> disadvantages (time for it all to run after minor changes) in both
>> approaches, but it is unclear to me which is "better". I do know that
>> I've set up a variety of analyses, moved on to other things, only to
>> find later on that old scripts have stopped working because I've changed
>> some interdependency. Does anyone go as far as to use test suites to
>> check for sane output (apart from doing things manually)? Note I'm not
>> asking about how to run R on all these scripts, as people have already
>> suggested makefiles.
> I try really, really, really hard never to change my libraries. If I need
> to modify one of the algorithms in a library, I create a new method within
> the same library. Since you use version control (which is totally
> awesome; do you use it for your writing as well?), hopefully you will be
> able to quickly figure out why an old script doesn't work (in theory this
> should only be caused by function-name changes).
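A lightweight version of the test-suite idea raised above can be sketched in base R with stopifnot(); the `result` data frame and the checks are hypothetical, not from the thread.

```r
## Pretend a sourced analysis script produced this object:
set.seed(1)
result <- data.frame(patient = 1:10, score = rnorm(10))

## Sanity checks on the output; any failure stops with an error,
## which is often enough to catch a broken interdependency early.
stopifnot(
  is.data.frame(result),
  nrow(result) > 0,
  all(c("patient", "score") %in% names(result)),
  !any(is.na(result$score))
)
```

Running such a file after each change gives a crude but automated check that old scripts still produce sane output.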

My whole project is stored in Subversion, even my data-collection forms (which are in MS Word format), as it lets me branch and rewind to see what has been sent. I'm afraid I even include my FileMaker databases, as it means I have a rolling backup. A further big advantage is that I can keep all my work files on two separate computers, and I keep the two in sync automatically by judicious updating and merging. My main writing is in LaTeX, and version control clearly excels with such plain-text documents. I really would recommend it! I've used TortoiseSVN on Windows, and it works superbly, although on my primary machine (a Mac) I just use the command line.

My scripts tend to break because I fiddle with the database schema to support some new analysis, and then when I revisit old scripts they tend not to work. Based on all the advice, I shall have to factor out the database connection and query functions and use "source()" to include them in higher-level scripts.
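That factoring-out plan can be sketched as follows; the helper names are invented, and a throwaway file stands in for a real "db-utils.R" so the example is self-contained:

```r
## Write the shared helpers to one file (in practice this would be a
## checked-in "db-utils.R" holding the real DB connection/query code).
helpers <- tempfile(fileext = ".R")
writeLines(c(
  "get_connection <- function() 'pretend-db-handle'",
  "fetch_scores   <- function(con) data.frame(id = 1:3, score = c(9, 7, 8))"
), helpers)

## Every analysis script then starts the same way:
source(helpers)            # in practice: source('db-utils.R')
con    <- get_connection()
scores <- fetch_scores(con)
```

When the schema changes, only the one sourced file needs updating, and every higher-level script picks up the fix.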

>> I realise these are vague high-level questions, and there won't be any
>> "right" or "wrong" answers, but I'm grateful to hear about different
>> strategies for organising R analyses/files, and how people solve these
>> problems. I've not seen this kind of thing covered in any of the
>> textbooks. Apologies for being so verbose!
> Not sure one could be TOO verbose here! I am constantly looking for
> bulletproof ways to manage these complex issues. Sadly, in the past, I
> may have done so to a fault. I feel that the use of version control for
> generic code and formal writing is very important.

Many thanks,

Best wishes,


Dr. Mark Wardle
Clinical research fellow and Specialist Registrar in Neurology,

______________________________________________ mailing list
PLEASE do read the posting guide
and provide commented, minimal, self-contained, reproducible code.


Received on Tue Oct 31 14:05:53 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Tue 31 Oct 2006 - 04:30:13 GMT.
