[R] Organisation of medium/large projects with multiple analyses

From: Mark Wardle <mark_at_wardle.org>
Date: Thu 26 Oct 2006 - 21:00:30 GMT

Dear all,

I'm still new to R, but have a fair amount of experience with general programming. All of my data is stored in PostgreSQL, and I have a number of R files that generate tables, results, graphs etc. These are then available to be imported into PowerPoint/LaTeX etc.

I'm using version control (Subversion) and, as with most small projects, now have an ever-increasing number of R scripts, each with fairly specific features. As any project grows, there are always issues regarding interdependencies, shared commonality (e.g. accessing the same data store), and old scripts stopping working because of changes made elsewhere (e.g. to the data schema). For example, I might have specific inclusion and exclusion criteria for patients, and the corresponding SQL query may have to be included in a number of analyses; I'm tempted to factor this out into a project-specific data access library, but is that over the top?
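To make the idea concrete, here is roughly what I have in mind (the table and column names are invented, and at runtime this would need library(DBI) plus a PostgreSQL driver):

```r
## Shared helper, e.g. in common/patients.R, source()d by each analysis.
## Requires library(DBI) and a PostgreSQL driver when actually run.

# Single definition of the inclusion/exclusion criteria.
patients.query <- function() {
    paste("SELECT patient_id, dob, diagnosis",
          "FROM patients",
          "WHERE consented = TRUE AND excluded = FALSE")
}

# Every analysis fetches its cohort through this one function,
# so a schema change only needs fixing in one place.
fetch.patients <- function(con) {
    dbGetQuery(con, patients.query())
}
```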

This is a very long-winded and roundabout way of asking how people organise medium-sized projects. Do people create their own "libraries" for specific projects for shared functionality, or just liberally use source() for this kind of thing? What about namespaces? I've got unwieldy-sounding functions like ataxia.repeats.plot.alleles(), and often these functions are not particularly generic and are only called three or four times, but they do save repetition.
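One halfway house I've considered, instead of ever-longer dotted names, is grouping the project's helpers in a list (the function body here is purely illustrative):

```r
## A list as a poor man's namespace for project-specific helpers.
ataxia <- list(
    repeats.plot.alleles = function(d) {
        plot(d$allele1, d$allele2,
             xlab = "Allele 1 repeat length",
             ylab = "Allele 2 repeat length")
    }
)
## Called as ataxia$repeats.plot.alleles(alleles.df)
```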

Do you go to the effort of creating a library that solves your particular problem, or reserve that for more generic functionality? Do people keep all of their R scripts for a specific project separate, or in one big file? I can see advantages (knowing it all works) and disadvantages (the time for it all to run after minor changes) in both approaches, but it is unclear to me which is "better". I do know that I've set up a variety of analyses, moved on to other things, only to find later that old scripts have stopped working because I've changed some interdependency. Does anyone go as far as using test suites to check for sane output (apart from checking manually)? Note I'm not asking about how to run R on all these scripts, as people have already suggested makefiles.
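By "sane output" I mean something as simple as a few stopifnot() assertions run after each analysis, along these lines (again, the column names are hypothetical):

```r
## Minimal sanity checks on a fetched cohort; any failure stops the
## script with an informative error rather than silently producing
## wrong tables downstream.
check.patients <- function(patients) {
    stopifnot(nrow(patients) > 0)
    stopifnot(all(c("patient_id", "dob") %in% names(patients)))
    stopifnot(!any(duplicated(patients$patient_id)))
    invisible(TRUE)
}
```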

I realise these are vague, high-level questions and there won't be any "right" or "wrong" answers, but I'd be grateful to hear about different strategies for organising R analyses/files and how people solve these problems. I've not seen this kind of thing covered in any of the textbooks. Apologies for being so verbose!

Best wishes,


Dr. Mark Wardle
Clinical research fellow and Specialist Registrar in Neurology,
C2-B2 link, Cardiff University, Heath Park, CARDIFF, CF14 4XN. UK

R-help@stat.math.ethz.ch mailing list
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Fri Oct 27 07:08:29 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sun 29 Oct 2006 - 06:30:13 GMT.
