Re: [R] R on a supercomputer

From: Sean Davis <>
Date: Tue 11 Oct 2005 - 21:01:23 EST

On 10/10/05 3:54 PM, "Kimpel, Mark William" <> wrote:

> I am using R with Bioconductor to perform analyses on large datasets
> using bootstrap methods. In an attempt to speed up my work, I have
> inquired about using our local supercomputer and asked the administrator
> if he thought R would run faster on our parallel network. I received the
> following reply:
> "The second benefit is that the processors have large caches.
> Briefly, everything is loaded into cache before going into the
> processor. With large caches, there is less movement of data between
> memory and cache, and this can save quite a bit of time. Indeed, when
> programmers optimize code they usually think about how to do things to
> keep data in cache as long as possible.
> Whether you would receive any benefit from larger cache depends on how
> R is written. If it's written such that data remain in cache, the
> speed-up could be considerable, but I have no way to predict it."
> My question is, "is R written such that data remain in cache?"

Under the cluster model (which may or may not be what you are calling a supercomputer--I don't know the exact terminology here), jobs made up of repetitive, independent tasks, like computing a statistic on each bootstrap replicate, can benefit from parallelization IF the I/O overhead of farming out the work does not outweigh the benefit of using multiple processors. For example, if you are running 10,000 replicates and each takes 1 ms, the whole job takes 10 seconds on a single processor. One could imagine spreading that same work over 1000 processors and finishing in 10 ms, but if the associated I/O (network transfer, moving data into cache, etc.) adds, say, 10 seconds of overhead, then the parallel job also takes AT LEAST 10 seconds, no faster than the single processor. If, on the other hand, each replicate takes 1 second, the whole job takes 10,000 seconds on a single processor but only about 11 seconds on 1000 processors (10 seconds of computation plus roughly 1 second of overhead). This arithmetic is only approximate, but I hope it makes the point.
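To make the back-of-the-envelope arithmetic concrete, here it is written out in R (the numbers are the illustrative ones from the paragraph above, not measurements from any real machine):

```r
## Illustrative numbers only -- not measurements.
n_reps  <- 10000   # bootstrap replicates
n_procs <- 1000    # processors in the hypothetical cluster

## Case 1: 1 ms per replicate -- overhead dominates
t_rep    <- 0.001                                    # seconds per replicate
overhead <- 10                                       # assumed total I/O overhead (s)
serial1   <- n_reps * t_rep                          # 10 seconds
parallel1 <- (n_reps / n_procs) * t_rep + overhead   # ~10.01 seconds: no win

## Case 2: 1 s per replicate -- parallelism wins
t_rep    <- 1
overhead <- 1                                        # assumed I/O overhead (s)
serial2   <- n_reps * t_rep                          # 10000 seconds
parallel2 <- (n_reps / n_procs) * t_rep + overhead   # ~11 seconds
```

The crossover is entirely a function of how long one replicate runs relative to the fixed cost of shipping work and data around.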

We have begun to use a 60-node Linux cluster for some of our work (also microarray-based), running MPI via the snow package, with very nice results for multiple independent, long-running tasks. snow is VERY easy to use, but one can also drop down to Rmpi when finer-grained control over the parallelization is needed.
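For concreteness, a minimal snow sketch along the lines of what we do; the data and the bootstrap statistic here are made up for illustration. type = "SOCK" runs on an ordinary multicore machine, while type = "MPI" (requiring Rmpi) is what you would use on the cluster:

```r
library(snow)

## Hypothetical bootstrap statistic -- stands in for the real analysis
boot_stat <- function(i, x) mean(sample(x, replace = TRUE))

x  <- rnorm(500)                      # made-up data for the example
cl <- makeCluster(4, type = "SOCK")   # use type = "MPI" on an MPI cluster
res <- parSapply(cl, 1:1000, boot_stat, x = x)   # replicates split across workers
stopCluster(cl)

length(res)   # one bootstrap mean per replicate
```

Because each replicate is independent, parSapply simply splits the 1000 iterations across the workers and collects the results; the calling code looks almost identical to a plain sapply.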

As for caching behavior, and how R would perform on such a machine without explicitly parallelized R code, I can't really comment; my experience is limited to the cluster model with parallelized R code.

Sean

Received on Tue Oct 11 21:19:28 2005

This archive was generated by hypermail 2.1.8 : Sun 23 Oct 2005 - 18:40:14 EST