Re: [R] gc() and memory efficiency

From: Dirk Eddelbuettel <edd_at_debian.org>
Date: Mon, 4 Feb 2008 20:36:42 -0600

On 4 February 2008 at 20:45, Doran, Harold wrote:
| I have a program which reads in a very large data set, performs some analyses, and then repeats this process with another data set. As soon as the first set of analyses is complete, I remove the very large object and clean up to try and make memory available in order to run the second set of analyses. The process looks something like this:
|
| 1) read in data set 1 and perform analyses
| rm(list=ls())
| gc()
| 2) read in data set 2 and perform analyses
| rm(list=ls())
| gc()
| ...
|
| But, it appears that I am not making the memory that was consumed in step 1 available back to the OS as R complains that it cannot allocate a vector of size X as the process tries to repeat in step 2.
|
| So, I close and reopen R and then drop in the code to run the second analysis. When this is done, I close and reopen R and run the third analysis.
|
| This is terribly inefficient. Instead I would rather just source in the R code and let the analyses run over night.
|
| Is there a way that I can use gc() or some other function more efficiently rather than having to close and reopen R at each iteration?

I haven't found one.

Every (trading) day I process batches of data with R, and the only reliable way I have found is to use fresh R sessions. Otherwise, the fragmented memory eventually results in the all-too-familiar 'cannot allocate X Mb' error for rather small values of X relative to my total RAM. C'est la vie.
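Concretely, that means making each data set its own self-contained script -- read, analyse, write results, quit -- so the OS gets all the memory back when the process exits. A minimal sketch (the file names and the lm() call are just hypothetical placeholders for your actual analysis):

    ## analysis1.R -- one self-contained script per data set; run it in its
    ## own R process so all memory returns to the OS when the process ends
    dat <- read.csv("dataset1.csv")        # read in data set 1
    fit <- lm(y ~ x, data = dat)           # placeholder for the real analysis
    save(fit, file = "results1.RData")     # persist results before exiting
    ## no rm()/gc() needed -- quitting the process releases everything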

As gc() seems to help somewhat yet not 'sufficiently', fresh starts are the practical alternative, and Rscript starts faster than the main R executable. Now, I happen to be partial to littler [1], which starts even faster because it embeds R directly, so I use that (on Linux; I am not sure whether it can be built on Windows). But either one should help you with some batch files -- giving you a way to run overnight. And once you start batching things, it is only a small step to regain efficiency through parallel execution using something like MPI or NWS.
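A tiny driver is then all it takes to string those scripts together for an overnight run. Here is a sketch in R using system(), with the hypothetical script names from above -- a plain shell script or cron job calling Rscript (or littler's 'r') does the same job:

    ## run_all.R -- hypothetical overnight driver; each analysis runs in its
    ## own fresh Rscript process, so fragmentation never accumulates
    scripts <- c("analysis1.R", "analysis2.R", "analysis3.R")
    for (s in scripts) {
        status <- system(paste("Rscript", shQuote(s)))  # fresh process per data set
        if (status != 0) warning("non-zero exit status for ", s)
    }

Parallel execution is then just a matter of launching several of those processes at once, e.g. from MPI or NWS workers as mentioned.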

Hth, Dirk

[1] littler is the predecessor to Rscript, by Jeff and myself. See either

        http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/LittleR

or

        http://dirk.eddelbuettel.com/code/littler.html

for more on littler, and feel free to email us.

-- 
Three out of two people have difficulties with fractions.

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.