Re: [R] assigning and saving datasets in a loop, with names changing with "i"

From: Tony Plate <tplate_at_acm.org>
Date: Fri, 21 Dec 2007 17:55:10 -0700

Marie Pierre Sylvestre wrote:
> Dear R users,
>
> I am analysing a very large data set and I need to perform several data
> manipulations. The dataset is so big that the only way I can play with it
> without having memory problems (E.g. "cannot allocate vectors of size...")
> is to write a batch script to:
>
> 1. cut the data into pieces
> 2. save the pieces in separate .RData files
> 3. Remove everything from the environment
> 4. load one of the pieces
> 5. perform the manipulations on it
> 6. save it and remove it from the environment
> 7. Redo 4-6 for every piece
> 8. Merge everything together at the end
>
> It works if coded line by line but since I'll have to perform these tasks
> on other data sets, I am trying to automate this as much as I can.
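For reference, the eight steps quoted above can be sketched in base R alone. This is only a minimal illustration under stated assumptions, not the poster's actual script: `big`, `process_piece`, the grouping column `g`, and the file names are all hypothetical stand-ins, and the pieces are written to a temporary directory.

```r
# Hypothetical stand-ins for the real data set and manipulation function
big <- data.frame(g = rep(1:4, each = 5), x = 1:20)
process_piece <- function(d) transform(d, y = x * 2)

out_dir <- file.path(tempdir(), "pieces")
dir.create(out_dir, showWarnings = FALSE)

# Steps 1-3: cut the data, save each piece to its own file, free the memory
pieces <- split(big, big$g)
piece_files <- file.path(out_dir,
                         paste("piece", names(pieces), ".RData", sep = ""))
for (k in seq_along(pieces)) {
    piece <- pieces[[k]]
    save(piece, file = piece_files[k])
}
rm(big, pieces, piece)

# Steps 4-7: load one piece at a time into a scratch environment,
# process it, overwrite its file, and drop it again
for (f in piece_files) {
    e <- new.env()
    load(f, envir = e)
    piece <- process_piece(e$piece)
    save(piece, file = f)
    rm(e, piece)
}

# Step 8: merge the processed pieces
merged <- do.call(rbind, lapply(piece_files, function(f) {
    e <- new.env()
    load(f, envir = e)
    e$piece
}))
```

Loading into a scratch environment keeps each piece out of the global workspace, so removing the environment is enough to release it before the next iteration.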

The trackObjs package is designed to make it easy to work in approximately this manner -- it saves objects automatically to disk but they are still accessible as normal.

Here's how you could do the above. This example works with ten 8Mb objects in an R session with a vector-memory limit of 40Mb.

# allow R only 40Mb of vector memory
mem.limits(vsize=40e6)
mem.limits()/1e6
library(trackObjs)
# start tracking to store data objects in the directory 'data'
# each object is 8Mb, and we store 10 of them
track.start("data")
n <- 10
m <- 1e6
constructObject <- function(i) i+rnorm(m)
# steps 1, 2 & 3:
for (i in 1:n) {
    xname <- paste("x", i, sep="")
    cat("", xname)
    assign(xname, constructObject(i))
    # store in a file, accessible by name:
    track(list=xname)
}
cat("\n")
gc(TRUE)
# accessing object by name
object.size(x1)/2^20 # In Mb

mean(x1)
mean(x2)
gc(TRUE)

# steps 4:6
# accessing object through a constructed name
result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
result
# remove the data objects
track.remove(list=paste("x", 1:n, sep=""))
track.stop()

Here's a full transcript of the above. Note how whenever gc() is called there is hardly any vector memory in use.

 > # allow R only 40Mb of vector memory
 > mem.limits(vsize=40e6)
   nsize    vsize
      NA 40000000
 > mem.limits()/1e6
nsize vsize
   NA    40

 > library(trackObjs)
 > # start tracking to store data objects in the directory 'data'
 > # each object is 8Mb, and we store 10 of them
 > track.start("data")
 > n <- 10
 > m <- 1e6
 > constructObject <- function(i) i+rnorm(m)
 > # steps 1, 2 & 3:
 > for (i in 1:n) {
+    xname <- paste("x", i, sep="")
+    cat("", xname)
+    assign(xname, constructObject(i))
+    # store in a file, accessible by name:
+    track(list=xname)
+ }
  x1 x2 x3 x4 x5 x6 x7 x8 x9 x10> cat("\n")

 > gc(TRUE)
Garbage collection 19 = 6+0+13 (level 2) ...
4.0 Mbytes of cons cells used (42%)
0.7 Mbytes of vectors used (5%)
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 148362  4.0     350000  9.4         NA   350000  9.4
Vcells  89973  0.7    1950935 14.9       38.2  2112735 16.2
 > # accessing object by name
 > object.size(x1)/2^20 # In Mb
[1] 7.629417
 > mean(x1)
[1] 0.998635
 > mean(x2)
[1] 1.999656
 > gc(TRUE)
Garbage collection 22 = 7+1+14 (level 2) ...
4.0 Mbytes of cons cells used (43%)
0.7 Mbytes of vectors used (6%)
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 149264  4.0     350000  9.4         NA   350000  9.4
Vcells  90160  0.7    1560747 12.0       38.2  2112735 16.2
 > # steps 4:6
 > result <- sapply(1:n, function(i) mean(get(paste("x", i, sep=""))))
 > result
  [1] 0.998635 1.999656 2.997368 4.000197 5.000159 6.001216 6.999552
  [8] 7.999743 8.999982 10.001355
 > # remove the data objects
 > track.remove(list=paste("x", 1:n, sep=""))
  [1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10"
 > track.stop()
 >

>
> I am using a loop in which I used 'assign' and 'get' (pseudo code below).
> My problem is when I use 'get', it prints the whole object on the screen.
> I am wondering whether there is a more efficient way to do what I need to
> do. Any help would be appreciated. Please keep in mind that the whole
> process is quite computer-intensive, so I can't keep everything in the
> environment while R performs calculations.
>
> Say I have 1 big dataframe called data. I use 'split' to divide it into a
> list of 12 dataframes (call this list my.list)
>
> my.fun is a function that takes a dataframe, performs several
> manipulations on it and returns a dataframe.
>
>
> for (i in 1:12){
> assign( paste( "data", i, sep=""), my.fun(my.list[i])) # this works
> # now I need to save this new object as a RData.
>
> # The following line does not work
> save(paste("data", i, sep = ""), file = paste( paste("data", i, sep =
> ""), "RData", sep="."))
> }
>
> # This works but it is a bit convoluted!!!
> temp <- get(paste("data", i, sep = ""))
> save(temp, file = "lala.RData")
> }
>
>
> I am *sure* there is something more clever to do but I can't find it. Any
> help would be appreciated.
>
> best regards,
>
> MP
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
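On the quoted loop itself: save() expects object names, so passing it the result of paste() directly fails. The names can instead be given as a character vector through its 'list' argument. The sketch below is a hedged illustration in plain base R, not part of the trackObjs example; `data`, `my.list` and `my.fun` here are small hypothetical stand-ins for the poster's objects, and the files go to a temporary directory.

```r
# Hypothetical stand-ins for the poster's objects
data <- data.frame(g = rep(1:3, each = 2), x = 1:6)
my.list <- split(data, data$g)
my.fun <- function(d) transform(d, x2 = x^2)

out_dir <- tempdir()
for (i in seq_along(my.list)) {
    xname <- paste("data", i, sep = "")
    # [[i]] extracts the data frame itself; [i] would give a one-element list
    assign(xname, my.fun(my.list[[i]]))
    # 'list=' takes the names of the objects to save as a character vector
    save(list = xname,
         file = file.path(out_dir, paste(xname, "RData", sep = ".")))
    # remove the object so only one piece is in memory at a time
    rm(list = xname)
}

# Reloading a file restores the object under its original name
load(file.path(out_dir, "data2.RData"))
```

With 'list=' there is no need for get() or a temporary copy, and each file restores an object carrying its own name (data1, data2, ...) rather than a shared name such as temp.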



Received on Sat 22 Dec 2007 - 00:57:55 GMT
