Re: [R] large data set, error: cannot allocate vector

From: Jason Barnhart <jasoncbarnhart_at_msn.com>
Date: Sat 06 May 2006 - 09:48:35 EST

Please try memory.limit() to confirm how much system memory is available to R.

Additionally, read.delim() returns a data.frame. You could use the colClasses argument to control the variable types (see the example below), or use scan(), which returns a vector and stores the data much more compactly: the vector object is significantly smaller than the data.frame.

It appears from your example session that you are examining a single variable. If so, a vector would suffice.

Note that in the example below, summing large numbers stored as integers triggers an integer overflow, so sum() returns NA with a warning.
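The overflow is easy to reproduce directly. R's integer type is 32-bit, so any sum past .Machine$integer.max (2147483647) overflows, while converting to double first avoids the problem. A minimal sketch:

```r
x <- c(.Machine$integer.max, 1L)   # two integers whose sum exceeds the 32-bit range
sum(x)               # NA, with an "integer overflow" warning
sum(as.numeric(x))   # 2147483648, computed in double precision
```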

====================Begin Session====================================
> # create vector
> foovector <- scan(file="temp.txt")
Read 2490368 items
>
> # create data.frame
> foo <- read.delim(file="temp.txt", row.names=NULL, header=FALSE, colClasses=as.vector(c("numeric")))
> attributes(foo)$names <- "myfoo"
>
> foo2 <- read.delim(file="temp.txt", row.names=NULL, header=FALSE, colClasses=as.vector(c("integer")))
> attributes(foo2)$names <- "myfoo"
>
> # vector from data.frame
> tmpfoo <- foo$myfoo
>
> # check size
> object.size(foo)
[1] 119538076
> object.size(foo2)
[1] 109576604
> object.size(foovector)
[1] 19922972
> object.size(tmpfoo)
[1] 19922972
>
> # check sums
> sum(tmpfoo)
[1] 2.498528e+13
> sum(foo$myfoo)
[1] 2.498528e+13
> sum(foo2$myfoo)
[1] NA
Warning message:
Integer overflow in sum(.); use sum(as.numeric(.))
> sum(foovector)
[1] 2.498528e+13
>
> # show type
> class(foo2$myfoo)
[1] "integer"
> class(foo$myfoo)
[1] "numeric"
> class(tmpfoo)
[1] "numeric"
> class(foovector)
[1] "numeric"
====================End Session====================================
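For reference, the vector sizes above line up with R's per-element storage: doubles take 8 bytes and integers 4, so a back-of-the-envelope check (a sketch; the small remainder is object-header overhead) is:

```r
n <- 2490368   # items read by scan() above
n * 8          # bytes for a double vector: 19922944, within a few
               # header bytes of the 19922972 reported above
n * 4          # an integer vector needs about half: 9961472
```

The data.frame's extra ~100 MB on top of the bare column is presumably per-row overhead such as character row names.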

>
> On May 5, 2006, at 11:30 AM, Thomas Lumley wrote:
>> In addition to Uwe's message it is worth pointing out that gc() reports
>> the maximum memory that your program has used (the rightmost two columns).
>> You will probably see that this is large.
>
> Reloading the 10 MM dataset:
>
> R > foo <- read.delim("dataset.010MM.txt")
>
> R > object.size(foo)
> [1] 440000376
>
> R > gc()
>             used  (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells  10183941 272.0   15023450 401.2 10194267 272.3
> Vcells  20073146 153.2   53554505 408.6 50086180 382.2
>
> Combined, Ncells and Vcells appear to take up about 700 MB of RAM,
> which is about 25% of the 3 GB available under Linux on a 32-bit
> architecture. Also, removing foo seemed to free up "used" memory,
> but didn't change the "max used" columns:
>
> R > rm(foo)
>
> R > gc()
>           used (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells  186694  5.0   12018759 321.0 10194457 272.3
> Vcells   74095  0.6   44173915 337.1 50085563 382.2
>
> Regards,
> - Robert
> http://www.cwelug.org/downloads
> Help others get OpenSource software. Distribute FLOSS
> for Windows, Linux, *BSD, and MacOS X with BitTorrent
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
>
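One addendum to the gc() output quoted above: the "max used" columns are high-water marks, so rm() and a plain gc() will not lower them. In reasonably recent versions of R, gc(reset = TRUE) resets those statistics so you can re-measure a later peak. A minimal sketch:

```r
x <- numeric(5e6)   # allocate roughly 40 MB of doubles
rm(x)
gc()                # "max used" still reflects the earlier peak
gc(reset = TRUE)    # reset the max-used statistics to current usage
gc()                # "max used" is now close to "used"
```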



Received on Sat May 06 10:07:51 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 10 May 2006 - 04:09:58 EST.
