Re: [R] large data set, error: cannot allocate vector

From: Jason Barnhart <jasoncbarnhart_at_msn.com>
Date: Wed 10 May 2006 - 04:32:30 EST

1) So the original problem remains unsolved? You can load the data but lack the memory to do more (or so it appears). It seems to me that your options are:
   a) ensure that the --max-mem-size option is allowing R to utilize all available RAM
   b) sample if possible, i.e., are all 20MM entries necessary?
   c) load in matrices or vectors, then "process" or analyze
   d) load the data into a database that R connects to, and use that engine for processing
   e) drop unnecessary columns from the data.frame
   f) analyze subsets of the data (variable-wise: review fewer vars at a time)
   g) buy more RAM (32- vs 64-bit architecture should not be the issue, since you use Linux)
   h) ???

2) Not finding memory.limit() is very odd. You should consider reviewing the bug reporting process to determine if this should be reported. Here's an example of my output:

   > memory.limit()
   [1] 1782579200

3) This may not be the correct way to look at the timing differences you experienced. However, it seems R is holding up well.

   Real time in seconds, from the time -p output quoted below:

                      10MM    100MM   ratio 100MM/10MM
  cat                 0.04     7.60   190.00
  scan                9.93    92.27     9.29
  ratio scan/cat    248.25    12.14
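
Two of the options listed above (sampling, and dropping unnecessary columns) can be tried without any extra software. The following is only a rough sketch under stated assumptions: the file, the column names (id, score, label) and the 10% sampling rate are all made up for illustration and are not taken from the real data set.

```r
# Build a small tab-delimited file to stand in for the real data.
# The columns id, score and label are hypothetical.
tmp <- tempfile(fileext = ".txt")
set.seed(1)
toy <- data.frame(id    = 1:1000,
                  score = rnorm(1000),
                  label = sample(c("a", "b", "c"), 1000, replace = TRUE))
write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# Drop unnecessary columns at read time: a "NULL" entry in colClasses
# tells read.delim to skip that column entirely.
slim <- read.delim(tmp, colClasses = c("integer", "numeric", "NULL"))

# Sample: keep a random 10% of the rows for exploratory work.
keep <- sample(nrow(slim), size = nrow(slim) %/% 10)
slim_sample <- slim[keep, ]

dim(slim)
dim(slim_sample)
unlink(tmp)
```

Note that sampling this way still reads every row first, so it reduces the memory needed for later analysis rather than for the load itself.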

Please let me know how you resolve this. I'm curious about your solution.

HTH,
-jason

>
> On May 5, 2006, at 6:48 PM, Jason Barnhart wrote:
>> Please try memory.limit() to confirm how much system memory is available
>> to R.
>
> Unfortunately, memory.limit() is not available:
>
> R > memory.limit()
> Error: could not find function "memory.limit"
>
> Did you mean mem.limits()?
>
> R > mem.limits()
> nsize vsize
> NA NA
>
>> Additionally, read.delim returns a data.frame. You could use the colClasses
>> argument to change variable types (see example below) or use scan(), which
>> returns a vector. This would store the data more compactly. The vector
>> object is significantly smaller than the data.frame.
>>
>> It appears from your example session that you are examining a single
>> variable. If so, a vector would suffice.
>
> Yes, a vector worked very nicely (see below.) In fact, using the vector
> method R was able to read in the 10 MM entry data set much faster than a
> data.frame.
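
For anyone following along, the effect of the storage type discussed above can be checked directly with object.size(). This is a minimal sketch on a made-up single-column file (the file and its contents are assumptions, not the poster's data). It reads the same values as double and as integer with scan(), and as a one-column data.frame with colClasses.

```r
# A made-up single-column file of whole numbers.
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("%d", 1:100000), tmp)

# scan() returns a plain vector; by default it reads doubles.
v_dbl <- scan(tmp, quiet = TRUE)
# Asking for integers halves the per-value storage (4 bytes instead of 8).
v_int <- scan(tmp, what = integer(), quiet = TRUE)
# read.delim returns a data.frame; colClasses fixes the column type up front.
df <- read.delim(tmp, header = FALSE, colClasses = "integer")

print(object.size(v_dbl))
print(object.size(v_int))
print(object.size(df))
unlink(tmp)
```

The exact sizes printed depend on the R version, so none are quoted here.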
>
> The reason I have stayed with data.frames is because my "real" data is of
> a mixed type, much like a database table or spreadsheet. Unfortunately,
> my real data set takes too long to work with (~20 MM entries of mixed
> type which requires over 20 minutes just to load the data into R.) In
> contrast, the toy data set is about the same number of entries, but only
> a single column, which captures some of the essence of my real data set
> but is a lot faster and easier to work with.
>
>> Note in the example below, processing large numbers in the integer type
>> creates an under/over flow error.
>
> Thanks for the examples. They really help.
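
The integer overflow mentioned above is easy to reproduce. A small sketch, not the example from the earlier message: R integers are 32-bit, so going past .Machine$integer.max yields NA with a warning, while the same arithmetic in double precision works.

```r
big <- .Machine$integer.max        # largest representable integer

# Integer arithmetic past the limit gives NA (normally with a warning,
# suppressed here so the script runs quietly).
overflowed <- suppressWarnings(big + 1L)
is.na(overflowed)

# Converting to double first avoids the overflow.
ok <- as.numeric(big) + 1
ok
```

This is one reason to check value ranges before forcing a column to "integer" through colClasses.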
>
> Here's a sample transcript from a bash shell under Linux comparing some
> timings using a vector within R:
>
> $ uname -sorv ; rpm -q R ; R --version
> Linux 2.6.16-1.2096_FC4smp #1 SMP Wed Apr 19 15:51:25 EDT 2006 GNU/Linux
> R-2.3.0-2.fc4
> R version 2.3.0 (2006-04-24)
> Copyright (C) 2006 R Development Core Team
>
> $ time -p cat dataset.010MM.txt > /dev/null
> real 0.04
> user 0.00
> sys 0.03
>
> $ time -p cat dataset.100MM.txt > /dev/null
> real 7.60
> user 0.06
> sys 0.67
>
> $ time -p wc -l dataset.100MM.txt
> 100000000 dataset.100MM.txt
> real 2.38
> user 1.92
> sys 0.44
>
> $ echo 'foov <- scan("dataset.010MM.txt") ; length(foov)' \
> | time -p R -q --no-save
>
> R > foov <- scan("dataset.010MM.txt") ; length(foov)
> Read 10000000 items
> [1] 10000000
>
> real 9.93
> user 9.41
> sys 0.52
>
> $ echo 'foov <- scan("dataset.100MM.txt") ; length(foov) ' \
> | time -p R -q --no-save
>
> R > foov <- scan("dataset.100MM.txt") ; length(foov)
> Read 100000000 items
> [1] 100000000
>
> real 92.27
> user 88.66
> sys 3.58
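
The shell timings above include R's startup cost. If only the read itself is of interest, it can presumably be timed from inside R with system.time(). A minimal sketch on a made-up file (the file name and size are assumptions; substitute the real data set):

```r
# Made-up single-column file standing in for dataset.010MM.txt.
tmp <- tempfile(fileext = ".txt")
n <- 100000L
writeLines(sprintf("%d", seq_len(n)), tmp)

# Time only the scan() call; "elapsed" corresponds to "real" in time -p.
timing <- system.time(foov <- scan(tmp, quiet = TRUE))

length(foov)
timing[["elapsed"]]
unlink(tmp)
```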
>
> Regards,
> - Robert
> http://www.cwelug.org/downloads
> Help others get OpenSource software. Distribute FLOSS
> for Windows, Linux, *BSD, and MacOS X with BitTorrent
>
>



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Wed May 10 04:36:47 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 10 May 2006 - 08:10:04 EST.
