Re: [R] read large amount of data

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Tue 19 Jul 2005 - 02:34:50 EST

On Mon, 18 Jul 2005, Thomas Lumley wrote:

> On Mon, 18 Jul 2005, Weiwei Shi wrote:
>
>> Hi,
>> I have a dataset with 2194651x135, in which all the numbers are 0,1,2,
>> and is bar-delimited.
>>
>> I used the following approach which can handle 100,000 lines:
>> t<-scan('fv', sep='|', nlines=100000)
>> t1<-matrix(t, nrow=135, ncol=100000)
>> t2<-t(t1)
>> t3<-as.data.frame(t2)
>>
>> I changed my plan into using stratified sampling with replacement (col
>> 2 is my class variable: 1 or 2). The class distr is like:
>> awk -F\| '{print $2}' fv | sort | uniq -c
>> 2162792 1
>> 31859 2
>>
>> Is it possible to use R to read the whole dataset and do the
>> stratified sampling? Is it really dependent on my memory size?
>
> You may well not be able to read the whole data set into memory at once:
> it would take a bit more than 2Gb memory even to store it.

About 1.2G if stored as an integer (not double) vector.
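
A quick check of both figures, as a minimal sketch. It assumes only the dimensions given in the original post and R's usual storage sizes (4 bytes per integer, 8 per double); the variable names are illustrative.

```r
# Dimensions taken from the original post
n_rows <- 2194651
n_cols <- 135
n_vals <- n_rows * n_cols          # 296,277,885 values in total

bytes_integer <- n_vals * 4        # R integers occupy 4 bytes each
bytes_double  <- n_vals * 8        # R doubles occupy 8 bytes each

cat(sprintf("integer: %.2f GB\n", bytes_integer / 1e9))  # integer: 1.19 GB
cat(sprintf("double:  %.2f GB\n", bytes_double / 1e9))   # double:  2.37 GB
```

To get integer storage from scan() one would pass what = integer(), since the default is to read doubles. These totals cover the raw vector only; the later matrix, transpose and data frame steps would each need further copies.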

> You can use readLines to read it in chunks of, say, 10000 lines.
>
> To do stratified sampling I would suggest Bernoulli sampling of slightly
> more than you want. E.g. if you want 10000 from class 1, keeping each
> element with probability 10500/2162792 will get you Poisson(10500)
> elements, which will be more than 10000 elements with better than 99.999%
> probability. You can then choose 10000 at random from these. I can't think
> of an approach that is guaranteed to work in one pass over the data,
> but 99.999% is pretty close.
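
The chunked Bernoulli scheme quoted above might look roughly like the following. This is a sketch, not anyone's tested code: bernoulli_sample() and its arguments are hypothetical names, the class label is assumed to be in column 2 of a bar-delimited file, and the demonstration runs on a small synthetic file rather than the real data.

```r
# Hypothetical helper: one pass over a bar-delimited file, read in chunks.
# Keeps each row of the wanted class with probability n_expect / n_class,
# then thins the surplus down to exactly n_want rows.
bernoulli_sample <- function(path, class_value, n_want, n_expect, n_class,
                             chunk_lines = 10000L) {
  p <- min(1, n_expect / n_class)    # e.g. 10500 / 2162792
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break   # end of file
    # Column 2 holds the class label (assumption from the original post)
    cls <- vapply(strsplit(lines, "|", fixed = TRUE),
                  function(x) x[2], character(1))
    in_class <- lines[cls == class_value]
    kept <- c(kept, in_class[runif(length(in_class)) < p])
  }
  if (length(kept) < n_want)
    stop("too few rows kept; rerun with a larger n_expect")
  sample(kept, n_want)               # choose n_want at random from the kept rows
}

# Demonstration on synthetic data (not the poster's file)
set.seed(1)
tmp <- tempfile()
n_demo <- 5000
demo_cls <- sample(c(1, 2), n_demo, replace = TRUE, prob = c(0.9, 0.1))
writeLines(paste(seq_len(n_demo), demo_cls, 0, sep = "|"), tmp)

picked <- bernoulli_sample(tmp, "1", n_want = 100, n_expect = 200,
                           n_class = sum(demo_cls == 1), chunk_lines = 1000L)
length(picked)
```

The result is a character vector of raw lines; something like read.table(text = picked, sep = "|") would turn it into a data frame. Only the kept lines are ever held in memory, which is the point of reading in chunks.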

Reservoir sampling methods will work in one pass. See e.g. my 1987 book on
Stochastic Simulation. But Thomas' idea will be easier to implement in R,
and I would have chosen 20000 rather than 10500, to be sure of getting enough.
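
For comparison, a one-pass reservoir sample can be sketched as below. This is the textbook "Algorithm R" form and is not claimed to match the presentation in the book mentioned above; reservoir_sample() and its arguments are hypothetical names, column 2 is assumed to hold the class label, and the demonstration uses a small synthetic file.

```r
# Hypothetical helper: reservoir sampling (Algorithm R) of k rows from one class,
# in a single pass over a bar-delimited file read in chunks.
reservoir_sample <- function(path, class_value, k, chunk_lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  reservoir <- character(k)
  seen <- 0L                           # rows of the wanted class seen so far
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break     # end of file
    cls <- vapply(strsplit(lines, "|", fixed = TRUE),
                  function(x) x[2], character(1))
    for (line in lines[cls == class_value]) {
      seen <- seen + 1L
      if (seen <= k) {
        reservoir[seen] <- line        # fill the reservoir with the first k rows
      } else {
        j <- sample.int(seen, 1L)      # uniform on 1..seen
        if (j <= k) reservoir[j] <- line   # so the row enters with probability k / seen
      }
    }
  }
  if (seen < k) stop("fewer than k rows of this class in the file")
  reservoir
}

# Demonstration on synthetic data (not the poster's file)
set.seed(2)
tmp <- tempfile()
n_demo <- 5000
demo_cls <- sample(c(1, 2), n_demo, replace = TRUE, prob = c(0.9, 0.1))
writeLines(paste(seq_len(n_demo), demo_cls, 0, sep = "|"), tmp)

res <- reservoir_sample(tmp, "2", k = 50, chunk_lines = 1000L)
length(res)
```

Unlike the Bernoulli scheme, this always returns exactly k rows and needs no class counts in advance, at the cost of an explicit per-row loop, which is slow in R for a couple of million rows.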

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Tue Jul 19 02:38:52 2005
