Re: [R] read large amount of data

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Tue 19 Jul 2005 - 01:53:56 EST

On Mon, 18 Jul 2005, Weiwei Shi wrote:

> Hi,
> I have a dataset of 2194651 x 135, in which all the values are 0, 1,
> or 2, and the file is bar-delimited.
>
> I used the following approach which can handle 100,000 lines:
> t <- scan('fv', sep = '|', nlines = 100000)  # 100000 lines x 135 fields
> t1 <- matrix(t, nrow = 135, ncol = 100000)   # filled column-major: one column per line
> t2 <- t(t1)                                  # transpose so each line becomes a row
> t3 <- as.data.frame(t2)
>
> I changed my plan to stratified sampling with replacement (column 2
> is my class variable: 1 or 2). The class distribution is:
> awk -F\| '{print $2}' fv | sort | uniq -c
> 2162792 1
> 31859 2
>
> Is it possible to use R to read the whole dataset and do the
> stratified sampling? Does that really depend on my memory size?

You may well not be able to read the whole data set into memory at once: 2194651 x 135 doubles at 8 bytes each come to about 2.4Gb, so it would take a bit more than 2Gb of memory even to store it.

You can use readLines to read it in chunks of, say, 10000 lines.
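Something along these lines (untested, and assuming the file name 'fv' and the 135 columns from your description) reads through a connection, so each readLines call continues where the previous one stopped:

    con <- file("fv", open = "r")
    repeat {
        lines <- readLines(con, n = 10000)
        if (length(lines) == 0) break        # end of file
        ## split each line on '|' and rebuild a numeric chunk
        chunk <- matrix(as.numeric(unlist(strsplit(lines, "|", fixed = TRUE))),
                        ncol = 135, byrow = TRUE)
        ## ... process 'chunk' here (e.g. the sampling step below) ...
    }
    close(con)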

To do stratified sampling I would suggest Bernoulli sampling of slightly more than you want. E.g., if you want 10000 from class 1, keeping each element independently with probability 10500/2162792 will get you roughly Poisson(10500) many elements, which will be more than 10000 elements with better than 99.999% probability. You can then choose 10000 at random from these. I can't think of an approach that is guaranteed to work in one pass over the data, but 99.999% is pretty close.
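As an untested sketch of that step (reusing the chunk loop above, with column 2 holding the class):

    p1 <- 10500/2162792    # keep probability, aimed slightly above 10000
    keep1 <- NULL
    ## inside the chunk loop:
    hit <- chunk[, 2] == 1 & runif(nrow(chunk)) < p1
    keep1 <- rbind(keep1, chunk[hit, , drop = FALSE])
    ## after the last chunk, keep1 has at least 10000 rows with better
    ## than 99.999% probability; thin it to exactly 10000 at random:
    keep1 <- keep1[sample(nrow(keep1), 10000), ]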

         -thomas



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Tue Jul 19 01:58:18 2005
