Re: [R] Suggestion for big files [was: Re: A comment about R:]

From: François Pinard <pinard_at_iro.umontreal.ca>
Date: Fri 06 Jan 2006 - 14:41:21 EST

[Brian Ripley]

>I rather thought that using a DBMS was standard practice in the
>R community for those using large datasets: it gets discussed rather
>often.

Indeed. (I tried RMySQL even before speaking of R to my co-workers.)

>Another possibility is to make use of the several DBMS interfaces already
>available for R. It is very easy to pull in a sample from one of those,
>and surely keeping such large data files as ASCII is not good practice.

Selecting a sample is easy. Yet I'm not aware of any SQL device for easily selecting a _random_ sample of the records of a given table. On the other hand, I'm no SQL specialist; others might know better.

We do not yet need samples where I work, but if we ever do, they will have to be random, or else I will always fear biases.

>One problem with Francois Pinard's suggestion (the credit has got lost)
>is that R's I/O is not line-oriented but stream-oriented. So selecting
>lines is not particularly easy in R.

I understand that you mean random access to lines, rather than random selection of lines. Once again, this discussion comes from reading about someone else's problem; it is not a problem I actually have. SPSS did not access lines randomly, since data files could well be held on magnetic tapes, where random access is not practical. SPSS reads (or used to read) lines sequentially from beginning to end, and the _random_ sample is built as the reading proceeds.

Suppose the file (or tape) holds N records (N is not known in advance), from which we want a sample of at most M records. If N <= M, we use the whole file; sampling is neither possible nor necessary. Otherwise, we first fill the sample with the first M records of the file. Then, for each record after the M'th, the algorithm decides whether the record just read is discarded or replaces one of the M records already saved and, in the latter case, which of those records is replaced. If the algorithm is carefully designed, then once the last (N'th) record has been processed this way, we have M records randomly selected from the N, in such a way that each of the N records had an equal probability of ending up in the selection. I can look up the details if needed.
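
For illustration only, the scheme described above matches what is usually called reservoir sampling. Below is a minimal sketch in Python, assuming the simplest variant: the k'th record (for k > M) is kept with probability M/k and, if kept, overwrites a uniformly chosen slot. The name `reservoir_sample` and its parameters are made up for this example; nothing here is claimed to be what SPSS actually does.

```python
import random


def reservoir_sample(records, m, rng=random):
    """Return up to m records drawn uniformly from an iterable.

    The iterable is read once, sequentially, and its length need not
    be known in advance (think of a tape or a stream of lines).
    """
    sample = []
    for count, record in enumerate(records, start=1):
        if count <= m:
            # The first m records fill the sample unconditionally.
            sample.append(record)
        else:
            # Draw a slot in 0 .. count-1. It falls below m with
            # probability m / count, in which case the new record
            # replaces the saved record in that slot.
            slot = rng.randrange(count)
            if slot < m:
                sample[slot] = record
    return sample
```

With a flat file, one might call it as `reservoir_sample(open("big.txt"), 1000)`, where `big.txt` is a hypothetical file name. Only M records are held in memory at any time, whatever the size of the input.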

This is my suggestion or, in fact, more a thought than a suggestion. It might be useful either for flat ASCII files or even for a stream of records coming out of a database, if databases indeed do not offer ready-made random sampling devices.

P.S. - In the (rather unlikely, I admit) case that the gang I'm part of came to have the need described above, and if I then dared to implement it myself, would it be welcome?

-- 
François Pinard   http://pinard.progiciels-bpi.ca

Received on Fri Jan 06 14:50:21 2006