Re: [R] Suggestion for big files [was: Re: A comment about R:]

From: <r.ghezzo_at_staff.mcgill.ca>
Date: Tue 10 Jan 2006 - 00:44:19 EST

I found Reservoir-Sampling algorithms of time complexity O(n(1+log(N/n))) by Kim-Hung Li , ACM Transactions on Mathematical Software Vol 20 No 4 Dec 94 p481-492.
He mentions algorithm Z and K and proposed 2 improved versions alg L and M. Algorith L is really easy to implement but relatively slow, M doesn't look very difficult and is the fastest.
Heberto Ghezzo
McGill University
Montreal - Canada

Quoting François Pinard <pinard@iro.umontreal.ca>:

> [Martin Maechler]
>
> > FrPi> Suppose the file (or tape) holds N records (N is not known
> > FrPi> in advance), from which we want a sample of M records at
> > FrPi> most. [...] If the algorithm is carefully designed, when
> > FrPi> the last (N'th) record of the file will have been processed
> > FrPi> this way, we may then have M records randomly selected from
> > FrPi> N records, in such a a way that each of the N records had an
> > FrPi> equal probability to end up in the selection of M records. I
> > FrPi> may seek out for details if needed.
>
> >[...] I'm also intrigued about the details of the algorithm you
> >outline above.
>
> I went into my old SPSS books and related references to find it for you,
> to no avail (yet I confess I did not try very hard). I vaguely remember
> it was related to Spearman's correlation computation: I did find notes
> about the "severe memory limitation" of this computation, but nothing
> about the implemented workaround. I did find other sampling devices,
> but not the very one I remember having read about, many years ago.
>
> On the other hand, Googling tells that this topic has been much studied,
> and that Vitter's algorithm Z seems to be popular nowadays (even if not
> the simplest) because it is more efficient than others. Google found
> a copy of the paper:
>
> http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf
>
> Here is an implementation for Postgres:
>
> http://svr5.postgresql.org/pgsql-patches/2004-05/msg00319.php
>
> yet I do not find it very readable -- but this is only an opinion: I'm
> rather demanding in the area of legibility, while many or most people
> are more courageous than me! :-).
>
> --
> François Pinard http://pinard.progiciels-bpi.ca
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Tue Jan 10 00:52:54 2006

This archive was generated by hypermail 2.1.8 : Tue 10 Jan 2006 - 02:21:22 EST