Re: [R] Suggestion for big files [was: Re: A comment about R:]

From: François Pinard <pinard_at_iro.umontreal.ca>
Date: Mon 09 Jan 2006 - 09:51:23 EST

[hadley wickham]

>> [...] according to the comments I've read, MySQL does not seem to
>> scale well with the database size, especially when records have to
>> be decorated with random numbers and later sorted.

>With SQL there is always a way to do what you want quickly, but you
>need to think carefully about which operations are most common in your
>database. For example, the problem is much easier if you can assume
>that the rows are numbered sequentially from 1 to n. This could be
>enforced using a trigger whenever a record is added/deleted. This would
>slow insertions/deletions but speed up selects.

Sure, to take a caricature example: if database records are already decorated with random numbers, and an index is built over that decoration, random sampling can indeed be done more quickly :-). The fact is that (at least our) databases are not especially designed for random sampling, and the people in charge would resist redesigning them merely to satisfy a few occasional needs for random sampling.

What would be ideal is being able to build random samples out of any big database or file with equal ease, and it is doable. (Brian Ripley points out that R's textual I/O has too much overhead to be usable for this, so one should rather say, sadly: "It's doable outside R".)

>> Just for fun: here, "sample(100000000, 10)" in R is slowish already
>> :-).

>This is another example where greater knowledge of the problem can
>yield speed increases. Here (where the number of selections is much
>smaller than the total number of objects) you are better off generating
>10 numbers with runif(10, 0, 100000000) and then checking that they are
>unique.
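
As a minimal sketch of that idea (not hadley's actual code, and the helper name is mine), drawing M distinct indices from 1..N and redrawing until there are no duplicates:

    ## For M much smaller than N duplicates are rare, so one or two
    ## passes normally suffice.
    sample_small <- function(N, M) {
        ids <- unique(ceiling(runif(M, 0, N)))
        while (length(ids) < M)
            ids <- unique(c(ids, ceiling(runif(M - length(ids), 0, N))))
        ids
    }

    sample_small(100000000, 10)   # ten distinct row numbers out of 1e8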

Of course, my remark about "sample()" is related to the previous discussion. If "sample(N, M)" were closer to O(M) than to O(N) (both memory-wise and CPU-wise), it could be used for preselecting which rows of a big database to include in a random sample, building on your idea of using a set of IDs. Since the sample of M records has to be processed in memory by R anyway, computing a vector of M indices does not (or should not) increase the overall complexity.
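
For instance, given such a vector of M line numbers, one sequential pass over a big text file is enough to pull out the matching records (a hypothetical sketch; the chunk size is arbitrary, and Brian's point about the overhead of R textual I/O still applies):

    ## Keep only the lines whose numbers appear in `ids', reading the
    ## file in chunks so that memory use stays bounded.
    sample_lines <- function(path, ids, chunk = 10000) {
        ids <- sort(ids)
        con <- file(path, open = "r")
        on.exit(close(con))
        kept <- character(0)
        seen <- 0
        repeat {
            lines <- readLines(con, n = chunk)
            if (length(lines) == 0 || length(kept) == length(ids))
                break
            hit <- ids[ids > seen & ids <= seen + length(lines)]
            kept <- c(kept, lines[hit - seen])
            seen <- seen + length(lines)
        }
        kept
    }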

However, "sample(N, M)" is likely less usable for randomly sampling a database, if it is O(N) to start with. About your suggestion of using "runif" and later checking uniqueness, "sample()" could well be implemented this way, when the arguments are proper. The "greater knowledge of the problem" could be built in right into the routine meant to solve it. "sample(N, M)" could even know how to take advantage of some simplified case of a "reservoir sampling" technique :-).

>> >[...] a large table of randomly distributed ids [...] (with randomly
>> >generated limits) to select the appropriate number of records.

>[...] a table of random numbers [...] pregenerated for you, you just
>choose a starting and ending index. It will be slow to generate the
>table the first time, but then it will be fast. It will also take up
>quite a bit of space, but space is cheap (and time is not!)

Thanks for the explanation.

In the case under consideration here (random sampling of a big file or database), I would guess that the time required for generating pseudo-random numbers is negligible compared to the overall input/output time, so pregenerating randomized IDs may not be worth the trouble, especially given that whenever the database size changes, the list of pregenerated IDs is no longer valid.
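
Still, for completeness, the pregeneration idea is easy enough to sketch (the file name and sizes are only illustrative, and the stored permutation has to be rebuilt whenever the number of records changes):

    ## Once: shuffle all record ids and store the permutation.
    ids <- sample(1000000)                  # slow the first time
    saveRDS(ids, "shuffled_ids.rds")

    ## Later: any contiguous slice of the stored permutation is a random
    ## subset, so a sample of 10 only needs a random starting offset.
    perm  <- readRDS("shuffled_ids.rds")
    start <- sample.int(length(perm) - 9, 1)
    perm[start:(start + 9)]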

-- 
François Pinard   http://pinard.progiciels-bpi.ca
