Re: [R] Suggestion for big files [was: Re: A comment about R:]

From: François Pinard <>
Date: Mon 09 Jan 2006 - 05:47:07 EST

[hadley wickham]

>[François Pinard]

>> Selecting a sample is easy. Yet, I'm not aware of any SQL device for
>> easily selecting a _random_ sample of the records of a given table.
>> On the other hand, I'm no SQL specialist, others might know better.

>There are a number of such devices, which tend to be rather SQL variant
>specific. Try googling for select random rows mysql, select random
>rows pgsql, etc.

Thanks as well for these hints. Googling around as you suggested (though keeping my eyes in the MySQL direction, because that is what we use), I found that getting MySQL itself to do the selection looks a bit discouraging: according to comments I have read, the usual approaches do not seem to scale well with the size of the database, especially when every record has to be decorated with a random number and the whole table then sorted.

Yet I have not run any benchmarks myself, and would not blindly take everything I read for granted, given that the MySQL developers have speed in mind, and there are ways to interrupt a sort before it runs to full completion when only a few sorted records are wanted.
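To make that last remark concrete, here is a small sketch (in Python, for illustration only; the function name is mine) of how "decorate with random numbers, then sort" can be interrupted early: instead of fully sorting all N decorated rows, a bounded heap keeps only the k smallest random keys, which is essentially what a well-implemented `ORDER BY RAND() LIMIT k` could do.

```python
import heapq
import random

def sample_by_random_keys(rows, k, seed=None):
    """Pick k rows uniformly at random by giving each row a random key
    and keeping only the k smallest keys with a bounded heap, instead
    of sorting the whole decorated table.  (Hypothetical helper.)"""
    rng = random.Random(seed)
    # heapq.nsmallest scans the input once, maintaining a heap of size
    # k, so the "sort" is effectively interrupted once the k winners
    # are known: O(N log k) time, O(k) memory.
    return [row for _key, row in heapq.nsmallest(
        k, ((rng.random(), row) for row in rows))]

picked = sample_by_random_keys(range(1_000_000), 5)
```

Since every row gets an independent uniform key, each subset of k rows is equally likely to win, which is the uniformity one wants from a random sample.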

>Another possibility is to generate a large table of randomly
>distributed ids and then use that (with randomly generated limits) to
>select the appropriate number of records.

I'm not sure I understand your idea (the "randomly generated limits" part is what confuses me). If the "large table" is much larger than the size of the wanted sample, we might not be gaining much.

Just for fun: here, "sample(100000000, 10)" in R is slowish already :-).
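For comparison, when k is tiny relative to N there is no need to touch all N values at all: rejection sampling into a set draws k distinct ids in time that depends on k rather than N. A sketch (Python, function name mine) of that idea:

```python
import random

def sample_ids(n, k, seed=None):
    """Draw k distinct ids from 1..n by rejection sampling into a set.
    For k much smaller than n, collisions are rare, so the expected
    cost is O(k) regardless of how large n is.  (Illustrative sketch.)"""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < k:
        # Duplicates are simply absorbed by the set and redrawn.
        chosen.add(rng.randrange(1, n + 1))
    return sorted(chosen)

ids = sample_ids(100_000_000, 10)
```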

All in all, if I ever have such a problem, a practical solution probably has to be outside of R, and maybe outside SQL as well.
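One classic candidate for such an outside-of-SQL solution is reservoir sampling (Vitter's Algorithm R): a single sequential pass over the records, O(k) memory, and no need to know the total record count in advance. A minimal Python sketch:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: one sequential pass over a stream of records,
    keeping a reservoir of k items.  Every record ends up in the
    final sample with probability k/N, where N is the (possibly
    unknown) total number of records."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep record i with probability k/(i+1), evicting a
            # uniformly chosen current occupant.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = record
    return reservoir

sample = reservoir_sample(range(1_000_000), 10)
```

This fits the "dump the table once, sample as it streams by" style of solution and sidesteps both the full sort in SQL and the in-memory permutation in R.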

François Pinard

______________________________________________ mailing list
PLEASE do read the posting guide!
Received on Mon Jan 09 22:22:28 2006

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:41:57 EST