Re: [R] Suggestion for big files [was: Re: A comment about R:]

From: hadley wickham
Date: Mon 09 Jan 2006 - 08:42:37 EST

> Thanks as well for these hints. Googling around as you suggested (yet
> keeping my eyes in the MySQL direction, because this is what we use),
> getting MySQL itself to do the selection is a bit discouraging, as
> according to comments I've read, MySQL does not seem to scale well with
> the database size, especially when records have to be decorated with
> random numbers and later sorted.

With SQL there is always a way to do what you want quickly, but you need to think carefully about which operations are most common in your database. For example, the problem is much easier if you can assume that the rows are numbered sequentially from 1 to n. This could be enforced using a trigger whenever a record is added or deleted. That would slow insertions and deletions but speed up selects.
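
To make the sequential-numbering idea concrete, here is a rough sketch in Python using the standard sqlite3 module rather than MySQL (that swap is an assumption made only so the example runs anywhere). The table name "records", the trigger name "fill_gap" and the helper sample_rows are hypothetical. The trigger fills the hole left by a single-row delete by moving the last row into it, so ids stay 1..n and a random sample is just k primary-key lookups.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")

# After a row is deleted, move the row with the largest id into the gap.
# If the deleted row was the last one, nothing needs to move.
# Sketch only: intended for single-row deletes.
conn.execute("""
    CREATE TRIGGER fill_gap AFTER DELETE ON records
    BEGIN
        UPDATE records SET id = OLD.id
        WHERE id = (SELECT MAX(id) FROM records)
          AND id > OLD.id;
    END
""")

# Inserts without an explicit id get MAX(id) + 1, so numbering stays dense.
conn.executemany(
    "INSERT INTO records (payload) VALUES (?)",
    [(f"row {i}",) for i in range(1, 11)],
)
conn.execute("DELETE FROM records WHERE id = 3")  # the row with id 10 becomes id 3


def sample_rows(conn, k, rng=random):
    """Fetch k distinct random rows, relying on ids being exactly 1..n."""
    n = conn.execute("SELECT MAX(id) FROM records").fetchone()[0]
    ids = rng.sample(range(1, n + 1), k)
    marks = ",".join("?" * len(ids))
    query = f"SELECT id, payload FROM records WHERE id IN ({marks})"
    return conn.execute(query, ids).fetchall()
```

The cost moves where the paragraph above says it would: every delete now pays for an extra update, while a sample never has to scan or sort the table.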

> Just for fun: here, "sample(100000000, 10)" in R is slowish already :-).

This is another example where greater knowledge of the problem can yield speed increases. Here (where the number of selections is much smaller than the total number of objects) you are better off generating 10 numbers with ceiling(runif(10, 0, 100000000)), checking that they are unique, and redrawing if they are not.
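
The draw-and-check idea, sketched in Python rather than R (sample_ids is a hypothetical helper name, not an existing function). It assumes k is much smaller than n, so a collision is rare and a full redraw costs almost nothing on average.

```python
import random


def sample_ids(n, k, rng=random):
    """Draw k distinct integers from 1..n by drawing with replacement
    and redrawing the whole batch if any value repeats.

    Only sensible when k is much smaller than n.
    """
    if k > n:
        raise ValueError("cannot draw more distinct values than exist")
    while True:
        draws = [rng.randint(1, n) for _ in range(k)]
        if len(set(draws)) == k:  # all unique, so we are done
            return draws


ids = sample_ids(100_000_000, 10)
```

With n = 100000000 and k = 10 the chance of any duplicate in one batch is well under one in a million, so the loop almost always runs once and never touches anything of size n.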

> >Another possibility is to generate a large table of randomly
> >distributed ids and then use that (with randomly generated limits) to
> >select the appropriate number of records.
> I'm not sure I understand your idea (what confuses me is the "randomly
> generated limits" part). If the "large table" is much larger than the
> size of the wanted sample, we might not be gaining much.

Think about using a table of random numbers. They are pregenerated for you; you just choose a starting and ending index. It will be slow to generate the table the first time, but after that it will be fast. It will also take up quite a bit of space, but space is cheap (and time is not!).
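
A small sketch of what I mean, again in Python with sqlite3 standing in for MySQL (an assumption for the sake of a runnable example). The table "shuffled" and the helper sample_from_table are hypothetical names. The table holds the record ids in a random order, generated once; a sample is then a contiguous slice between a random start and end position.

```python
import random
import sqlite3

rng = random.Random(2006)
n_records = 1000  # assumed number of records, ids 1..n_records

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shuffled (pos INTEGER PRIMARY KEY, record_id INTEGER)")

# One-off (slow) step: store a random permutation of the record ids.
perm = list(range(1, n_records + 1))
rng.shuffle(perm)
db.executemany(
    "INSERT INTO shuffled (pos, record_id) VALUES (?, ?)",
    enumerate(perm, start=1),
)


def sample_from_table(db, k, rng=random):
    """Return k record ids by reading a random contiguous slice of the table."""
    n = db.execute("SELECT MAX(pos) FROM shuffled").fetchone()[0]
    start = rng.randint(1, n - k + 1)  # randomly generated lower limit
    rows = db.execute(
        "SELECT record_id FROM shuffled WHERE pos BETWEEN ? AND ? ORDER BY pos",
        (start, start + k - 1),
    ).fetchall()
    return [r[0] for r in rows]


sample = sample_from_table(db, 10, rng)
```

One caveat: two samples taken from overlapping slices of the same table share records, so successive samples are not independent. If that matters, the table needs to be regenerated from time to time.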

Hadley