Re: [Rd] Change in the RNG implementation?

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Sat, 20 Oct 2012 21:50:41 +0200

>>>>> Duncan Murdoch <murdoch.duncan_at_gmail.com>
>>>>>     on Fri, 19 Oct 2012 19:26:39 -0400 writes:

    > On 12-10-19 7:04 PM, Hervé Pagès wrote:

>> Hi,
>>
>> Looks like the implementation of random number generation changed in
>> R-devel with respect to R-2.15.1.
>>
>> With R-2.15.1:
>>
>> > set.seed(33)
>> > sample(49821115, 10)
>> [1] 22217252 19661919 24099911 45779422 42043111 25774933 21778053 17098516
>> [9]   773073  5878451
>>
>> With recent R-devel:
>>
>> > set.seed(33)
>> > sample(49821115, 10)
>> [1] 22217252 19661919 24099912 45779425 42043115 25774935 21778056 17098518
>> [9]   773073  5878452
>>
>> This is on a 64-bit Ubuntu system.
>>
>> Is this change intended? I didn't see anything in the NEWS file.
>>
>> A potential problem with this is that it will break unit tests
>> for algorithms that make use of RNG.
>>
>> Another more practical problem (at least for me) is the following:
>> Bioconductor package maintainers are sometimes working hard on the
>> development version of their package to improve the performance of
>> some key functions. Comparing performance between BioC release
>> (based on R-2.15) and devel (based on R-devel) often requires big
>> input data that is randomly generated, because it's easier than
>> working with real data. Typically a small script is written that
>> takes care of loading the required packages, generating the input
>> data, and running a simple analysis. The same script is sourced in
>> R-2.15 and R-devel, and performance and results are compared.
>>
>> Not being able to generate exactly the same input in the script is
>> a problem. It can be worked around by generating the input once,
>> serializing it, and using load() in the script, but that makes things
>> more complicated and the script is not a standalone script anymore
>> (cannot be passed around without also passing around the big .rda
>> file).
>>
>> Thanks,
>> H.
>>

    > I think it was mentioned in the NEWS:

    > \code{sample.int()} has some support for \eqn{n \ge
    > 2^{31}}{n >= 2^31}: see its help for the limitations.

    > A different algorithm is used for \code{(n, size, replace = FALSE,
    > prob = NULL)} for \code{n > 1e7} and \code{size <= n/2}.  This
    > is much faster and uses less memory, but does give different results.

So, to reiterate: the RNG has not been changed at all, but sample() has, for extreme cases (large n) like yours.
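
A minimal sketch of what this distinction allows for a standalone script, on the assumption that runif() yields the same stream in both versions (which follows from the RNG being unchanged, but is not verified here). The helper name sample_via_runif is hypothetical and not part of R; it draws distinct integers using runif() only, so the result does not depend on which algorithm sample() uses internally. It will not reproduce the output of either version of sample().

```r
# Hypothetical helper (not part of R): draw `size` distinct integers
# from 1..n using only runif(), bypassing sample() entirely.
sample_via_runif <- function(n, size) {
  stopifnot(size <= n)
  out <- numeric(0)
  while (length(out) < size) {
    # Draw only as many candidates as are still missing.
    cand <- floor(runif(size - length(out)) * n) + 1
    # unique() keeps first occurrences in order, dropping collisions.
    out <- unique(c(out, cand))
  }
  out
}

set.seed(33)
x <- sample_via_runif(49821115, 10)
```

The rejection loop is only efficient when size is small relative to n, as in the example above; for size close to n it would need many passes.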

    > I don't think the old algorithm is available, but perhaps it could be
    > made available by an optional parameter.

I do think we should ideally add such an option or, probably better, allow the more thorough approach of using RNGversion(..) or something similar to set sample()'s behavior to exactly what it was previously.
Doing this "globally" is really needed, as sample() may be called from a function (called from a function, called from a function) that is not in the programmer's hands, so the programmeR could not even set the new optional argument after finding out that it was necessary.

Honestly, I'm surprised Hervé found a real case where the difference is visible.

Martin

    > Duncan Murdoch



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sat 20 Oct 2012 - 19:53:28 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 22 Oct 2012 - 17:10:50 GMT.
