Re: [Rd] Change in the RNG implementation?

From: Hervé Pagès <hpages_at_fhcrc.org>
Date: Sun, 21 Oct 2012 23:02:50 -0700

Hi Duncan, Martin,

Thanks for your answers.

For my real case I was generating millions of random positions on a genome.

I compared sample.int() performance (called via sample()) between R-2.15.1 and R-devel, and, for me, it performs better in R-2.15.1 (almost 3x faster, 5.2 s vs 14.9 s elapsed, and lower peak memory, 1122.6 Mb vs 1388.8 Mb max used for Vcells):

With R-2.15.1:

   > set.seed(33)

   > system.time(random_chrom_pos <- sample(199000666L, 95000777L))

      user  system elapsed
     4.964   0.268   5.242

   > gc()
              used  (Mb) gc trigger   (Mb)  max used   (Mb)
   Ncells   137285   7.4     350000   18.7    350000   18.7
   Vcells 47633785 363.5 154735917 1180.6 147135703 1122.6

   > sessionInfo()
   R version 2.15.1 (2012-06-22)
   Platform: x86_64-unknown-linux-gnu (64-bit)

   locale:

    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    [7] LC_PAPER=C                 LC_NAME=C
    [9] LC_ADDRESS=C               LC_TELEPHONE=C
   [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

   attached base packages:
   [1] stats graphics grDevices utils datasets methods base

With R-devel:

   > set.seed(33)

   > system.time(random_chrom_pos <- sample(199000666L, 95000777L))

      user  system elapsed
    14.532   0.296  14.854

   > gc()

              used  (Mb) gc trigger   (Mb)  max used   (Mb)
   Ncells   145525   7.8     350000   18.7    350000   18.7
   Vcells 47644082 363.5 152959996 1167.0 182023372 1388.8

   > sessionInfo()
   R Under development (unstable) (2012-10-02 r60861)
   Platform: x86_64-unknown-linux-gnu (64-bit)

   locale:

    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    [7] LC_PAPER=C                 LC_NAME=C
    [9] LC_ADDRESS=C               LC_TELEPHONE=C
   [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

   attached base packages:
   [1] stats graphics grDevices utils datasets methods base

FWIW my R-2.15.1 and R-devel were both configured with --disable-byte-compiled-packages; otherwise I use all the defaults. My system is a standard Ubuntu 12.04 installation with no special settings or customizations.
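
For anyone who wants to repeat the comparison, here is a minimal sketch of how the timing and peak memory shown above could be collected in one call. The helper name time_sample is hypothetical (it is not part of R), and the sizes in the usage line are scaled down so that it runs quickly; the real run used n = 199000666L and size = 95000777L.

```r
# Hypothetical helper (not part of R): time one sample.int() call and
# report the peak vector-heap usage seen while it ran.
time_sample <- function(n, size, seed = 33) {
  set.seed(seed)
  invisible(gc(reset = TRUE))            # reset the "max used" counters
  elapsed <- system.time(x <- sample.int(n, size))[["elapsed"]]
  mem <- gc()                            # matrix with rows Ncells / Vcells
  list(elapsed       = elapsed,
       # the last column of gc()'s result is "max used" expressed in Mb
       max_vcells_mb = mem["Vcells", ncol(mem)],
       n_unique      = length(unique(x)))
}

# Scaled-down run; substitute the real sizes to reproduce the comparison.
res <- time_sample(1000000L, 100000L)
str(res)
```

Sourcing the same script in both R versions and comparing the two results should give figures comparable to the transcripts above, although absolute numbers will depend on the machine.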

Thanks,
H.

On 10/20/2012 12:50 PM, Martin Maechler wrote:
>>>>>> Duncan Murdoch <murdoch.duncan_at_gmail.com>
>>>>>> on Fri, 19 Oct 2012 19:26:39 -0400 writes:
>
> > On 12-10-19 7:04 PM, Hervé Pagès wrote:
> >> Hi,
> >>
> >> Looks like the implementation of random number generation changed in
> >> R-devel with respect to R-2.15.1.
> >>
> >> With R-2.15.1:
> >>
> >> > set.seed(33)
> >> > sample(49821115, 10)
> >> [1] 22217252 19661919 24099911 45779422 42043111 25774933 21778053 17098516
> >> [9] 773073 5878451
> >>
> >> With recent R-devel:
> >>
> >> > set.seed(33)
> >> > sample(49821115, 10)
> >> [1] 22217252 19661919 24099912 45779425 42043115 25774935 21778056 17098518
> >> [9] 773073 5878452
> >>
> >> This is on a 64-bit Ubuntu system.
> >>
> >> Is this change intended? I didn't see anything in the NEWS file.
> >>
> >> A potential problem with this is that it will break unit tests
> >> for algorithms that make use of RNG.
> >>
> >> Another more practical problem (at least for me) is the following:
> >> Bioconductor package maintainers are sometimes working hard on the
> >> development version of their package to improve the performance of
> >> some key functions. Comparing performance between BioC release
> >> (based on R-2.15) and devel (based on R-devel) often requires big
> >> input data that is randomly generated, because it's easier than
> >> working with real data. Typically a small script is written that
> >> takes care of loading the required packages, generating the input
> >> data, and running a simple analysis. The same script is sourced in
> >> R-2.15 and R-devel, and performance and results are compared.
> >>
> >> Not being able to generate exactly the same input in the script is
> >> a problem. It can be worked around by generating the input once,
> >> serializing it, and using load() in the script, but that makes things
> >> more complicated and the script is not a standalone script anymore
> >> (cannot be passed around without also passing around the big .rda
> >> file).
> >>
> >> Thanks,
> >> H.
> >>
>
> > I think it was mentioned in the NEWS:
>
> > \code{sample.int()} has some support for \eqn{n \ge
> > 2^{31}}{n >= 2^31}: see its help for the limitations.
>
> > A different algorithm is used for \code{(n, size, replace = FALSE,
> > prob = NULL)} for \code{n > 1e7} and \code{size <= n/2}. This
> > is much faster and uses less memory, but does give different results.
>
> So, to reiterate: the RNG has not been changed at all,
> but sample() has, for extreme cases (large n) like yours.
>
> > I don't think the old algorithm is available, but perhaps it could be
> > made available by an optional parameter.
>
> I do think we should ideally add such an option or probably
> rather allow the more thorough way of either using
> RNGversion(..) or something similar to set sample()'s behavior
> to exactly as previously.
> Doing this "globally" is really needed, as sample() may be called from a
> function (from a function from a function) that is not in the
> programmer's hand, and so the programmeR could not even
> set the new optional argument if he found out that he had to.
>
> Honestly, I'm surprised Hervé found a real case where the
> difference is visible.

>
> Martin
>
>
> > Duncan Murdoch
>
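
One possible way to keep a benchmark script standalone, without shipping a big .rda file, is to avoid sample() for input generation and build the sample from runif() directly. The sketch below assumes that runif() with the default generator yields the same stream in both R versions, which is what the remark above that the RNG itself is unchanged suggests. The helper name portable_sample is hypothetical, and the code is not tuned for speed or for n close to 2^31.

```r
# Hypothetical helper (not part of R): draw `size` distinct integers from
# 1..n using only runif(), so the result does not depend on whichever
# algorithm sample() uses internally.
portable_sample <- function(n, size) {
  stopifnot(size <= n)
  out <- integer(0)
  while (length(out) < size) {
    need <- size - length(out)
    # Oversample a little, map the uniforms to 1..n, and drop repeats.
    # unique() keeps first occurrences, so the order is deterministic.
    cand <- as.integer(floor(runif(need + 16L) * n)) + 1L
    out <- unique(c(out, cand))
  }
  out[seq_len(size)]
}

set.seed(33)
a <- portable_sample(49821115, 10)
set.seed(33)
b <- portable_sample(49821115, 10)
identical(a, b)   # same seed, same draws
```

This works best when size is small relative to n; when size approaches n, many candidates are rejected as duplicates and the loop needs more passes.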

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages_at_fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Mon 22 Oct 2012 - 06:05:35 GMT
