Re: [Rd] portable parallel seeds project: request for critiques

From: Paul Johnson <pauljohn32_at_gmail.com>
Date: Wed, 22 Feb 2012 12:17:25 -0600

Greetings. Answers below.

On Tue, Feb 21, 2012 at 7:04 AM, Petr Savicky <savicky_at_cs.cas.cz> wrote:

>
> Hi.
>
> In the description of your project in the file
>
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/README
>
> you argue as follows
>
> Question: Why is this better than the simple old approach of
> setting the seeds within each run with a formula like
>
> set.seed(2345 + 10 * run)
>
> Answer: That does allow replication, but it does not assure
> that each run uses non-overlapping random number streams. It
> offers absolutely no assurance whatsoever that the runs are
> actually non-redundant.
>
> The following demonstrates that the function set.seed() for
> the default generator indeed allows to have correlated streams.
>
> step <- function(x)
> {
> x[x < 0] <- x[x < 0] + 2^32
> x <- (69069 * x + 1) %% 2^32
> x[x > 2^31] <- x[x > 2^31] - 2^32
> x
> }
>
> n <- 1000
> seed1 <- 124370417 # any seed
> seed2 <- step(seed1)
>
> set.seed(seed1)
> x <- runif(n)
> set.seed(seed2)
> y <- runif(n)
>
> rbind(seed1, seed2)
> table(x[-1] == y[-n])
>
> The output is
>
> [,1]
> seed1 124370417
> seed2 205739774
>
> FALSE TRUE
> 5 994
>
> This means that if the streams x, y are generated from the two
> seeds above, then y is almost exactly equal to x shifted by 1.
>
> What is the current state of your project?
>
> Petr Savicky.

Thanks for the example. That is exactly the point I've been making about the old, ordinary way of setting seeds.

I continue testing on my plan, I believe it does work. The example implementation is successful. I have not received any critiques that make me think it is theoretically wrong, but I also have not received feedback from any body who has very detailed knowledge of R's parallel package to tell me that my approach is good.

In order for this to be easy for users, I need to put the init streams and set current stream functions into a package, and then streamline the process of creating the seed array. My opinion is that CRAN is now overflowed with too many "one function" packages, I don't want to create another one just for these two little functions, and I may roll it into my general purpose regression package called "rockchalk".

If I worked in a company and had money riding on this, I'd need to hire someone from R Core to stare at the way I'm setting, saving, and re-setting the .Random.seed variable to be completely sure there are not complications. Because parallel package is so new, there is relatively little documentation on how it is supposed to work. I'm just gazing at the source code and trying not to break anything it does, and trying to stop it from breaking anything I do.

One technical issue that has been raised to me is that R parallel's implementation of the L'Ecuyer generator is based on integer valued variables, whereas the original L'Ecuyer code uses double real variables. But I'm trusting the R Core on this, if they code the generator in a way that is good enough for R itself, it is good enough for me. (And if that's wrong, we'll all find out together :) ).

Josef Leydold (the rstream package author) has argued that R's implementation runs more slowly than it ought to. We had some correspondence and I tracked a few threads in forums. It appears the approach suggested there is roadblocked by some characteristics deep down in R and the way random streams are managed. Packages have only a limited, tenuous method to replace R's generators with their own generators. Here is the comment that L'Ecuyer and Leydold make in the document that is distributed with rstream.. "rstream: Streams of Random Numbers for Stochastic Simulation"

"There is one drawback in the implementation. R uses a static table to implement access to different
kinds of random number generators. Only one slot (RNGkind(kind="user-supplied")) has been provided for users who want to add her own generator. This slot relies on a pointer to a function
called user_unif_rand. Thus rstream has to abuse this slot to add additional random number
generators. However, there are other packages on CRAN that use the same trick (randaes, rlecuyer,
rsprng, SuppDists) or load one of these packages (Rmpi, rpvm, rpart, snow, varSelRF). Thus if a
user loads two of these packages or adds her own uniform random number generator, the behavior
of at least one of the packages is broken. This problem could be fixed by some changes in the R
core system, either by adding a package variableto RNGkind(kind="user-supplied") similar to the .Call function, or by replacing the static table by a dynamic one.

I interpret all of this to mean that, although other approaches may work as well or better,
the only approach that can "defend itself " in the ecology of a running R program is an approach
that fiddles with R's own generators, rather than tries to replace them. And that's what I'm doing.

pj

ps

Just this moment, while googling for Leydold's comments, I found a post I had not seen before. This is neat.

Parallel Random Number Generation in C++ and R Using RngStream Andrew Karl Randy Eubank Dennis Young http://math.la.asu.edu/~eubank/webpage/rngStreamPaper.pdf

I think that faces the same problem L'Ecuyer and Leydold described. The package may hang its generator on one pointer inside the R random scheme, but there's no way to be sure it stays there.

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 22 Feb 2012 - 18:19:53 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 23 Feb 2012 - 05:50:19 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive