Re: [Rd] portable parallel seeds project: request for critiques

From: Paul Johnson <pauljohn32_at_gmail.com>
Date: Fri, 17 Feb 2012 15:44:04 -0600

On Fri, Feb 17, 2012 at 3:23 PM, Paul Gilbert <pgilbert902_at_gmail.com> wrote:
> Paul
>
> I think (perhaps incorrectly) of the general problem being that one wants to
> run a random experiment, on a single node, or two nodes, or ten nodes, or
> any number of nodes, and reliably be able to reproduce the experiment
> without concern about how many nodes it runs on when you re-run it.
>
> From your description I don't have the impression your solution would do
> that. Am I misunderstanding?
>

Well, I think my approach does that! Each time a function runs, it grabs a pre-specified set of seed values and initializes the R .Random.seed appropriately.

Since I take the pre-specified seeds from the L'Ecuyer et al. approach (cited below), I believe
each separate stream is dependably uncorrelated and non-overlapping, both within a particular run and across runs.

> A second problem is that you want to use a proven algorithm for generating
> the numbers. This is implicitly solved by the above, because you always get
> the same result as you do on one node with a well proven RNG. If you
> generate a string of seed and then numbers from those, do you have a proven
> RNG?
>
Luckily, I think that part was solved by people other than me:

L'Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002) An object-oriented random-number package with many long streams and substreams. Operations Research 50 1073–5. http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf

> Paul
>
>
> On 12-02-17 03:57 PM, Paul Johnson wrote:
>>
>> I've got another edition of my simulation replication framework.  I'm
>> attaching 2 R files and pasting in the readme.
>>
>> I would especially like to know if I'm doing anything that breaks
>> .Random.seed or other things that R's parallel uses in the
>> environment.
>>
>> In case you don't want to wrestle with attachments, the same files are
>> online in our SVN
>>
>>
>> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/
>>
>>
>> ## Paul E. Johnson CRMDA<pauljohn_at_ku.edu>
>> ## Portable Parallel Seeds Project.
>> ## 2012-02-18
>>
>> Portable Parallel Seeds Project
>>
>> This is how I'm going to recommend we work with random number seeds in
>> simulations. It enhances work that requires runs with random numbers,
>> whether runs are in a cluster computing environment or in a single
>> workstation.
>>
>> It is a solution for two separate problems.
>>
>> Problem 1. I scripted up 1000 R runs and need high quality,
>> unique, replicable random streams for each one. Each simulation
>> runs separately, but I need to be confident their streams are
>> not correlated or overlapping. For replication, I need to be able to
>> select any run, say 667, and restart it exactly as it was.
>>
>> Problem 2. I've written a Parallel MPI (Message Passing Interface)
>> routine that launches 1000 runs and I need to assure each has
>> a unique, replicable, random stream. I need to be able to
>> select any run, say 667, and restart it exactly as it was.
>>
>> This project develops one approach to create replicable simulations.
>> It blends ideas about seed management from John M. Chambers's
>> Software for Data Analysis (2008) with ideas from the snowFT
>> package by Hana Sevcikova and A. J. Rossini.
>>
>>
>> Here's my proposal.
>>
>> 1. Run a preliminary program to generate an array of seeds
>>
>> run1:   seed1.1   seed1.2   seed1.3
>> run2:   seed2.1   seed2.2   seed2.3
>> run3:   seed3.1   seed3.2   seed3.3
>> ...      ...       ...
>> run1000   seed1000.1  seed1000.2   seed1000.3
>>
>> This example provides 3 separate streams of random numbers within each
>> run. Because we will use the L'Ecuyer "many separate streams"
>> approach, we are confident that there is no correlation or overlap
>> between any of the runs.
>>
>> The projSeeds object has to have one row per run, but it is not a huge
>> file. I created seeds for 2000 runs of a project that requires 2 seeds
>> per run.  The saved file is 104,443 bytes, which is very small. By
>> comparison, a 1400x1050 jpg image would usually be twice that size.
>> If you save 10,000 runs-worth of seeds, the size rises to 521,993 bytes,
>> still pretty small.
>>
>> Because the seeds are saved in a file, we are sure each
>> run can be replicated. We just have to teach each program
>> how to use the seeds. That is step two.
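A minimal sketch of what step 1 could look like, assuming each seed is a 7-integer L'Ecuyer-CMRG state and that successive seeds are obtained with parallel::nextRNGStream(). The helper name makeProjSeeds and the list-of-lists layout are illustrative assumptions, not the contents of seedCreator.R.

```r
library(parallel)

## Hypothetical helper (not the author's seedCreator.R): build a list with
## one element per run, each holding `streamsPerRun` L'Ecuyer-CMRG seeds.
## Note: this switches the session's RNG kind to "L'Ecuyer-CMRG".
makeProjSeeds <- function(nRuns, streamsPerRun, masterSeed = 234234) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(masterSeed)
  seed <- .Random.seed              # integer vector of length 7
  projSeeds <- vector("list", nRuns)
  for (run in seq_len(nRuns)) {
    streams <- vector("list", streamsPerRun)
    for (j in seq_len(streamsPerRun)) {
      streams[[j]] <- seed
      seed <- nextRNGStream(seed)   # jump ahead to the start of the next stream
    }
    projSeeds[[run]] <- streams
  }
  projSeeds
}

projSeeds <- makeProjSeeds(nRuns = 1000, streamsPerRun = 3)

## Saved to a temporary directory here so the sketch cannot overwrite
## a real "projSeeds.rda".
save(projSeeds, file = file.path(tempdir(), "projSeeds.rda"))
```

With this layout, projSeeds[[667]][[2]] would be the starting state of the second stream for run 667.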
>>
>>
>> 2. Inside each run, an initialization function runs that loads the
>> seeds file and takes the row of seeds that it needs.  As the
>> simulation progresses, the user can ask for random numbers from the
>> separate streams. When we need random draws from a particular stream,
>> we set the variable "currentStream" with the function useStream().
>>
>> The function initSeedStreams creates several objects in
>> the global environment. It sets the integer currentStream,
>> as well as two list objects, startSeeds and currentSeeds.
>> At the outset of the run, startSeeds and currentSeeds
>> are the same thing. When we change the currentStream
>> to a different stream, the currentSeeds vector is
>> updated to remember where that stream was when we stopped
>> drawing numbers from it.
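The bookkeeping described in step 2 can be sketched roughly as follows. The function and variable names (initSeedStreams, useStream, currentStream, startSeeds, currentSeeds) come from the description above, but the bodies are an assumed illustration rather than the code in controlledSeeds.R, and the small projSeeds table is a stand-in for the saved file.

```r
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

## Stand-in for the saved seed table: 5 runs with 3 streams each.
projSeeds <- vector("list", 5)
s <- .Random.seed
for (run in 1:5) {
  streams <- vector("list", 3)
  for (j in 1:3) {
    streams[[j]] <- s
    s <- nextRNGStream(s)
  }
  projSeeds[[run]] <- streams
}

## Assumed implementation: take this run's row of seeds, store it in the
## global environment, and start drawing from stream 1.
initSeedStreams <- function(run) {
  assign("startSeeds", projSeeds[[run]], envir = .GlobalEnv)
  assign("currentSeeds", projSeeds[[run]], envir = .GlobalEnv)
  assign("currentStream", 1L, envir = .GlobalEnv)
  assign(".Random.seed", projSeeds[[run]][[1L]], envir = .GlobalEnv)
  invisible(NULL)
}

## Assumed implementation: record where the stream being left stopped,
## then switch to stream n, optionally rewinding it to its starting point.
useStream <- function(n, origin = FALSE) {
  currentSeeds[[currentStream]] <<- get(".Random.seed", envir = .GlobalEnv)
  if (origin) currentSeeds[[n]] <<- startSeeds[[n]]
  assign(".Random.seed", currentSeeds[[n]], envir = .GlobalEnv)
  currentStream <<- n
  invisible(NULL)
}

initSeedStreams(3)
a1 <- runif(2)                # first draws from stream 1
useStream(2)
b1 <- runif(2)                # draws from stream 2
useStream(1)
a2 <- runif(2)                # stream 1 resumes where it stopped
useStream(1, origin = TRUE)
a3 <- runif(2)                # stream 1 rewound: repeats a1
stopifnot(identical(a1, a3))
```

Keeping startSeeds untouched is what makes the rewind possible; only currentSeeds changes as streams are switched.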
>>
>>
>> Now, for the proof of concept. A working example.
>>
>> Step 1. Create the Seeds. Review the R program
>>
>> seedCreator.R
>>
>> That creates the file "projSeeds.rda".
>>
>>
>> Step 2. Use one row of seeds per run.
>>
>> Please review "controlledSeeds.R" to see an example usage
>> that I've tested on a cluster.
>>
>> "controlledSeeds.R" can also be run on a single workstation for
>> testing purposes.  There is a variable "runningInMPI" which determines
>> whether the code is supposed to run on the RMPI cluster or just in a
>> single workstation.
>>
>>
>> The code for each run of the model begins by loading the
>> required libraries and loading the seed file, if it exists, or
>> generating a new "projSeeds" object if it is not found.
>>
>> library(parallel)
>> RNGkind("L'Ecuyer-CMRG")
>> set.seed(234234)
>> if (file.exists("projSeeds.rda")) {
>>   load("projSeeds.rda")
>> } else {
>>   source("seedCreator.R")
>> }
>>
>> ## Suppose the "run" number is:
>> run <- 232
>> initSeedStreams(run)
>>
>> After that, R's random generator functions will draw values
>> from the first random stream that was initialized
>> in projSeeds. When each repetition (run) occurs,
>> R looks up the right seed for that run, and uses it.
>>
>> If the user wants to begin drawing observations from the
>> second random stream, this command is used:
>>
>> useStream(2)
>>
>> If the user has drawn values from stream 1 already, but
>> wishes to begin again at the initial point in that stream,
>> use this command
>>
>> useStream(1, origin = TRUE)
>>
>>
>> Question: Why is this approach better for parallel runs?
>>
>> Answer: After a batch of simulations, we can re-start any
>> one of them and repeat it exactly. This builds on the idea
>> of the snowFT package, by Hana Sevcikova and A.J. Rossini.
>>
>> That is different from the default approach of most R parallel
>> designs, including R's own parallel, Rmpi and snow.
>>
>> The ordinary way of controlling seeds in R parallel would initialize
>> the 50 nodes, and we would lose control over seeds because runs would
>> be repeatedly assigned to nodes. The aim here is to make sure that
>> each particular run has a known starting point. After a batch of
>> 10,000 runs, we can look and say "something funny happened on run
>> 1,323" and then we can bring that back to life later, easily.
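To illustrate the replay idea, here is a small sketch in which every run sets its own seed before doing any work, so its result does not depend on the order in which runs are dispatched. The one-seed-per-run table runSeeds and the toy simulation oneRun are hypothetical simplifications; the shuffled loop only imitates runs being handed to nodes in an arbitrary order.

```r
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

## One seed per run (a simplified, hypothetical seed table).
nRuns <- 20
runSeeds <- vector("list", nRuns)
s <- .Random.seed
for (run in seq_len(nRuns)) {
  runSeeds[[run]] <- s
  s <- nextRNGStream(s)
}

## Hypothetical simulation: the result depends only on the run's own seed.
oneRun <- function(run) {
  assign(".Random.seed", runSeeds[[run]], envir = .GlobalEnv)
  mean(rnorm(100))
}

## Execute the batch in an arbitrary order, as a scheduler might.
shuffled <- sample(nRuns)
results <- numeric(nRuns)
results[shuffled] <- vapply(shuffled, oneRun, numeric(1))

## Later, bring a single run back to life on its own.
replay <- oneRun(13)
stopifnot(identical(replay, results[13]))
```

The same pattern should carry over to a real cluster, provided each worker has the seed table and uses the same RNG kind.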
>>
>>
>>
>> Question: Why is this better than the simple old approach of
>> setting the seeds within each run with a formula like
>>
>> set.seed(2345 + 10 * run)
>>
>> Answer: That does allow replication, but it does not assure
>> that each run uses non-overlapping random number streams. It
>> offers absolutely no assurance whatsoever that the runs are
>> actually non-redundant.
>>
>> Nevertheless, it is a method that is widely used and recommended
>> by some visible HOWTO guides.
>>
>>
>>
>> Citations
>>
>> Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant
>>  Simple Network of Workstations. R package version 1.2-0.
>>  http://CRAN.R-project.org/package=snowFT
>>
>> John M Chambers (2008). SoDA: Functions and Examples for "Software
>>   for Data Analysis". R package version 1.0-3.
>>
>> John M Chambers (2008) Software for Data Analysis. Springer.
>>
>>
>>
>>
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

Received on Fri 17 Feb 2012 - 21:46:17 GMT
