[Rd] portable parallel seeds project: request for critiques

From: Paul Johnson <pauljohn32_at_gmail.com>
Date: Fri, 17 Feb 2012 14:57:26 -0600


I've got another edition of my simulation replication framework. I'm attaching 2 R files and pasting in the readme.

I would especially like to know if I'm doing anything that breaks .Random.seed or other things that R's parallel uses in the environment.

In case you don't want to wrestle with attachments, the same files are online in our SVN

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/

## Paul E. Johnson CRMDA <pauljohn_at_ku.edu>
## Portable Parallel Seeds Project.
## 2012-02-18

Portable Parallel Seeds Project

This is how I'm going to recommend we work with random number seeds in simulations. It enhances work that requires runs with random numbers, whether runs are in a cluster computing environment or in a single workstation.

It is a solution for two separate problems.

Problem 1. I scripted up 1000 R runs and need high quality, unique, replicable random streams for each one. Each simulation runs separately, but I need to be confident their streams are not correlated or overlapping. For replication, I need to be able to select any run, say 667, and restart it exactly as it was.

Problem 2. I've written a Parallel MPI (Message Passing Interface) routine that launches 1000 runs and I need to ensure each has a unique, replicable random stream. I need to be able to select any run, say 667, and restart it exactly as it was.

This project develops one approach to create replicable simulations. It blends ideas about seed management from John M. Chambers's Software for Data Analysis (2008) with ideas from the snowFT package by Hana Sevcikova and A. J. Rossini.

Here's my proposal.

1. Run a preliminary program to generate an array of seeds:

run1:     seed1.1     seed1.2     seed1.3
run2:     seed2.1     seed2.2     seed2.3
run3:     seed3.1     seed3.2     seed3.3
...       ...         ...         ...
run1000:  seed1000.1  seed1000.2  seed1000.3

This example provides 3 separate streams of random numbers within each run. Because we will use the L'Ecuyer "many separate streams" approach, we are confident that there is no correlation or overlap between any of the runs.

The projSeeds object has to have one row per run, but it is not a huge file. I created seeds for 2000 runs of a project that requires 2 seeds per run. The saved size of the file is 104,443 bytes (about 102 KB), which is very small. By comparison, a 1400x1050 jpg image would usually be twice that size. If you save 10,000 runs' worth of seeds, the size rises to 521,993 bytes (about 510 KB), still pretty small.

Because the seeds are saved in a file, we are sure each run can be replicated. We just have to teach each program how to use the seeds. That is step two.
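As a rough illustration of step 1, here is a minimal sketch of how such a seed array could be built with nextRNGStream() from R's parallel package. It is not the code in seedCreator.R; the function name makeProjSeeds and its arguments are made up for this example, and the seeds are stored as a list of lists rather than whatever layout the real file uses.

```r
library(parallel)

## Hypothetical helper (not the author's seedCreator.R): returns a list with
## one element per run; each element is a list of streamsPerRun seed vectors.
## Side effect: switches the session's RNG kind to "L'Ecuyer-CMRG".
makeProjSeeds <- function(nRuns, streamsPerRun, init = 234234) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(init)
  s <- get(".Random.seed", envir = globalenv())
  projSeeds <- vector("list", nRuns)
  for (run in seq_len(nRuns)) {
    runSeeds <- vector("list", streamsPerRun)
    for (j in seq_len(streamsPerRun)) {
      runSeeds[[j]] <- s
      ## jump ahead to the start of the next L'Ecuyer stream
      s <- nextRNGStream(s)
    }
    projSeeds[[run]] <- runSeeds
  }
  projSeeds
}

projSeeds <- makeProjSeeds(nRuns = 1000, streamsPerRun = 3)
## To keep the seeds for later replication, one could then call
## save(projSeeds, file = "projSeeds.rda")
```

Each stored seed is an integer vector of length 7, so projSeeds[[667]][[2]] would be the starting point of the second stream of run 667 in this layout.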

2. Inside each run, an initialization function runs that loads the seeds file and takes the row of seeds that it needs. As the simulation progresses, the user can ask for random numbers from the separate streams. When we need random draws from a particular stream, we set the variable "currentStream" with the function useStream().

The function initSeedStreams creates several objects in the global environment. It sets the integer currentStream, as well as two list objects, startSeeds and currentSeeds. At the outset of the run, startSeeds and currentSeeds are the same thing. When we change the currentStream to a different stream, the currentSeeds list is updated to remember where that stream was when we stopped drawing numbers from it.
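To make that bookkeeping concrete, here is a small sketch of how initSeedStreams() and useStream() could behave. The function bodies below are my guess from the description above, not a copy of controlledSeeds.R, and the tiny two-run projSeeds built at the top is only a stand-in for the real seed file.

```r
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

## Stand-in seed collection (assumption): 2 runs with 3 streams each.
s <- .Random.seed
projSeeds <- list()
for (run in 1:2) {
  projSeeds[[run]] <- list()
  for (j in 1:3) {
    projSeeds[[run]][[j]] <- s
    s <- nextRNGStream(s)
  }
}

## Sketch only: copy this run's seeds into the global environment
## and start drawing from stream 1.
initSeedStreams <- function(run, seeds = projSeeds) {
  g <- globalenv()
  assign("startSeeds", seeds[[run]], envir = g)
  assign("currentSeeds", seeds[[run]], envir = g)
  assign("currentStream", 1L, envir = g)
  assign(".Random.seed", seeds[[run]][[1L]], envir = g)
  invisible(NULL)
}

## Sketch only: record where the current stream stopped, then switch.
useStream <- function(n, origin = FALSE) {
  g <- globalenv()
  seeds <- get("currentSeeds", envir = g)
  seeds[[get("currentStream", envir = g)]] <- get(".Random.seed", envir = g)
  if (origin) seeds[[n]] <- get("startSeeds", envir = g)[[n]]
  assign("currentSeeds", seeds, envir = g)
  assign("currentStream", as.integer(n), envir = g)
  assign(".Random.seed", seeds[[n]], envir = g)
  invisible(NULL)
}

initSeedStreams(2)
a  <- runif(3)               # first draws from stream 1
useStream(2)
b  <- runif(3)               # draws from stream 2
useStream(1)
a2 <- runif(3)               # stream 1 resumes where it stopped
useStream(1, origin = TRUE)
a3 <- runif(3)               # stream 1 restarted from its initial point
```

In this sketch a3 repeats a, while a2 holds the values that follow a in stream 1.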

Now, for the proof of concept. A working example.

Step 1. Create the Seeds. Review the R program

seedCreator.R

That creates the file "projSeeds.rda".

Step 2. Use one row of seeds per run.

Please review "controlledSeeds.R" to see an example usage that I've tested on a cluster.

"controlledSeeds.R" can also be run on a single workstation for testing purposes. There is a variable "runningInMPI" which determines whether the code is supposed to run on the Rmpi cluster or just on a single workstation.

The code for each run of the model begins by loading the required libraries and loading the seed file, if it exists, or generating a new "projSeeds" object if it is not found.

library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)
if (file.exists("projSeeds.rda")) {
  load("projSeeds.rda")
} else {
  source("seedCreator.R")
}

## Suppose the "run" number is:

run <- 232
initSeedStreams(run)

After that, R's random generator functions will draw values from the first random stream that was initialized in projSeeds. When each repetition (run) occurs, R looks up the right seed for that run, and uses it.

If the user wants to begin drawing observations from the second random stream, this command is used:

useStream(2)

If the user has drawn values from stream 1 already, but wishes to begin again at the initial point in that stream, this command is used:

useStream(1, origin = TRUE)

Question: Why is this approach better for parallel runs?

Answer: After a batch of simulations, we can re-start any one of them and repeat it exactly. This builds on the idea of the snowFT package, by Hana Sevcikova and A.J. Rossini.

That is different from the default approach of most R parallel designs, including R's own parallel, Rmpi and snow.

The ordinary way of controlling seeds in R's parallel would initialize the random streams once on each of, say, 50 nodes. We would then lose control over seeds, because runs are assigned to nodes over and over, and the starting point of any given run depends on which node it lands on and what ran there before it. The aim here is to make sure that each particular run has a known starting point. After a batch of 10,000 runs, we can look and say "something funny happened on run 1,323" and then we can bring that back to life later, easily.
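A small serial sketch of that point: if each run sets its own stored seed before doing any work, its result does not depend on the order in which runs are handed out. The function oneRun and the five-element seed list are invented for this illustration, and shuffling the order of a serial loop is only a stand-in for the scheduling that happens on a real cluster.

```r
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

## One seed per run (assumption: a single stream per run is enough here).
runSeeds <- vector("list", 5)
s <- .Random.seed
for (i in 1:5) {
  runSeeds[[i]] <- s
  s <- nextRNGStream(s)
}

## Hypothetical run: install the run's own seed, then simulate.
oneRun <- function(run, seeds) {
  assign(".Random.seed", seeds[[run]], envir = globalenv())
  mean(rnorm(100))
}

ord <- c(3, 5, 1, 4, 2)
inOrder  <- sapply(1:5, oneRun, seeds = runSeeds)
shuffled <- sapply(ord, oneRun, seeds = runSeeds)

## Replaying a single run later gives the same answer as in the batch.
replay3 <- oneRun(3, runSeeds)
```

Here shuffled matches inOrder[ord] and replay3 matches inOrder[3], because nothing a run draws depends on what was executed before it.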

Question: Why is this better than the simple old approach of setting the seeds within each run with a formula like

set.seed(2345 + 10 * run)

Answer: That does allow replication, but it offers no assurance that the runs use non-overlapping random number streams, so there is no guarantee that the runs are actually non-redundant.

Nevertheless, it is a method that is widely used and recommended by some visible HOWTO guides.

Citations

Hana Sevcikova and A. J. Rossini (2010). snowFT: Fault Tolerant Simple Network of Workstations. R package version 1.2-0. http://CRAN.R-project.org/package=snowFT

John M. Chambers (2008). SoDA: Functions and Examples for "Software for Data Analysis". R package version 1.0-3.

John M. Chambers (2008). Software for Data Analysis. Springer.

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

______________________________________________ R-devel_at_r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Fri 17 Feb 2012 - 20:59:56 GMT
