[Rd] Example of "task seeds" with R parallel. Critique?

From: Paul Johnson <pauljohn32_at_gmail.com>
Date: Fri, 13 Jan 2012 16:46:01 -0600


Greetings:

In R parallel's vignette, there is a comment "It would however take only slightly more work to allocate a stream to each task." (p.6). I've written down a working example that can allocate not just one, but several separate seeds for each task. (We have just a few projects here that need multiple streams.) I would like to help work that up for inclusion in the parallel package, if possible.

This approach is not original. It combines the idea behind snowFT with ideas for setting and monitoring seeds to replicate random streams from John Chambers's Software for Data Analysis, Section 6.10. I'm able to save a block of seeds in a file, run a simulation for each one, and then re-run any particular task and match the random numbers.

But I realize there are a lot of dangers I am not yet aware of, so I'm asking you: what can go wrong? I see danger in conflicts between my effort to manage seeds and the work of parallel functions that try to manage seeds for me. That's why I wish I could integrate task-based seeds into parallel itself.
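One concrete pitfall with hand-managed seeds is worth keeping in mind while reading the script: R's generators read .Random.seed from the global environment only, so a plain `<-` assignment inside a function creates a local variable and leaves the stream untouched. The sketch below is only an illustration; the helper names resetLocal and resetGlobal are hypothetical and appear nowhere in parallel.

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(23456)

seed0 <- .Random.seed   # remember the starting position
x0 <- rnorm(4)          # reference draws from that position

## Hypothetical helper: assigns a LOCAL .Random.seed, which the RNG ignores
resetLocal <- function(seed) {
  .Random.seed <- seed
  rnorm(4)
}

## Hypothetical helper: writes the seed where the RNG actually looks
resetGlobal <- function(seed) {
  assign(".Random.seed", seed, envir = .GlobalEnv)
  rnorm(4)
}

drawsLocal  <- resetLocal(seed0)    # continues the old stream
drawsGlobal <- resetGlobal(seed0)   # restarts at seed0

identical(drawsLocal, x0)
identical(drawsGlobal, x0)
```

This is why the worker-side code has to use assign(..., envir = .GlobalEnv) rather than a bare assignment.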

RNGkind("L'Ecuyer-CMRG")
set.seed(23456)

library(parallel)

## nReps = number of repetitions (or tasks)
## streamsPerRep = number of streams needed by each repetition
nReps <- 2000
streamsPerRep <- 2

## projSeeds=list of lists of stream seeds
projSeeds <- vector(mode="list", nReps)
for (i in 1:nReps) projSeeds[[i]] <- vector(mode="list", streamsPerRep)

runif(1) ##establishes .Random.seed
##Grab first seed

s <- .Random.seed
origSeed <- s

x <- rnorm(4) ##will compare later
x

for (i in 1:nReps) {
  for (j in 1:streamsPerRep){
    projSeeds[[i]][[j]] <- s
    s <- nextRNGStream(s)
  }
}

save(projSeeds, file="projSeeds.rda")

rm(projSeeds)

load("projSeeds.rda")

##Note that origSeed does match projSeeds
origSeed
projSeeds[[1]][[1]]

## And we get same random draws from project 1, stream 1
.Random.seed <- projSeeds[[1]][[1]]
rnorm(4)
x

##Another way (preferred?) to reset stream
assign(".Random.seed", projSeeds[[1]][[1]], envir = .GlobalEnv)
rnorm(4)

## Now, how to make this more systematic
## Each task needs streamsPerRep seeds
## startSeeds = for each stream, a starting seed
## currentSeeds = for each stream, a seed recording stream's current position
## currentStream = integer indicator of which stream is in use

## Test that interactively

currentStream <- 1
currentSeeds <- startSeeds <- projSeeds[[1]]
.Random.seed <- startSeeds[[currentStream]]

useStream <- function(n = NULL, origin = FALSE){
  if (n > length(currentSeeds)) stop("requested stream does not exist")
  ## record where the stream we are leaving stopped; <<- so the update
  ## reaches the global currentSeeds instead of a local copy
  currentSeeds[[currentStream]] <<- .Random.seed
  if (origin) assign(".Random.seed", startSeeds[[n]], envir = .GlobalEnv)
  else assign(".Random.seed", currentSeeds[[n]], envir = .GlobalEnv)
  currentStream <<- n
}

useStream(n=1, origin=TRUE)
rnorm(4)

currentStream

useStream(n=2, origin=TRUE)
rnorm(4)

currentStream

## Now, make that work in a clustered environment

cl <- makeCluster(9, "MPI")

## run on worker, so can retrieve seeds for particular run
initSeeds <- function(p = NULL){
  currentStream <<- 1
  projSeeds[[p]]
}

clusterEvalQ(cl, {
  RNGkind("L'Ecuyer-CMRG")
})

clusterExport(cl, c("projSeeds", "useStream", "initSeeds"))

someHorrendousFunction <- function(run, parm){
  currentStream <- 1
  currentSeeds <- startSeeds <- initSeeds(run)
  assign(".Random.seed", startSeeds[[currentStream]], envir = .GlobalEnv)

  ##then some gigantic, long lasting computation occurs
  dat <- data.frame(x1 = rnorm(parm$N), x2 = rnorm(parm$N), y = rnorm(parm$N))
  m1 <- lm(y ~ x1 + x2, data=dat)
  list(m1, summary(m1), model.matrix(m1))
}

whatever <- list("N" = 999)

res <- parLapply(cl, 1:nReps, someHorrendousFunction, parm = whatever)

res[[77]]

##Prove I can repeat the 77th task

res77 <- someHorrendousFunction(77, parm = whatever)

## well, that worked.

stopCluster(cl)
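
A smaller check of the same idea may be useful: if each task restores its own seed, the draws should not depend on how many workers there are or which worker picks up the task. The sketch below assumes the default socket cluster from makeCluster() rather than MPI, and the names taskSeeds and runTask are hypothetical stand-ins for projSeeds and someHorrendousFunction.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(23456)

## One seed per task, each a separate L'Ecuyer stream
nTasks <- 6
taskSeeds <- vector(mode = "list", nTasks)
s <- .Random.seed
for (i in seq_len(nTasks)) {
  taskSeeds[[i]] <- s
  s <- nextRNGStream(s)
}

## Hypothetical task: restore the task's seed in the global
## environment of whichever process runs it, then draw
runTask <- function(i, seeds) {
  RNGkind("L'Ecuyer-CMRG")
  assign(".Random.seed", seeds[[i]], envir = .GlobalEnv)
  rnorm(3)
}

## Same tasks on two differently sized clusters
cl2 <- makeCluster(2)
resA <- parLapply(cl2, seq_len(nTasks), runTask, seeds = taskSeeds)
stopCluster(cl2)

cl3 <- makeCluster(3)
resB <- parLapply(cl3, seq_len(nTasks), runTask, seeds = taskSeeds)
stopCluster(cl3)

## Re-run a single task on the master
rerun5 <- runTask(5, taskSeeds)

identical(resA, resB)
identical(resA[[5]], rerun5)
```

Both comparisons are expected to be TRUE when the task seed alone determines the draws. Note that this says nothing about what happens if clusterSetRNGStream() or another parallel helper resets the worker seeds in between, which is the conflict asked about above.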

-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Fri 13 Jan 2012 - 22:48:50 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 15 Jan 2012 - 07:40:09 GMT.
