[Rd] predictable bit patterns in runif(n) shortly after set.seed

From: Petr Savicky <savicky_at_cs.cas.cz>
Date: Wed, 17 Oct 2007 18:11:16 +0200


Mersenne Twister generator is known to be sensitive to the algorithm used to generate its initial state. The initialization used in R generates the initial state in a way, which leaves linear dependencies mod 2 among the bits in the initial state. Since Mersenne Twister performs only operations, which are linear mod 2, these dependencies propagate to the output sequence.

An easy to see consequence of this may be demonstrated by the following script:

  pattern <- function(m=1400)
  {

      x <- runif(m)
      y <- matrix(nrow=m,ncol=32)
      for (j in 1:32) {
          y[,j] <- floor(2*x)
          x <- 2*x - y[,j]
      }
      u <- rep(0,times=32)
      u[c(3,7,10,14,21,25,32)] <- 1
      c(y %*% u) %% 2

}

  RNGkind("default") # or set.seed() with any seed   z <- pattern()
  abs(diff(z,lag=2)) # sequence with long constant subsequences

It should be pointed out that it is indeed a consequence of the initialization. If e.g. runif(10000) is run after RNGkind/set.seed, then the effect disappears.

Note that each row in matrix y used in function pattern() contains the bit representation of one of the numbers from runif(m). Different elements of z are derived from different rows in y and, hence, from different elements of runif(m). Consequently, they should mimic an i.i.d. sequence. However, abs(diff(z,lag=2)) allows to reject such an assumption easily.

The pattern is even better visible graphically, for example using   for (i in 1:20) {

      set.seed(i)
      z <- pattern()
      z <- abs(diff(z,lag=2))
      if (i == 1) {
          plot(cumsum(2*z-1),type="l")
      } else {
          lines(cumsum(2*z-1))
      }

}

The resulting curve is almost the same for all the seeds.

I have a working patch, which solves this problem by adding a new generator called Mersenne-Twister-52, which is the standard Mersenne Twister with the following modifications:

Combining the two changes together allows to keep the current Mersenne Twister implementation intact for backward compatibility and provides more reasons to add a new name than just a different seeding.

In my opinion, there may be applications, which can benefit from more then 32 random bits in the numbers from runif(n).

I would be pleased to send the patch to R-devel, if the proposed solution is of the sort, which could be considered.

Petr Savicky.



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Wed 17 Oct 2007 - 16:16:40 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 25 Oct 2007 - 11:37:11 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.