RE: [R] Generating a binomial random variable correlated with a

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Sun 17 Apr 2005 - 20:52:49 EST


On 16-Apr-05 Ashraf Chaudhary wrote:
> Ted:
> Thank you for your help. All I want is a binomial random
> variable that is correlated with a normal random variable
> with specified correlation. By linear I mean the ordinary
> Pearson correlation. I tried the following two methods,
> in each case the resulting correlation is substantially
> less than the one specified.
>
> Method I: Generate two correlated normals (using Cholesky
> decomposition method) and dichotomize one (0/1) to get the
> binomial. Method II: Generate two correlated variables, one
> binomial and one normal using the Cholesky decomposition methods.
>
> Here is how I did:
>
> X <- rnorm(100)
> Y <- rnorm(100)
> r<- 0.7
> Y1 <- X*r+Y*sqrt(1-r**2)
> cor(X,Y1) # Correlated normals using Cholesky decomposition
> cor(X>0.84,Y1) # Method I
>

>##

> X1 <- rbinom(100,1,0.5)
> Y2 <- X1*r+Y*sqrt(1-r**2)
> cor(X1,Y2); # Method II

Hello Ashraf,

The above more explicit explanation certainly helps to clarify your question! (I echo Bill's counsel to spell out essential detail).

I'll lead on from your "second method" above, which makes it explicit that, to each of your 100 values (Y) sampled from rnorm you are sampling from {0,1} to get one of your 100 binomial (n=1) variables (X1). However, in this second method you also create the derived variable Y2 = X1*r+Y*sqrt(1-r**2), and apparently you want this (Y2) to be the Normal variate which is correlated with X1.

Unfortunately, given the way Y2 is constructed, it does not have a Normal distribution. This is almost obvious from the fact that, conditional on the number of 1s in X1, the values of Y2 are a mixture of N(0,1-r^2) and N(r,1-r^2).

You can explore this in R by looking at the histogram of a much larger sample of Y2. Thus:

  r<- 0.7
  Y <- rnorm(100000)
  X1 <- rbinom(100000,1,0.5)
  Y2 <- X1*r+Y*sqrt(1-r**2)
  m<-mean(Y2) ; s<-sd(Y2)
  hist(Y2,breaks=0.1*(-45:45))
  lines(0.1*(-45:45),0.1*100000*dnorm(0.1*(-45:45),m,s))

Do it once, and the fit looks pretty good. However, if you repeat the above commands several times, you will observe that, near the peak of the curve, there is a definite tendency for the peak of the histogram to lie below the peak of the fitted curve. This illustrates the fact that you are looking at a superposition of two Normal curves, with identical (since you have p=0.5 in the Binomial sample) proportions and variances, but slightly shifted relative to each other (by an amount r which is in magnitude less than 1). This superposition is flatter at the top than any single Normal curve would be, which is being reflected in the "deficiency" of the histogram relative to the fitted curve

You can also see this theoretically, since your Y2 is

  Y2 = A*X1 + B*Y

so that

  P[Y2 <= y] = p*P[B*Y <= y] + (1-p)*P[B*Y <= y - A]

where p is the Binomial P[X1 = 1].

Hence the density function of Y2 is the derivative of this w.r.to y, namely

  p*f(y/B)/B + (1-p)*f((y-A)/B)/B

where f(x) is the density function of the standard N(0,1).

While in the case of your example (see histograms) the distribution of Y2 might be close enough to Normal for practical purposes (but this depends on your practical purpose), it is nonetheless better to try to avoid the theoretical difficulty if possible.

So I'd suggest experimenting on the following lines.

  1. Let X1 be a sample of size N using rbinom(N,1,p) (where, in general, p need not be 0.5)
  2. Let Y be a sample of size N using rnorm(N,mu,sigma) (and again, in general, mu need not be 0 nor sigma 1).

This is as in your example. So far, you have a true Binomial variate X1 and a true Normal variate Y.

3. X1 will have r 1s in it, and (N-r) 0s. Now consider

   setting up a correspondence between the 1s and 0s in    X1 and the values in Y, in such a way that the 1s    tend to get allocated to the higher values of Y and    the 0s to lower values of Y. The resulting set of    pairs (X1,Y) is then your correlated sample.

   In other words, permute the sample X1 and cbind the    result to Y.

This is similar to your "dichotomy" method (and would be almost identical if you simply allocated the r 1s to the r largest Ys), but is more flexible since I'm only saying "tend". In other words, consider sampling from the N Binomial 0s and 1s according to a non-uniform probability distribution.

The theoretical benefit from doing it this way (provided you can arrange the sampling in (3) so as to get your desired correlation) is that, by construction, the marginal distributions of X1 and Y are exactly as they were to start with, namely Binomial and Normal respectively.

Here is an example of a possible approach:

  X0 <- rbinom(100,1,0.5)
  Y <- rnorm(100) ; Y<-sort(Y)
  p0<-((1:100)-0.5)/100 ; p0<-p0/sum(p0); p<-cumsum(p0)   r<-sum(X0); ix<-sample((1:100),r,replace=FALSE,prob=p)   X1<-numeric(100); X1[ix]<-1;sum(X1)
  ## [1] 50
  cor(X1,Y)
  ## [1] 0.6150937

showing that it's quite easy to get close to your (example) correlation of 0..7 by simple means. (Note that X1, as constructed above, is equivalent to the original Binomial sample X0, since it has the same number of 0s and 1s; it's just a matter of rearanging these so as to tend to line up with the higher values of Y).

To refine the method (on this approach) play with the way that p is derived from p0 (the second line above). You might look at some of the suggestions in Bill Venables' (private) mail.

Even something as banal as

  X1 <- sort(rbinom(100,1,0.5))
  Y <- sort(rnorm(100))
  cor(X1,Y)
  ## [1] 0.768373

gives an interesting result, though this is far too inflexible for general purposes.

There may even be a theoretical way of arranging the 1s so as to get the desired correlation exactly on average -- my nose indicates that this may be so, but I have not followed it!

Best wishes,
Ted.



E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 17-Apr-05                                       Time: 11:52:49
------------------------------ XFMail ------------------------------

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Sun Apr 17 21:22:55 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:31:13 EST