Re: [Rd] efficiency of sample() with prob.

From: Bo Peng <ben.bob_at_gmail.com>
Date: Thu 23 Jun 2005 - 21:37:35 GMT

> We suggest that you take up your own suggestion, research this area and
> offer the R project the results of your research for consideration as your
> contribution.

I implemented Walker's alias method and re-compiled R. Here is what I did:

  1. replace function ProcSampleReplace in R-2.1.0/src/main/random.c with the following one

static void ProbSampleReplace(int n, double *p, int *perm, int nans, int *ans) {

    /* allocate memory for a, p and HL */     double * q = Calloc(n, double);

    int * a = Calloc(n, int);
    int * HL = Calloc(n, int); 
    int * H = HL;       
    int * L = HL+n-1;      

    int i, j, k;
    double rU; /* U[0,1)*n */

    /* set up alias table */
    /* initialize q with n*p0,...n*p_n-1 */     for(i=0; i<n; ++i)

        q[i] = p[i]*n;

    /* initialize a with indices */
    for(i=0; i<n; ++i)

        a[i] = i;

    /* set up H and L */
    for(i=0; i<n; ++i) {

        if( q[i] >= 1.)
            *H++ = i;
        else
            *L-- = i;

    }

    while( H != HL && L != HL+n-1) {

        j = *(L+1);
        k = *(H-1);
        a[j] = k;
        q[k] += q[j] - 1;
        L++;                                  /* remove j from L */
        if( q[k] < 1. ) {
            *L-- = k;                         /* add k to L */
            --H;                              /* remove k */
        }

    }

    /* generate sample */
    for (i = 0; i < nans; ++i) {

        rU = unif_rand() * n;

        k = (int)(rU);
        rU -= k;  /* rU becomes rU-[rU] */

        if( rU < q[k] )
            ans[i] = k+1;
        else
            ans[i] = a[k]+1;

    }
    Free(HL);
    Free(a);
    Free(q); 

}

2. make and make install

3. test the new sample function by code like

> b=sample(seq(1,100), prob=seq(1,100), replace=TRUE, size=1000000)
> table(b)/1000000*sum(seq(1,100))

4. run the following code in current R 2.1.0 and updated R.

for(prob in seq(1,4)){
  for(sample in seq(1,4)){
    x = seq(1:(10^prob)) # short to long x     p = abs(rnorm(length(x))) # prob vector     times = 10^(6-prob) # run shorter cases more times     Rprof(paste("sample_", prob, "_", sample, ".prof", sep=''))     for(t in seq(1,times)){
      sample(x, prob=p, size=10^sample, replace=TRUE )     }
    Rprof(NULL)
  }
}

Basically, I tried to test the performance of sample(replace=TRUE, prob=..) with different length of x and size.

5. process the profiles and here is the result. p: length of prob and x
size: size of sample
cell: execution time of old/updated sample()

  size\p    10          10^2        10^3       10^4
  10       2.4/1.6      0.32/0.22   0.20/0.08  0.24/0.06  
  10^2     3.1/2.6      0.48/0.28   0.28/0.06  0.30/0.06
  10^3     11.8/11.1    1.84/1.14   0.94/0.18  0.96/0.08
  10^4     96.8/96.6    15.34/9.68  7.54/1.06  7.48/0.16
  run:     10000        1000        100        10 times

We can see that the alias method is quicker than the linear search method in all cases. The performance difference is greatest (>50 times) when the original algorithm need to search in a long prob vector.

I have not thoroughly tested the new function. I will do so if you (the developers) think that this has the potential to be incorporated into R.

Thanks.

Bo Peng
Department of Statistics
Rice University



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Fri Jun 24 07:42:08 2005

This archive was generated by hypermail 2.1.8 : Mon 20 Feb 2006 - 03:21:09 GMT