Re: [Rd] efficiency of sample() with prob.

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon 29 Aug 2005 - 18:17:03 GMT

Peter Windridge and I investigated this further. It seems the distribution used for your tests is maximally favourable to your proposal (not uncommon in papers, but not very honest). There are examples in which the existing method is substantially faster than Walker's (when each set of prob is used only a few times and there are many sets) and only a few areas in which there are large gains to be had. For example, with binomial(N=10000, p=0.25) Walker's method is not twice as fast, and 10 million samples only take a second or two. (Tabulating them takes quite a bit longer.) OTOH, with a distribution concentrated on 5 out of 10000 points, the existing method is twice as fast.

Changing how this is done will break the reproducibility of past programs, and we don't really want to introduce yet more options. So it seems only worth doing when there are substantial speed gains. I am still tuning this, but it seems that it is worthwhile if the number of probs >= 0.1/N is more than 200 or so. The advantage of Walker's method is that its speed is more-or-less independent of probs (unless they are actually all identical!), but equally that can be a disadvantage.

On Thu, 23 Jun 2005, Bo Peng wrote:

>> We suggest that you take up your own suggestion, research this area and
>> offer the R project the results of your research for consideration as your
>> contribution.
>
> I implemented Walker's alias method and re-compiled R. Here is what
> I did:

[...]

>
>> b=sample(seq(1,100), prob=seq(1,100), replace=TRUE, size=1000000)
>> table(b)/1000000*sum(seq(1,100))
>
> 4. run the following code in current R 2.1.0 and updated R.
>
> for(prob in seq(1,4)){
> for(sample in seq(1,4)){
> x = seq(1:(10^prob)) # short to long x
> p = abs(rnorm(length(x))) # prob vector
> times = 10^(6-prob) # run shorter cases more times
> Rprof(paste("sample_", prob, "_", sample, ".prof", sep=''))
> for(t in seq(1,times)){
> sample(x, prob=p, size=10^sample, replace=TRUE )
> }
> Rprof(NULL)
> }
> }
>
> Basically, I tried to test the performance of sample(replace=TRUE, prob=..)
> with different length of x and size.
>
> 5. process the profiles and here is the result.
> p: length of prob and x
> size: size of sample
> cell: execution time of old/updated sample()
>
> size\p 10 10^2 10^3 10^4
> 10 2.4/1.6 0.32/0.22 0.20/0.08 0.24/0.06
> 10^2 3.1/2.6 0.48/0.28 0.28/0.06 0.30/0.06
> 10^3 11.8/11.1 1.84/1.14 0.94/0.18 0.96/0.08
> 10^4 96.8/96.6 15.34/9.68 7.54/1.06 7.48/0.16
> run: 10000 1000 100 10 times
>
> We can see that the alias method is quicker than the linear search
> method in all cases. The performance difference is greatest (>50 times)
> when the original algorithm need to search in a long prob vector.
>
> I have not thoroughly tested the new function. I will do so if you
> (the developers) think that this has the potential to be incorporated
> into R.
>
> Thanks.
>
> Bo Peng
> Department of Statistics
> Rice University
>
>

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue Aug 30 04:21:40 2005

This archive was generated by hypermail 2.1.8 : Mon 20 Feb 2006 - 03:21:19 GMT