Re: [R] Sampling

From: Tim Hesterberg <timh_at_insightful.com>
Date: Thu, 07 Feb 2008 11:25:48 -0800

Thomas Lumley wrote:
>On Wed, 6 Feb 2008, Tim Hesterberg wrote:
>
>>> Tim Hesterberg wrote:
>>>> I'll raise a related issue - sampling with unequal probabilities,
>>>> without replacement. R does the wrong thing, in my opinion:
>>>> ...
>>> Peter Dalgaard wrote:
>>> But is that the right thing? ...
>> (See bottom for more of the previous messages.)
>>
>>
>> First, consider the common case, where size * max(prob) < 1 --
>> sampling with unequal probabilities without replacement.
>>
>> Why do people do sampling with unequal probabilities, without
>> replacement? A typical application would be sampling with probability
>> proportional to size, or more generally where the desire is that
>> selection probabilities match some criterion.
>
>In real survey PPS sampling it also matters what the pairwise joint
>selection probabilities are -- and there are *many* algorithms, with
>different properties. Yves Till'e has written an R package that implements
>some of them, and the pps package implements others.

Brewer & Hanif (1983) Sampling with unequal probabilities, Springer give 50 algorithms.

>> The default S-PLUS algorithm does that. The selection probabilities
>> at each of step 1, 2, ..., size are all equal to prob, and the overall
>> probabilities of selection are size*prob.
>
>Umm, no, they aren't.
>
>Splus 7.0.3 doesn't say explicitly what its algorithm is, but is happy to
>take a sample of size 10 from a population of size 10 with unequal
>sampling probabilities. The overall selection probability *can't* be
>anything other than 1 for each element -- sampling without replacement and
>proportional to any other set of probabilities is impossible.

The default S-PLUS algorithm when probabilities are supplied is sampling with minimal replacement, not without replacement. "Minimal replacement" is not interpreted strictly - duplicates may occur if (size*max(prob)>1), so that selection probabilities are right at every step:
> 10 * (1:10 / sum(1:10)) # expected counts
 [1] 0.1818182 0.3636364 0.5454545 0.7272727 0.9090909 1.0909091 1.2727273  [8] 1.4545455 1.6363636 1.8181818
> sample(10, size=10, prob = 1:10) # default arguments give minimal replacement
 [1] 9 9 5 10 3 10 7 6 7 8

One correction to my statement - in the case that size * max(prob) > 1, replace replace "overall probabilities of selection are size*prob" with "expected number of times each observation is selected is size*prob".

In contrast, minimal=F gives the same algorithm as R, without replacement:
> sample(10, size=10, prob = 1:10, minimal = F)
 [1] 8 10 9 6 7 3 4 5 1 2
Note that the later draws tend to be observations with small prob, because that is what is left over.

Here is a quick simulation to demonstrate selection probabilities:
> values <- sapply(1:10^3, function(i) sample(5, size=3, prob = 1:5))
> apply(values, 1, table) / 10^3

In S-PLUS, with minimal replacement

   [,1] [,2] [,3]

1 0.068 0.060 0.065
2 0.145 0.140 0.137
3 0.181 0.214 0.197
4 0.274 0.243 0.276
5 0.332 0.343 0.325

In R:

   [,1] [,2] [,3]

1 0.051 0.082 0.117
2 0.140 0.159 0.211
3 0.208 0.209 0.230
4 0.269 0.261 0.230
5 0.332 0.289 0.212

Column k corresponds to the kth draw. Note that in S+ all columns match the specified probabilities prob = (1:5/15), while in R only the first column does.

>Splus 7.0.3 doesn't say explicitly what its algorithm is, ...

I'll add it to the help file, and post to r-devel.

>Even in a milder case -- samples of size 5 from 1:10 with probabilities
>proportional to 1:10 -- the deviation is noticeable in 1000 replications.
>In this case sampling with the specified probabilities is actually
>possible, but S-PLUS doesn't do it.

I think you're looking at the case of sampling without replacement, not sampling with minimal replacement.

For sampling without replacement S-PLUS uses the same algorithm as R -- at each step, draw an observation with probabilities proportional to prob, omitting observations already drawn.

>Now, it might be useful to add another replace=FALSE sampler to sample(),
>such as the newish Conditional Poisson Sampler based on the work of
>S.X.Chen. This does give correct marginal probabilities of inclusion, and
>the pairwise joint probabilities are not too hard to compute.

That could be interesting. The pairwise joint probabilities are important in some applications. Are all the pairwise joint probabilities nonzero (when size*prob allows that)?

Pedro J. Saavedra (2005)
Comparison of Two Weighting Schemes for Sampling with Minimal Replacement http://www.amstat.org/Sections/Srms/Proceedings/y2005/Files/JSM2005-000882.pdf finds that the other scheme has some advantages over the one I used.

A historical note - Saavedra notes that Chromy (1979) had the concept of sampling with minimal replacement under a different name. I used the term in the S+Resample version released 8/26/2002.

Note that any algorithm for sampling without replacement can be extended to sampling with minimal replacement, by first drawing observations using the integer part of (size*prob), then reducing size and adjusting prob prior to random drawing.

>I don't think that dropping the current sequential PPS implementation is
>a good idea. The help page does explain the algorithm, though it might be
>useful to add an explicit note that the marginal probabilities of sampling
>are not the supplied probabilities.
>
> -thomas

I don't think it should be dropped; it should be available for backward compatibility. But it should not be the default.

Tim



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Thu 07 Feb 2008 - 19:30:35 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 07 Feb 2008 - 21:30:12 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive