Re: [R] Sampling

From: Thomas Lumley
Date: Thu, 7 Feb 2008 09:34:08 -0800 (PST)

On Wed, 6 Feb 2008, Tim Hesterberg wrote:

>> Tim Hesterberg wrote:
>>> I'll raise a related issue - sampling with unequal probabilities,
>>> without replacement. R does the wrong thing, in my opinion:
>>> ...
>> Peter Dalgaard wrote:
>> But is that the right thing? ...
> (See bottom for more of the previous messages.)
> First, consider the common case, where size * max(prob) < 1 --
> sampling with unequal probabilities without replacement.
> Why do people do sampling with unequal probabilities, without
> replacement? A typical application would be sampling with probability
> proportional to size, or more generally where the desire is that
> selection probabilities match some criterion.

In real survey PPS sampling it also matters what the pairwise joint selection probabilities are -- and there are *many* algorithms, with different properties. Yves Tillé has written an R package that implements some of them, and the pps package implements others.

> The default S-PLUS algorithm does that. The selection probabilities
> at each of step 1, 2, ..., size are all equal to prob, and the overall
> probabilities of selection are size*prob.

Umm, no, they aren't.

S-PLUS 7.0.3 doesn't say explicitly what its algorithm is, but is happy to take a sample of size 10 from a population of size 10 with unequal sampling probabilities. The overall selection probability *can't* be anything other than 1 for each element -- sampling without replacement and proportional to any other set of probabilities is impossible.

Even in a milder case -- samples of size 5 from 1:10 with probabilities proportional to 1:10 -- the deviation is noticeable in 1000 replications. In this case sampling with the specified probabilities is actually possible, but S-PLUS doesn't do it.
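
The two cases above can be checked with a little arithmetic. The following is a minimal sketch in Python (standard library only), assuming the target inclusion probabilities are size*prob with prob proportional to 1:10. The function names are made up for illustration. The check is only the necessary condition that no target exceeds 1.

```python
from fractions import Fraction

def target_inclusion_probs(weights, size):
    """Return size * prob for each unit, where prob = weight / sum(weights)."""
    total = sum(weights)
    return [Fraction(size * w, total) for w in weights]

def targets_are_possible(weights, size):
    """Necessary condition: an inclusion probability cannot exceed 1."""
    return all(p <= 1 for p in target_inclusion_probs(weights, size))

weights = list(range(1, 11))  # probabilities proportional to 1:10

# Size 10 from a population of 10: some targets exceed 1, so the only
# achievable inclusion probability is 1 for every element.
print(targets_are_possible(weights, 10))

# Size 5: every target is below 1, so the specified probabilities are attainable.
print(targets_are_possible(weights, 5))
print([str(p) for p in target_inclusion_probs(weights, 5)])
```

The targets always sum to the sample size, which is a quick sanity check on any fixed-size design.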

Now, it might be useful to add another replace=FALSE sampler to sample(), such as the newish Conditional Poisson Sampler based on the work of S. X. Chen. This does give correct marginal probabilities of inclusion, and the pairwise joint probabilities are not too hard to compute.
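
For readers unfamiliar with the idea, conditional Poisson sampling can be sketched as a rejective scheme: select each unit independently, and keep the result only when exactly the required number of units was selected. The Python below is a rough illustration under that assumption, not the algorithm of any particular package. The function name, the seed, and the working probabilities are hypothetical. Note that the inclusion probabilities of the accepted samples generally differ from the working probabilities, so a further step (not shown here) is needed to choose working probabilities that reproduce the desired inclusion probabilities.

```python
import random

def conditional_poisson_sample(working_probs, size, rng, max_tries=100000):
    """Rejective sampler: independent Bernoulli draws, accepted only
    when exactly `size` units are selected."""
    for _ in range(max_tries):
        picked = [i for i, p in enumerate(working_probs) if rng.random() < p]
        if len(picked) == size:
            return picked
    raise RuntimeError("no sample of the requested size was accepted")

rng = random.Random(2008)  # arbitrary seed, for reproducibility only

# Illustrative working probabilities only: 5 * (1:10) / 55. They are NOT
# tuned to give these values as the final inclusion probabilities.
working = [5 * w / 55 for w in range(1, 11)]

sample = conditional_poisson_sample(working, 5, rng)
print(sample)  # zero-based indices of the selected units
```

Rejection is the simplest way to state the design, though it can be slow when the chance of hitting the required size is small.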

I don't think that dropping the current sequential PPS implementation is a good idea. The help page does explain the algorithm, though it might be useful to add an explicit note that the marginal probabilities of sampling are not the supplied probabilities.
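
To make the last point concrete, here is a small sketch in Python that computes exact inclusion probabilities for sequential sampling, assuming the scheme is: draw one unit with probability proportional to the weights of the units still remaining, remove it, and repeat. The function name is made up for illustration, and the code enumerates every ordered sample, so it is only practical for small cases such as size 5 from 1:10.

```python
from fractions import Fraction

def sequential_inclusion_probs(weights, size):
    """Exact inclusion probabilities when units are drawn one at a time,
    each draw proportional to the weights of the units still remaining."""
    n = len(weights)
    incl = [Fraction(0)] * n

    def recurse(chosen, remaining_weight, path_prob, draws_left):
        if draws_left == 0:
            # Credit this ordered sample's probability to each unit in it.
            for i in chosen:
                incl[i] += path_prob
            return
        for i in range(n):
            if i not in chosen:
                step = Fraction(weights[i], remaining_weight)
                recurse(chosen + [i], remaining_weight - weights[i],
                        path_prob * step, draws_left - 1)

    recurse([], sum(weights), Fraction(1), size)
    return incl

weights = list(range(1, 11))
incl = sequential_inclusion_probs(weights, 5)

# Compare with the nominal size * prob for each unit.
for w, p in zip(weights, incl):
    print(w, round(float(p), 4), round(5 * w / 55, 4))
```

Under this scheme the heaviest unit is included less often than size*prob and the lightest unit more often, which is the kind of discrepancy a note on the help page would warn about.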


Thomas Lumley
Assoc. Professor, Biostatistics
University of Washington, Seattle

Received on Thu 07 Feb 2008 - 17:36:50 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 07 Feb 2008 - 20:30:12 GMT.
