Re: [R] Sampling

From: Tim Hesterberg <timh_at_insightful.com>
Date: Wed, 06 Feb 2008 14:35:20 -0800

>Tim Hesterberg wrote:
>>I'll raise a related issue - sampling with unequal probabilities,
>>without replacement. R does the wrong thing, in my opinion:
>>...
>Peter Dalgaard wrote:
>But is that the right thing? ...
(See bottom for more of the previous messages.)

First, consider the common case, where size * max(prob) < 1 -- sampling with unequal probabilities without replacement.

Why do people do sampling with unequal probabilities, without replacement? A typical application would be sampling with probability proportional to size, or more generally where the desire is that selection probabilities match some criterion.

The default S-PLUS algorithm does that. The selection probabilities at each of step 1, 2, ..., size are all equal to prob, and the overall probabilities of selection are size*prob.

The R algorithm (and old S+ algorithm, prior to S+6.2) did not; at step 1 the selection probabilities are correct, but not in subsequent steps. With size = 2 and prob = c(.5, .25, .25), for example, the selection probabilities at step 2 are (.33, .33, .33).

The R/old S+ algorithm is the result of programming "sampling with probability proportional to prob without replacement" literally, sampling greedily at each step, ignoring dependence, without thinking about that fact that that doesn't give the desired unconditional probabilities after step 1. In practice one often needs to know the actual selection probabilities, and they are difficult to compute except in toy problems with small size.

Now turn to the less common case -- possible overselection -- sampling with unequal probabilities with minimal replacement. One example of this would be in multistage sampling. At stage one you select cities with probability proportional to prob. If size*max(prob) > 1, then a city may be selected more than once. If selected more than once, the second stage selects that many individuals from the city.

Another example of both cases is an improvement on SIR (sampling-importance-resampling). In Bayesian simulation, if importance sampling is used, each observation ends up with a value and a weight. Often one would prefer the output to be a sample without weights. In SIR this is done by sampling from the observations with probability proportional to the weight, <<with replacement>>. As a result some observations may be sampled multiple times.

It would be better to do this with minimal replacement. If size*max(prob) <= 1 then no observation is sampled multiple times. If size*max(prob) > 1 then some observations may be sampled multiple times, but there will tend to be less of this (and more unique values sampled) than if sampling with replacement.

Here's a final example of the less common case. This one is more colorful, but a bit long; at the end I'll relate this to another improvement on SIR.

"Splitting" and "Russian Roulette" are complementary techniques that date back to the early days of Monte Carlo simulation, developed for estimating the probability of neutrons or other particles penetrating a shield.

The simplest procedure is to generate a bunch of random neutrons, and follow each on its path through the shield - it flies a random distance until it hits an atom, there is a certain probability of absorption, otherwise it is reflected in a random direction, etc. In the end you count the proportion of neutrons that penetrate. The problem is that the penetration probability is so small that a you have to start simulating with an enormous number of neutrons, to accurately estimate the small probability.

The next improvement is to bias the sampling, and assign each neutron a weight. The weight W starts at 1. When the neutron hits an atom, instead of being absorbed with probability p, multiply W by p. You can also bias the direction, in favor of of the direction of the shield boundary, and counteract that sampling bias by multiplying W by the relative likelihood (likelihood of observed direction under random sampling / likelihood under biased sampling). As a neutron proceeds its W keeps getting up-weighted and down-weighted. The final estimate of penetration is based on the average W. The problem with this is that the variance of W across neutrons can be huge; most W's are very small, and a few are much larger than the others.

That is where Russian roulette and splitting come in. In the middle of a simulated path, if a neutron has a very small W, it undergoes Russian roulette - there is a probability P that it is killed; if not killed its weight is divided by (1-P). Conversely, if the W is relatively large, the particle is split into K simulated particles, each with initial weight W/K, which are subsequently simulated independently. The two techniques together reduce the variance of W across neutrons.

Sampling with unequal probabilities and minimal replacement can be used to similar effect. Sample from the current population of particles with probabilities proportional to 'prob', then divide each particle's W by (size*prob). Particles that are sampled more than once are split.

Now relate that back to Bayesian importance sampling performed sequentially. Say the simulation is split into two stages. After the first stage there are a number of observations with associated weights. At this point the observations can be subject to Russian roulette and splitting, or sampling with minimal replacement and unequal probabilities, to reduce the variance of the weights entering the second stage. The 'prob' could be proportional to W, or some other function such as sqrt(W).

>Tim Hesterberg wrote:
>>I'll raise a related issue - sampling with unequal probabilities,
>>without replacement. R does the wrong thing, in my opinion:
>>
>>> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
>>> table(values)
>>>
>> values
>> 1 2 3
>> 834 574 592
>>
>> The selection probabilities are not proportional to the specified
>> probabilities.
>>
>> In contrast, in S-PLUS:
>>
>>> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
>>> table(values)
>>>
>> 1 2 3
>> 1000 501 499
>>
>Peter Dalgaard wrote:
>But is that the right thing? If you can use prob=c(.6, .2, .2) and get
>1200 - 400 - 400, then I'm not going to play poker with you....
>
>The interpretation in R is that you are dealing with "fat cards", i.e.
>card 1 is twice as thick as the other two, so there is 50% chance of
>getting the _first_ card as a 1 and additionally, (.25+.25)*2/3 to get
>the 2nd card as a 1 for a total of .8333. And since the two cards are
>different the expected number of occurrences of card 1 in 1000 samples
>is 833. What is the interpretation in S-PLUS?



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Wed 06 Feb 2008 - 22:39:02 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 07 Feb 2008 - 18:30:12 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive