# Re: [R] Sampling

From: Tim Hesterberg <timh_at_insightful.com>
Date: Wed, 06 Feb 2008 10:49:24 -0800

> I want to generate different samples using the
>following code:
>
>g<-sample(LETTERS[1:2], 24, replace=T)
>
> How can I specify that I need 12 "A"s and 12 "B"s?

I introduced the concept of "sampling with minimal replacement" into the S-PLUS version of sample to handle things like this:

sample(LETTERS[1:2], 24, minimal = T)

This is very useful in variance reduction applications, to approximately stratify but without introducing bias. I'd like to see this in R.
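For the balanced case in the question, plain R can already do the job by randomly permuting a vector that contains exactly 12 of each letter. A minimal sketch (this is ordinary sampling without replacement from a pre-built vector, not an R equivalent of the minimal argument):

```r
# Build a vector with exactly 12 "A"s and 12 "B"s, then shuffle it.
# sample(x) with no size argument returns a random permutation of x.
g <- sample(rep(LETTERS[1:2], each = 12))
table(g)
```

Each call gives a different ordering, but the counts are always 12 and 12.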

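In R, when you sample without replacement and supply unequal probabilities, e.g. sample(1:3, size=2, prob = c(.5, .25, .25)), the selection probabilities are not proportional to the specified probabilities.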

In contrast, in S-PLUS:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)

   1    2    3
1000  501  499

You can specify minimal = FALSE to get the same behavior as R:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25), minimal = F))
> table(values)

   1    2    3
 844  592  564
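The counts from ordinary sampling without replacement can be checked by hand. R's documentation for sample says that, without replacement, the probabilities are applied sequentially: each draw is made in proportion to the weights of the items still remaining. Under that rule, and only for size = 2, the chance that item j ends up in the sample is P(j first) plus the sum over the other items i of P(i first) * P(j second | i first). A small sketch (incl_prob2 is a name made up for illustration, and it assumes prob sums to 1):

```r
# Exact inclusion probabilities for a sample of size 2 drawn
# sequentially without replacement with unequal probabilities.
# Assumes prob sums to 1; incl_prob2 is a hypothetical helper name.
incl_prob2 <- function(prob) {
  n <- length(prob)
  sapply(seq_len(n), function(j) {
    others <- setdiff(seq_len(n), j)
    # drawn first, or drawn second after some other item i came first
    prob[j] + sum(prob[others] * prob[j] / (1 - prob[others]))
  })
}

p <- incl_prob2(c(.5, .25, .25))
round(1000 * p)   # expected counts over 1000 samples
```

This gives inclusion probabilities of 5/6, 7/12 and 7/12, i.e. roughly 833, 583 and 583 per 1000 samples, in line with the simulated counts and well short of the 1000, 500, 500 that proportionality would require.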

There is a reason why getting proportional selection probabilities is tied to the concept of sampling with minimal replacement. Consider for example:

sample(1:4, size = 3, prob = 1:4/10)

The expected frequencies of (1,2,3,4) should be proportional to the specified probabilities, i.e. equal to size*prob = c(.3,.6,.9,1.2). That isn't possible when sampling without replacement, because no observation can appear more than once and observation 4 needs an expected frequency of 1.2. Sampling with minimal replacement allows this; observation 4 is included in every sample, and is included twice in 20% of the samples.
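To make the idea concrete, here is a rough R sketch of one way to sample with minimal replacement. It is not the S-PLUS implementation; sample_minimal is a hypothetical name, and the use of systematic sampling for the fractional parts is my own choice for illustration. Each item gets floor(size*prob) copies for certain, and the leftover draws are allocated so that each item gets one extra copy with probability equal to the fractional part of its expected count.

```r
# Sketch of sampling with minimal replacement (hypothetical helper,
# not the S-PLUS algorithm). Every item appears either floor(e) or
# ceiling(e) times, where e = size * prob is its expected count.
sample_minimal <- function(x, size, prob = rep(1 / length(x), length(x))) {
  prob <- prob / sum(prob)
  expected <- size * prob
  base <- floor(expected + 1e-9)        # guaranteed copies (small guard for rounding error)
  frac <- pmax(expected - base, 0)      # fractional parts, each < 1
  m <- size - sum(base)                 # number of extra draws still to allocate
  extra <- integer(length(x))
  if (m > 0) {
    ord <- sample.int(length(x))        # random order of the items
    cum <- cumsum(frac[ord])
    pts <- runif(1) + 0:(m - 1)         # systematic sampling: random start, unit spacing
    hit <- pmin(findInterval(pts, cum) + 1, length(x))
    extra[ord[hit]] <- 1L               # each hit item gets one extra copy
  }
  out <- rep(x, times = base + extra)
  out[sample.int(length(out))]          # return in random order
}

g <- sample_minimal(LETTERS[1:2], 24)           # always 12 "A"s and 12 "B"s
s <- sample_minimal(1:4, size = 3, prob = 1:4/10)
```

In the second call observation 4 always appears at least once and never more than twice, and observations 1 to 3 appear at most once each. Because the fractional parts are all below 1 and the systematic points are one unit apart, no item can receive more than one extra copy.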

Tim Hesterberg

Disclaimer - these are my opinions, not those of my employer.

