Re: [R] Sampling

From: Tim Hesterberg <timh_at_insightful.com>
Date: Wed, 06 Feb 2008 10:49:24 -0800

> I want to generate different samples using the
>following code:
>
>g<-sample(LETTERS[1:2], 24, replace=T)
>
> How can I specify that I need 12 "A"s and 12 "B"s?

I introduced the concept of "sampling with minimal replacement" into the S-PLUS version of sample to handle things like this:

        sample(LETTERS[1:2], 24, minimal = T)

This is very useful in variance reduction applications: it approximately stratifies the sample, but without introducing bias. I'd like to see this in R.
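Until something like that exists in R, the balanced case in the question can be handled by permuting a vector that already has the right counts. The sketch below also generalizes this to the equal-probability case; `sample_minimal` is a hypothetical helper written for illustration, based only on the description of minimal replacement in this post, and is not the S-PLUS implementation.

```r
# Exactly 12 "A"s and 12 "B"s in random order: permute a balanced vector.
g <- sample(rep(LETTERS[1:2], each = 12))
table(g)

# Hypothetical helper (not part of base R or S-PLUS): equal-probability
# sampling with minimal replacement. Every element appears size %/% n times,
# and the remaining size %% n draws are taken without replacement.
sample_minimal <- function(x, size) {
  n <- length(x)
  full <- size %/% n                      # guaranteed copies of each element
  rest <- size %% n                       # leftover draws
  out <- c(rep(x, times = full), x[sample.int(n, rest)])
  out[sample.int(length(out))]            # shuffle the result
}

h <- sample_minimal(LETTERS[1:2], 25)     # 12 or 13 of each letter
```

With `size` a multiple of `length(x)` the counts are exactly equal; otherwise they differ by at most one.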

I'll raise a related issue - sampling with unequal probabilities, without replacement. R does the wrong thing, in my opinion:

> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)

values
  1   2   3
834 574 592

The selection probabilities are not proportional to the specified probabilities.
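These counts are what one would expect if each draw is made with probability proportional to the weights of the items still remaining, which is how R's `sample` documentation describes weighted sampling without replacement. As a rough check under that assumption, the exact inclusion probabilities for `size = 2` can be computed directly:

```r
# Exact inclusion probabilities for size = 2 draws without replacement,
# assuming each draw is proportional to the weights of the remaining items.
prob <- c(.5, .25, .25)
incl <- sapply(seq_along(prob), function(i) {
  others <- setdiff(seq_along(prob), i)
  # item i is drawn first, or drawn second after some other item came first
  prob[i] + sum(prob[others] * prob[i] / (1 - prob[others]))
})
round(1000 * incl)   # expected counts per 1000 samples: 833 583 583
```

Proportional selection would instead require inclusion probabilities of `2 * prob = c(1, .5, .5)`.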

In contrast, in S-PLUS:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25)))
> table(values)

   1   2   3
1000 501 499

You can specify minimal = FALSE to get the same behavior as R:
> values <- sapply(1:1000, function(i) sample(1:3, size=2, prob = c(.5, .25, .25), minimal = F))
> table(values)

  1   2   3
844 592 564

There is a reason this is associated with the concept of sampling with minimal replacement. Consider for example:

        sample(1:4, size = 3, prob = 1:4/10)

The expected frequencies of (1,2,3,4) should be proportional to prob, i.e. equal to size*prob = c(.3,.6,.9,1.2). That isn't possible when sampling without replacement, because no observation can appear more than once in a sample, so the expected frequency of observation 4 cannot exceed 1. Sampling with minimal replacement allows this; observation 4 is included in every sample, and is included twice in 20% of the samples.
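One way to get this behavior in R is sketched below. It is an illustration only, not the S-PLUS algorithm: `sample_minimal_prob` is a hypothetical name, and the use of systematic sampling for the fractional parts is my own choice. Each element gets floor(size*prob) guaranteed copies, and the fractional remainders are used as inclusion probabilities for the leftover draws.

```r
# Hypothetical helper (not part of base R or S-PLUS): sampling with minimal
# replacement and unequal probabilities. Assumes size is a whole number.
sample_minimal_prob <- function(x, size, prob) {
  n <- length(x)
  expected <- round(size * prob / sum(prob), 10)  # expected count per element
  base <- floor(expected)                         # guaranteed copies
  frac <- expected - base                         # leftover inclusion probability
  k <- round(sum(frac))                           # number of leftover draws
  extra <- numeric(n)
  if (k > 0) {
    # Systematic sampling in random order: lay the fractions end to end and
    # pick the elements whose interval contains u, u + 1, ..., u + k - 1.
    ord <- sample.int(n)
    upper <- cumsum(frac[ord])
    upper[n] <- k                                 # guard against rounding drift
    lower <- c(0, upper[-n])
    u <- runif(1)
    extra[ord] <- floor(upper - u) - floor(lower - u)
  }
  out <- rep(x, times = base + extra)
  out[sample.int(length(out))]                    # shuffle the result
}

set.seed(1)
draws <- replicate(2000, sample_minimal_prob(1:4, size = 3, prob = 1:4/10))
freq <- tabulate(draws, nbins = 4) / 2000         # should be near c(.3, .6, .9, 1.2)
```

In this example observation 4 appears in every sample and observation 1 never appears more than once, as described above.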

Tim Hesterberg

Disclaimer - these are my opinions, not those of my employer.


| Tim Hesterberg       Senior Research Scientist       |
| timh_at_insightful.com  Insightful Corp.                |
| (206)802-2319        1700 Westlake Ave. N, Suite 500 |
| (206)283-8691 (fax)  Seattle, WA 98109-3044, U.S.A.  |
|                      www.insightful.com/Hesterberg   |
========================================================
I'll teach short courses:
Advanced Programming in S-PLUS: San Antonio TX, March 26-27, 2008.
Bootstrap Methods and Permutation Tests: San Antonio, March 28, 2008.

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 06 Feb 2008 - 19:11:11 GMT

