# Re: [R] Survey - Cluster Sampling

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Fri 17 Jun 2005 - 01:01:08 EST

On Thu, 16 Jun 2005, Mark Hempelmann wrote:

> Dear WizaRds,
>
> I am struggling to compute correctly a cluster sampling design. I want
> to do one stage clustering with different parametric changes:
>
> Let M be the total number of clusters in the population, and m the
> number sampled. Let N be the total of elements in the population and n
> the number sampled. y are the values sampled. This is my example data:
>
> clus1 <- data.frame(cluster=c(1,1,1,2,2,2,3,3,3), id=seq(1:3,3),
> weight=rep(72/9,9), nl=rep(3,9), Nl=rep(3,9), N=rep(72,9), y=c(23,33,77,
> 25,35,74, 27,37,72) )
>
> 1. Let M=m=3 and N=n=9. Then:
>
> dclus1<-svydesign(id=~cluster, data=clus1)
> svymean(~y, dclus1)
>
> mean SE
> y 44.778 0.294, the unweighted mean, assuming equal probability in the
> clusters. ok.

Yes.

> 2. Let M=23, m=3 and N=72, n=9, then I am unable to use svydesign correctly:
>
> dclus2<-svydesign(id=~cluster, data=clus1, fpc=~N)
> svymean(~y, dclus2)
>
> mean SE
> y 44.778 0.2878, but it should be 23/72 * 1/3(133+134+136)=42.91, since
> I have to include the total number of clusters/total population M/N into
> the estimator. How can I include the information of the total number of
> clusters?

The fpc term should be the total number of clusters, so 23 rather than 72. clus1\$M<-rep(23,9)
dclus2a<-svydesign(id=~cluster, data=clus1, fpc=~M) svymean(~y, dclus2a)

Now, this still gives 44.778, because each observation still has the same weight. It describes a one-stage cluster sampling design where each cluster has only three elements. This is an equal-probability sampling design. Any equal-probability sampling design will give the same estimated mean.

If your design was to take a simple random sample of three clusters and then take all the elements in each cluster then dclus2a is giving the correct mean (well, the one I wanted it to give). Estimates of the population total will be different, but not the mean.

Your expected estimate of the mean is also a reasonable one. In survey statistics there is often more than one reasonable estimator even for something as simple as the mean. My estimator is sum(weights*y)/sum(weights), which has some practical advantages: it is easy to generalise to more complex designs (including things like post-stratification), it can be computed without knowing the sampling design (which is important when using replicate weights to compute variances), it is the definition of the mean that agrees with linear regression models, and it is what Stata uses, making it easier to compare results.

Your estimator uses the expected value of the denominator rather than the observed value. This probably implies that your estimator is design-unbiased and mine isn't. Since there aren't design-unbiased estimators for most statistics more complicated than the mean I don't worry so much about it.

You might also have had a sampling design where you took a simple random sample of three clusters and then up to three elements from each cluster.

dclus2b<-svydesign(id=~cluster+id, fpc=~M+nl, data=clus1) This gives the same mean as dclus2a, because in fact you sampled 100% of each sampled cluster.

> 3. How do I work with weights correctly? I understand that weights imply
> inverse probability weighting 1/p with p=n/N in simple sampling, in
> our case 72/9=8, because I sample 9 units out of a total population of
> 72. Again, I couldn't tell survey the number of total clusters M. So:
>
> dclus3<-svydesign(id=~cluster, weights=~weight, data=clus1, fpc=~N)
> svymean(~y, dclus3)
>
> mean SE
> y 44.778 0.2878, still exactly the same numbers, although I provided the
> weights. What am I doing wrong?

Again, fpc should be M rather than N. The help page says that the relevant population size is in "sampling units" (ie, clusters). It used to say PSUs before the package was extended to handle multistage fpcs, which was probably clearer but now wouldn't be true.

Apart from that you aren't doing anything wrong. The mean should still be the same as the unweighted mean because you are giving each observation the same weight. And it is.

The total won't be the same as dclus2a and dclus2b, because you are now telling R the population size in elements as well as in PSUs.

-thomas

R-help@stat.math.ethz.ch mailing list