Re: [R] Survey and Stratification

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Fri 27 May 2005 - 06:04:32 EST

On Thu, 26 May 2005, Mark Hempelmann wrote:

> Dear WizaRds,
>
> Working through sampling theory, I tried to comprehend the concept of
> stratification and apply it with Survey to a small example. My question is
> more of theoretic nature, so I apologize if this does not fully fit this
> board's intention, but I have come to a complete stop in my efforts and need
> an expert to help me along. Please help:
>
> age<-matrix(c(rep(1,5), rep(2,3), 1:8, rep(3,5), rep(4,3), rep(5,5),
> rep(3,3), rep(15,5), rep(12,3), 23,25,27,21,22, 33,27,29), ncol=6, byrow=F)
> colnames(age)<-c("stratum", "id", "weight", "nh", "Nh", "y")
> age<-as.data.frame(age)

Ok. Assuming that Nh are the population sizes in each stratum, you have 5/15 sampled in stratum 1 and 3/12 in stratum 2.
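The sampling fractions, and the weights they imply, can be checked directly in base R. This is only a sketch: it rebuilds the example data as a data frame (leaving out the precomputed weight column), and the names prob and wt are mine, not something the survey package requires.

```r
# Rebuild the example data from the original post (without the weight column)
age <- data.frame(
  stratum = c(rep(1, 5), rep(2, 3)),
  id      = 1:8,
  nh      = c(rep(5, 5), rep(3, 3)),    # sample size in each stratum
  Nh      = c(rep(15, 5), rep(12, 3)),  # assumed population size in each stratum
  y       = c(23, 25, 27, 21, 22, 33, 27, 29)
)

prob <- age$nh / age$Nh   # sampling probabilities: 1/3 in stratum 1, 1/4 in stratum 2
wt   <- age$Nh / age$nh   # sampling weights: 3 in stratum 1, 4 in stratum 2

all.equal(wt, 1 / prob)   # TRUE: weights are the reciprocals of the probabilities
sum(wt)                   # 27: the weights add up to the total population size
```

Each sampled unit "stands for" wt units of the population, which is why the weights sum to 15 + 12 = 27.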

This can be specified in a number of ways. You can use

   sampling weights of 15/5 and 12/3
   sampling probabilities of 5/15 and 3/12

with or without specifying the finite population correction. The finite population correction can be specified as 15 and 12 or as 5/15 and 3/12, and if the finite population correction is specified the weights are then optional.

So

   d1<-svydesign(ids=~id, strata=~stratum, weight=~I(Nh/nh), data=age)
   d2<-svydesign(ids=~id, strata=~stratum, prob=~I(nh/Nh), data=age)

give the with-replacement design (agreeing with your age.des3) and

   d3<-svydesign(ids=~id, strata=~stratum, weight=~I(Nh/nh), fpc=~Nh,data=age)
   d4<-svydesign(ids=~id, strata=~stratum, prob=~I(nh/Nh), fpc=~Nh,data=age)
   d5<-svydesign(ids=~id, strata=~stratum, weight=~I(Nh/nh), fpc=~I(nh/Nh),data=age)
   d6<-svydesign(ids=~id, strata=~stratum, prob=~I(nh/Nh), fpc=~I(nh/Nh),data=age)
   d7<-svydesign(ids=~id, strata=~stratum, fpc=~Nh,data=age)
   d8<-svydesign(ids=~id, strata=~stratum, fpc=~I(nh/Nh),data=age)

all give the without-replacement design. We get

> svymean(~y,d1)

     mean SE
y 26.296 0.9862
> svymean(~y,d2)

     mean SE
y 26.296 0.9862
> svymean(~y,d3)

     mean SE
y 26.296 0.8364
> svymean(~y,d4)

     mean SE
y 26.296 0.8364
> svymean(~y,d5)

     mean SE
y 26.296 0.8364
> svymean(~y,d6)

     mean SE
y 26.296 0.8364
> svymean(~y,d7)

     mean SE
y 26.296 0.8364
> svymean(~y,d8)

     mean SE
y 26.296 0.8364
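Both standard errors can be reproduced by hand. The sketch below assumes the usual textbook formulas for stratified simple random sampling; it is not the survey package's internal code, and the variable names are mine. The only difference between the two designs is the factor (1 - nh/Nh) applied within each stratum.

```r
# Observations and assumed population sizes for the two strata
y1 <- c(23, 25, 27, 21, 22); N1 <- 15
y2 <- c(33, 27, 29);         N2 <- 12

W  <- c(N1, N2) / (N1 + N2)        # stratum shares of the population
n  <- c(length(y1), length(y2))    # stratum sample sizes
s2 <- c(var(y1), var(y2))          # within-stratum sample variances
f  <- n / c(N1, N2)                # sampling fractions nh/Nh

est    <- sum(W * c(mean(y1), mean(y2)))       # stratified mean
se_wr  <- sqrt(sum(W^2 * s2 / n))              # with replacement (no fpc)
se_wor <- sqrt(sum(W^2 * (1 - f) * s2 / n))    # without replacement (with fpc)

round(c(mean = est, se_wr = se_wr, se_wor = se_wor), 4)
# mean 26.2963, se_wr 0.9862, se_wor 0.8364
```

The with-replacement value matches d1 and d2, and the without-replacement value matches d3 to d8.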

Now, looking at your examples
> ## create survey design object
> age.des1<-svydesign(ids=~id, strata=~stratum, weight=~Nh, data=age)
> svymean(~y, age.des1)
> ## gives mean 25.568, SE 0.9257

This is wrong: the sampling weight is Nh/nh, not Nh.

> age.des2<-svydesign(ids=~id, strata=~stratum, weight=~I(nh/Nh), data=age)
> svymean(~y, age.des2)
> ## gives mean 25.483, SE 0.9227

This is wrong: the sampling weight is Nh/nh. You need prob=~I(nh/Nh) to specify sampling fractions.

> age.des3<-svydesign(ids=~id, strata=~stratum, weight=~weight, data=age)
> svymean(~y, age.des3)
> ## gives mean 26.296, SE 0.9862

This is correct and agrees with d1 and d2.

> age.des4<-svydesign(ids=~id, strata=~stratum, data=age)
> svymean(~y, age.des4)
> ## gives mean 25.875, SE 0.9437

This is a stratified, unweighted mean, i.e. mean(age$y).
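The four point estimates can be reproduced without the survey package, since each is just a weighted mean of y under a different choice of weights. This is a hand check in base R; the vector name means and its element names are mine.

```r
y  <- c(23, 25, 27, 21, 22, 33, 27, 29)
nh <- c(rep(5, 5), rep(3, 3))     # stratum sample sizes
Nh <- c(rep(15, 5), rep(12, 3))   # assumed stratum population sizes

means <- c(
  des1 = weighted.mean(y, Nh),       # weight = Nh          (wrong)
  des2 = weighted.mean(y, nh / Nh),  # weight = nh/Nh       (wrong)
  des3 = weighted.mean(y, Nh / nh),  # weight = Nh/nh       (correct)
  des4 = mean(y)                     # no weights at all
)
round(means, 3)
# des1 25.568, des2 25.483, des3 26.296, des4 25.875
```

Only the weights Nh/nh give each stratum its proper share of the population, 15/27 and 12/27.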

> age.des3 is the only estimator I am able to compute per hand correctly. It is
> stratified random sampling with inverse probability weighting with weight=
> nh/Nh ## sample size/ stratum size.
>
> Basically, I thought the option weight=~Nh as well as weight=~I(nh/Nh) would
> result in the same number, but it does not.

No, it does not. A weight of 3 is not the same as a weight of 1/3. With the finite population correction it is safe to assume that numbers less than 1 are sampling fractions and numbers greater than 1 are population sizes, but this isn't safe when it comes to weights. It is possible that someone could want to use sampling weights less than 1.

>
> I thought the Hansen-Hurwitz estimator per stratum offers the right numbers:
> p1=5/15, p2=3/12, so y1.total=1/5*(3*118), y2.total=1/3*(4*89) and the
> stratified estimator with this design should be: 1/27(y1.total+y2.total),
> obviously wrong.

Since this gives a mean of about 7.02 for numbers around 25 it can't be right. You have divided by sample size twice. You should have

   y1.total<-3*118
   y2.total<-4*89

You will then find that (y1.total+y2.total)/27 is 26.29630, in agreement with svymean().
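To see the effect of the extra division, here are both versions side by side in base R. The names wrong and right are mine, and 118 and 89 are the stratum sums of y from your data.

```r
y1.sum <- 118   # sum of y in stratum 1 (5 observations)
y2.sum <- 89    # sum of y in stratum 2 (3 observations)
N      <- 27    # total population size, 15 + 12

# Original attempt: the weighted totals are divided by the sample sizes again
wrong <- ((1/5) * 3 * y1.sum + (1/3) * 4 * y2.sum) / N

# Corrected: weight Nh/nh times the stratum sum already estimates the stratum total
right <- (3 * y1.sum + 4 * y2.sum) / N

round(c(wrong = wrong, right = right), 4)
# wrong 7.0173, right 26.2963
```

The weight Nh/nh already contains the division by nh, so the stratum sum only needs to be multiplied by it once.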

         -thomas


