From: Thomas Lumley <tlumley_at_u.washington.edu>

Date: Tue 06 Jun 2006 - 02:42:19 EST

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Tue Jun 06 03:57:45 2006

On Mon, 5 Jun 2006, Mark Hempelmann wrote:

> Dear WizaRds,
>
> I am struggling with the use of twophase in package survey. My goal
> is to compute a simple example in two phase sampling:
>
> phase 1: I sample n1=1000 circuit boards and find 80 non functional
> phase 2: Given the n1=1000 sample I sample n2=100 and find 15 non
> functional. Let's say, phase 2 shows this result together with phase 1:
>
>                    phase1
>                    ok   defunct   sum
> phase2 ok          85      0       85
>        defunct      5     10       15
>        sum         90     10      100
>
> That is in R:
> fail <- data.frame(id=1:1000 , x=c(rep(0,920), rep(1,80)),
> y=c(rep(0,985), rep(1,15)), n1=rep(1000,1000), n2=rep(100,1000),
> N=rep(5000,1000))
>
> des.fail <- twophase(id=list(~id,~id), data=fail, subset=~I(x==1))
> # fpc=list(~n1,~n2)

The second-phase sample is described by subset=~I(x==1), so you have sampled only 80 in phase two, not 100.
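A minimal sketch of how phase two could be marked explicitly, so that 100 boards rather than 80 end up in the second phase. The row layout and the indicator column in.phase2 are illustrative assumptions, chosen only so that the data reproduce the 2x2 table in the question; y is left NA for boards that never got the expensive test.

```r
library(survey)

# Phase one: 1000 boards, the cheap test x flags 80 as defective.
fail2 <- data.frame(id = 1:1000, x = rep(c(0, 1), c(920, 80)))

# Phase two (assumed layout): 90 boards with x == 0 and 10 with x == 1.
fail2$in.phase2 <- FALSE
fail2$in.phase2[c(1:90, 921:930)] <- TRUE

# Expensive test y is observed only in phase two.
fail2$y <- NA_real_
fail2$y[1:90] <- rep(c(0, 1), c(85, 5))   # 85 ok, 5 defunct among x == 0
fail2$y[921:930] <- 1                     # all 10 defunct among x == 1

# subset= now selects the 100 phase-two boards, not the 80 with x == 1.
des.srs <- twophase(id = list(~id, ~id), data = fail2, subset = ~in.phase2)
svymean(~y, des.srs)
```

Treated as a simple random subsample like this, the estimated mean of y is the plain phase-two proportion 15/100.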

> svymean(~y, des.fail)
>
> gives mean y 0.1875, SE 0.0196, but theoretically,
> we have x.bar1 (phase1)=0.08 and y.bar2 (phase2)=0.15 defect boards.

That is 15/80 = 0.1875: all 15 boards with y==1 are among the 80 boards with x==1 that your subset keeps.
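The arithmetic can be checked in base R with the x and y vectors exactly as defined in the question (no survey functions involved):

```r
# x and y as constructed in the original data frame
x <- c(rep(0, 920), rep(1, 80))
y <- c(rep(0, 985), rep(1, 15))

sum(x == 1)       # 80 boards selected by subset = ~I(x == 1)
sum(y[x == 1])    # 15 of them have y == 1
mean(y[x == 1])   # 0.1875
```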

> Two phase sampling assumes some relation between the easily/ fast
> received x-information and the elaborate/ time-consuming y-information,
> say a ratio r=sum y (phase2)/ sum x (phase2)=15/10=1.5 (out of the above
> table)

Not quite. Two-phase sampling is *useful* only where there is a relationship. No relationship is *assumed*.

There are two ways you can take advantage of a relationship. The first is to stratify the phase-two sampling based on phase one information. In this case you need a strata= argument to twophase().
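As a rough sketch of the strata= route, suppose the table in the question had come from sampling 90 of the 920 boards with x==0 and 10 of the 80 boards with x==1. That sampling scheme, the row layout, and the column name in.phase2 are all assumptions made for illustration; they are not in the original data.

```r
library(survey)

# Phase one: 1000 boards, cheap test x (80 flagged).
strat <- data.frame(id = 1:1000, x = rep(c(0, 1), c(920, 80)))

# Phase two, stratified on x (assumed): 90 from x == 0, 10 from x == 1.
strat$in.phase2 <- FALSE
strat$in.phase2[c(1:90, 921:930)] <- TRUE
strat$y <- NA_real_
strat$y[1:90] <- rep(c(0, 1), c(85, 5))
strat$y[921:930] <- 1

# No strata at phase one (NULL); phase two is stratified on x.
des.strat <- twophase(id = list(~id, ~id), strata = list(NULL, ~x),
                      data = strat, subset = ~in.phase2)
svymean(~y, des.strat)
```

Because x is known for the whole first phase, the phase-two sampling fractions (90/920 and 10/80) can be worked out from the data, and the estimate becomes the stratum-weighted mean (920*5/90 + 80*10/10)/1000, about 0.131, instead of the raw 15/100.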

The second way to use a relationship is to calibrate phase two to phase one, using the calibrate() function. This is analogous to the regression estimator you describe.
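A sketch of the calibration route, under the assumption that phase two was a simple random subsample of 100 boards that happened to contain 90 with x==0 and 10 with x==1. The data layout and the name in.phase2 are again illustrative, not taken from the original post.

```r
library(survey)

# Phase one: 1000 boards with the cheap test x.
cal.df <- data.frame(id = 1:1000, x = rep(c(0, 1), c(920, 80)))

# Phase two (assumed simple random subsample of 100 boards).
cal.df$in.phase2 <- FALSE
cal.df$in.phase2[c(1:90, 921:930)] <- TRUE
cal.df$y <- NA_real_
cal.df$y[1:90] <- rep(c(0, 1), c(85, 5))
cal.df$y[921:930] <- 1

des.srs2 <- twophase(id = list(~id, ~id), data = cal.df, subset = ~in.phase2)

# Calibrate the phase-two weights so that the phase-two totals of the
# intercept and x match their phase-one values.
des.cal <- calibrate(des.srs2, phase = 2, formula = ~x)
svymean(~y, des.cal)
```

With a single binary x, linear calibration on ~x amounts to post-stratifying phase two on x, so under these assumptions the point estimate should agree with the stratum-weighted value of about 0.131.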

A good example to look at is in vignette("epi"). This describes a two-phase sample where about 4000 people are in the first phase (a cancer clinical trial) and the second phase is then sampled based on relapse and on disease type ("histology") as determined at the local hospital. Disease type is determined more accurately at a central lab for everyone who relapses, everyone whose locally-determined disease type is bad, and 20% of the rest.

There is also an example of calibration, post-stratifying the second phase to the first phase on disease stage, for the same data.

Finally, note that twophase() does not use the unbiased estimator of variance. It uses a modification that is easier to compute for cluster samples, as described in vignette("phase1"). There is no difference if the first phase is sampled from an infinite population (or with replacement), which is the case in vignette("epi").

-thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley@u.washington.edu        University of Washington, Seattle
______________________________________________
R-help@stat.math.ethz.ch mailing list

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
