# Re: [R] Survey - twophase

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Tue 06 Jun 2006 - 02:42:19 EST

On Mon, 5 Jun 2006, Mark Hempelmann wrote:

> Dear WizaRds,
>
> I am struggling with the use of twophase in package survey. My goal
> is to compute a simple example in two phase sampling:
>
> phase 1: I sample n1=1000 circuit boards and find 80 non functional
> phase 2: Given the n1=1000 sample I sample n2=100 and find 15 non
> functional. Let's say, phase 2 shows this result together with phase 1:
>                    phase1
>                    ok   defunct   sum
> phase2 ok          85      0       85
>        defunct      5     10       15
> sum                90     10      100
>
> That is in R:
> fail <- data.frame(id=1:1000 , x=c(rep(0,920), rep(1,80)),
> y=c(rep(0,985), rep(1,15)), n1=rep(1000,1000), n2=rep(100,1000),
> N=rep(5000,1000))
>
> des.fail <- twophase(id=list(~id,~id), data=fail, subset=~I(x==1))
> # fpc=list(~n1,~n2)

The second-phase sample is described by subset=~I(x==1), so you have sampled only 80 in phase two, not 100.

> svymean(~y, des.fail)
>
> gives mean y 0.1875, SE 0.0196, but theoretically,
> we have x.bar1 (phase1)=0.08 and y.bar2 (phase2)=0.15 defect boards.

15/80 = 0.1875: of the 80 boards with x==1 that make up your phase two, 15 have y==1.
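A quick base-R check of these two numbers, reusing the data frame from the question (only the columns needed here are rebuilt; the survey package is not required for this):

```r
# Rebuild the relevant columns of the data frame from the original question.
fail <- data.frame(id = 1:1000,
                   x  = c(rep(0, 920), rep(1, 80)),
                   y  = c(rep(0, 985), rep(1, 15)))

# subset = ~I(x == 1) selects exactly the rows with x == 1 as phase two.
phase2 <- fail[fail$x == 1, ]

n2_actual <- nrow(phase2)    # size of the phase-two sample actually described
ybar2     <- mean(phase2$y)  # unweighted mean of y among those rows

print(c(n2 = n2_actual, ybar = ybar2))
```

The count is 80 rather than 100 because the subset expression, not the n2 column, defines who is in phase two.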

> Two phase sampling assumes some relation between the easily/ fast
> received x-information and the elaborate/ time-consuming y-information,
> say a ratio r=sum y (phase2)/ sum x (phase2)=15/10=1.5 (out of the above
> table)

Not quite. Two-phase sampling is *useful* only where there is a relationship. No relationship is *assumed*.

There are two ways you can take advantage of a relationship. The first is to stratify the phase-two sampling based on phase one information. In this case you need a strata= argument to twophase().
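As a rough sketch of the stratified approach, assume (hypothetically) that phase two was drawn by design as 90 of the 920 "ok" boards and 10 of the 80 "defunct" boards, with y outcomes as in the table from the question. The names `boards` and `in2` are made up for this illustration. The first part computes the weighted mean by hand; the last part shows what the corresponding twophase() call might look like if the survey package is installed.

```r
# Phase one: cheap test x is known for all 1000 boards (80 defunct).
x <- c(rep(0, 920), rep(1, 80))

# Assumed stratified phase two: 90 of the 920 "ok", 10 of the 80 "defunct".
in2 <- c(rep(TRUE, 90), rep(FALSE, 830), rep(TRUE, 10), rep(FALSE, 70))

# Expensive test y is observed only in phase two; counts follow the table
# in the question (5 of the 90 fail, 10 of the 10 fail).
y <- rep(NA_real_, 1000)
y[in2 & x == 0] <- c(rep(1, 5), rep(0, 85))
y[in2 & x == 1] <- 1

boards <- data.frame(id = 1:1000, x = x, y = y, in2 = in2)

# Phase-two weight = phase-one stratum size / phase-two stratum size.
N_h <- table(boards$x)
n_h <- table(boards$x[boards$in2])
w   <- as.numeric(N_h / n_h)[boards$x + 1]

s <- boards$in2
ybar_strat <- sum(w[s] * boards$y[s]) / sum(w[s])
print(ybar_strat)

# The same design expressed with the survey package, if available.
if (requireNamespace("survey", quietly = TRUE)) {
  des <- survey::twophase(id = list(~id, ~id), strata = list(NULL, ~x),
                          data = boards, subset = ~in2)
  print(survey::svymean(~y, des))
}
```

The hand computation weights each stratum back up to its phase-one size, so the defunct stratum contributes 80/1000 of the estimate regardless of how heavily it was sampled.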

The second way to use a relationship is to calibrate phase two to phase one, using the calibrate() function. This is analogous to the regression estimator you describe.
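To make the arithmetic behind this concrete, here is a hand-rolled sketch of the regression estimator (and, for comparison, the ratio estimator from the question), assuming phase two is a simple random sample of 100 boards with the counts from the table. It does not call calibrate(); it only reproduces the point estimates, not the standard errors.

```r
# Phase one: cheap test x on all 1000 boards; 80 fail.
xbar1 <- 80 / 1000

# Phase two (100 boards), expanded from the 2x2 table in the question:
# (x=0, y=0): 85, (x=0, y=1): 5, (x=1, y=0): 0, (x=1, y=1): 10
p2 <- data.frame(x = c(rep(0, 90), rep(1, 10)),
                 y = c(rep(0, 85), rep(1, 5), rep(1, 10)))

xbar2 <- mean(p2$x)
ybar2 <- mean(p2$y)

# Regression estimator: shift the phase-two mean of y by the fitted slope
# times the gap between the phase-one and phase-two means of x.
b <- unname(coef(lm(y ~ x, data = p2))["x"])
ybar_reg <- ybar2 + b * (xbar1 - xbar2)

# Ratio estimator from the question: r = sum(y) / sum(x) in phase two.
r <- sum(p2$y) / sum(p2$x)
ybar_ratio <- r * xbar1

print(c(regression = ybar_reg, ratio = ybar_ratio))
```

Because x is binary, regressing y on x is the same as post-stratifying on x, so the regression estimate equals the phase-one-weighted average of the two within-stratum failure rates.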

A good example to look at is in vignette("epi"). It describes a two-phase sample in which the first phase is about 4000 people in a cancer clinical trial, and the second phase is sampled based on relapse and on disease type ("histology") as determined at the local hospital. Disease type is determined more accurately at a central lab for everyone who relapses, everyone whose locally determined disease type is bad, and 20% of the rest.

There is also an example of calibration, post-stratifying the second phase to the first phase on disease stage, for the same data.

Finally, note that twophase() does not use the unbiased estimator of variance. It uses a modification that is easier to compute for cluster samples, as described in vignette("phase1"). There is no difference if the first phase is sampled from an infinite population (or with replacement), which is the case in vignette("epi").

-thomas

```
Thomas Lumley               Assoc. Professor, Biostatistics
tlumley@u.washington.edu    University of Washington, Seattle
```

______________________________________________
R-help@stat.math.ethz.ch mailing list