Re: [R] Design-consistent variance estimate

From: Doran, Harold <HDoran_at_air.org>
Date: Mon, 18 Aug 2008 10:53:14 -0400


Whoops, the final variance estimator var(f(Y)) should have N^4 in the denominator, not N^2.
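
For reference, a sketch of where the N^4 comes from: multiply out the quadratic form using the two partial derivatives given in the LaTeX quoted below (same symbols, nothing new assumed).

```latex
\begin{equation}
var(f(Y)) = \frac{1}{N^2}var(Y) - \frac{2Y}{N^3}cov(Y,N) + \frac{Y^2}{N^4}var(N)
\end{equation}

\begin{equation}
var(f(Y)) = \frac{N^2var(Y) - 2cov(Y,N)NY + var(N)Y^2}{N^4}
\end{equation}
```

The second line just puts the three terms over the common denominator N^4.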

> -----Original Message-----
> From: r-help-bounces_at_r-project.org
> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Doran, Harold
> Sent: Monday, August 18, 2008 10:47 AM
> To: Stas Kolenikov
> Cc: r-help_at_r-project.org
> Subject: Re: [R] Design-consistent variance estimate
>
> It also turns out that in educational testing, it is rare to
> consider the sampling design and to estimate
> design-consistent standard errors. I appreciate your thoughts
> on this, Stas. As a result, I now have a much clearer picture
> of what R's survey package and SAS proc surveymeans are doing.
> I've copied some minimal LaTeX code below. My R code implementing
> this LaTeX replicates svymean() and the SAS procedures exactly
> under all conditions that I have tested so far for a one-stage
> cluster sample.
>
> It clearly reduces to a simpler expression when cluster
> sizes are equal.
>
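A sketch of that reduction, assuming equal weights so that equal cluster sizes give $\hat{N}_j = \hat{N}_{..}$ for every $j$ (symbols as defined in the LaTeX below). Every deviation $\hat{N}_j - \hat{N}_{..}$ is then zero, so two of the three terms drop out:

```latex
\begin{equation}
var(N) = 0, \qquad cov(Y,N) = 0
\end{equation}

\begin{equation}
var(f(Y)) = \frac{var(Y)}{N^2}
\end{equation}
```
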
> My hat is off to sampling statisticians; this has got to be a
> lot of fun for you :)
>
> ### LaTeX
>
> \documentclass[12pt]{article}
> \usepackage{bm,geometry}
> \begin{document}
>
> In this scenario, the appropriate procedure is to estimate
> design-consistent standard errors. This is accomplished by
> first defining the ratio estimator of the mean as:
>
> \begin{equation}
> f(Y) = \frac{Y}{N}
> \end{equation}
>
> \noindent where $Y$ is the total of the variable and $N$ is
> the population size. Treating both $Y$ and $N$ as random
> variables, a first-order Taylor series expansion of the ratio
> estimator $f(Y)$ can be used to derive the design-consistent
> variance estimator as:
>
> \begin{equation}
> var(f(Y)) = \left[\frac{\partial f(Y)}{\partial Y},
> \frac{\partial f(Y)}{\partial N}\right] \left [ \begin{array}{cc}
> var(Y) & cov(Y,N)\\
> cov(Y,N) & var(N)\\
> \end{array}
> \right]
> \left[\frac{\partial f(Y)}{\partial Y}, \frac{\partial
> f(Y)}{\partial N}\right]^T \end{equation}
>
> \noindent where
>
> \begin{equation}
> \left[\frac{\partial f(Y)}{\partial Y}\right] = \frac{1}{N}
> \end{equation}
>
> \begin{equation}
> \left[\frac{\partial f(Y)}{\partial N}\right] = -
> \frac{Y}{N^2} \end{equation}
>
> \begin{equation}
> var(Y) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})^2
> \end{equation}
>
> \begin{equation}
> \hat{Y}_j = \sum_{i=1}^{n_j}\hat{Y}_{j(i)} \end{equation}
>
> \begin{equation}
> \hat{Y}_{..} = k^{-1} \sum_{j=1}^k \hat{Y}_j \end{equation}
>
> \begin{equation}
> var(N) = \frac{k}{k-1} \sum_{j=1}^k(\hat{N}_j-\hat{N}_{..})^2
> \end{equation}
>
> \begin{equation}
> \hat{N}_j = \sum_{i=1}^{n_j}\hat{N}_{j(i)} \end{equation}
>
> \begin{equation}
> \hat{N}_{..} = k^{-1} \sum_{j=1}^k \hat{N}_j \end{equation}
>
> \begin{equation}
> cov(Y,N) = \frac{k}{k-1} \sum_{j=1}^k(\hat{Y}_j-\hat{Y}_{..})
> (\hat{N}_j-\hat{N}_{..})
> \end{equation}
>
> \noindent where $j = 1, 2, \ldots, k$ indexes the $k$ sampled
> clusters, $j(i)$ indexes the $i$th member of cluster $j$, and
> $n_j$ is the total number of members in cluster $j$.
>
> The estimate of the variance of $f(Y)$ is then taken as:
>
> \begin{equation}
> var(f(Y)) = \frac{N^2var(Y) - 2cov(Y,N)NY + var(N)Y^2 }{N^2}
> \end{equation}
>
> The standard error is then taken as:
>
> \begin{equation}
> se = \sqrt{var(f(Y))}
> \end{equation}
>
> \end{document}
>
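A minimal Python sketch of the estimator in the LaTeX above, for anyone who wants to check the arithmetic by hand. This is not the R code mentioned earlier (that code is not shown in this thread); the function name `cluster_ratio_mean` and the input layout (one list of `(weight, value)` pairs per cluster) are assumptions made for illustration. It uses the N^4 denominator from the correction at the top of this message.

```python
import math


def cluster_ratio_mean(clusters):
    """Ratio mean and linearized variance for a one-stage cluster sample.

    clusters: list of clusters, each a list of (weight, value) pairs.
    This input layout is an assumption for illustration only.
    Returns (mean, variance, standard_error).
    """
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")

    # Weighted cluster totals: Y_j = sum of w*y, N_j = sum of w
    y_j = [sum(w * y for w, y in c) for c in clusters]
    n_j = [sum(w for w, _ in c) for c in clusters]

    Y, N = sum(y_j), sum(n_j)      # estimated totals
    y_bar, n_bar = Y / k, N / k    # means of the cluster totals

    scale = k / (k - 1)
    var_y = scale * sum((a - y_bar) ** 2 for a in y_j)
    var_n = scale * sum((b - n_bar) ** 2 for b in n_j)
    cov_yn = scale * sum((a - y_bar) * (b - n_bar)
                         for a, b in zip(y_j, n_j))

    # Quadratic form expanded, with N^4 in the denominator
    var_f = (N**2 * var_y - 2 * N * Y * cov_yn + Y**2 * var_n) / N**4
    var_f = max(var_f, 0.0)        # guard against tiny negative rounding error
    return Y / N, var_f, math.sqrt(var_f)


# Toy data: three clusters of unequal size, all weights equal to 1
toy = [
    [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0)],
    [(1.0, 4.0), (1.0, 5.0)],
    [(1.0, 6.0)],
]
mean, var_f, se = cluster_ratio_mean(toy)
print(mean, var_f, se)
```

For the toy data the cluster totals are 6, 9, 6 and the cluster sizes 3, 2, 1, so var(Y) = 9, var(N) = 3 and cov(Y,N) = 0, which gives var(f(Y)) = 1647/1296.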
> > -----Original Message-----
> > From: Stas Kolenikov [mailto:skolenik_at_gmail.com]
> > Sent: Monday, August 18, 2008 10:40 AM
> > To: Doran, Harold
> > Cc: r-help_at_r-project.org
> > Subject: Re: [R] Design-consistent variance estimate
> >
> > On 8/16/08, Doran, Harold <HDoran_at_air.org> wrote:
> > > In terms of the "design" (which is a term used loosely) the schools
> > > were not randomly selected. They volunteered to participate
> > > in a pilot study.
> >
> > Oh, that's a next level of disaster, then! You may have to work with
> > treatment effect models, of which there are many:
> > propensity score matching, nearest neighbor matching, instrumental
> > variables, etc.
> > Those methods require asymptotics in terms of number of treatment
> > units, which would be schools -- and I would imagine those are
> > numbered in dozens rather than thousands in your study, so
> > straightforward application of those methods might be problematic...
> > At least I would augment my analysis with propensity score weights:
> > somehow estimate the (school level) probability of participating in
> > the study (I imagine you have the school characteristics at hand for
> > your complete universe of schools -- principal's education level,
> > # of computers per student, fraction free/reduced price lunch,
> > whatever... you probably know those better than I do :) ), and use
> > inverse of that probability as the probability weight. If the
> > selection was informative, you might see quite different results in
> > weighted and unweighted analysis.
> >
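A small Python sketch of the weighting step Stas describes, assuming the school-level participation probabilities have already been estimated somehow (for example from a logistic regression on school characteristics, which is not shown). The numbers and the helper name `ipw_mean` are made up for illustration; they are not from the pilot study.

```python
def ipw_mean(values, probs):
    """Mean of values weighted by the inverse of each unit's
    estimated participation probability."""
    if any(not (0.0 < p <= 1.0) for p in probs):
        raise ValueError("probabilities must be in (0, 1]")
    weights = [1.0 / p for p in probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Hypothetical school-level mean scores and participation probabilities
school_means = [50.0, 60.0, 70.0]
participation_prob = [0.5, 0.25, 0.25]

unweighted = sum(school_means) / len(school_means)
weighted = ipw_mean(school_means, participation_prob)
print(unweighted, weighted)
```

Here the schools that were less likely to volunteer get larger weights, so the weighted mean moves toward their scores. A gap between the two numbers is the kind of difference that would suggest the selection was informative.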
> > > In Wolter (1985) he shows the variance of a cluster sample with a
> > > single stratum and then extends that to the more general example. It
> > > turns out though in many educational assessment studies, the single
> > > stage cluster sample is a norm and not so rare.
> >
> > I can see why. Thanks, I'll keep educational statistics examples in
> > mind for those kinds of designs!
> >
> > --
> > Stas Kolenikov, also found at http://stas.kolenikov.name
> > Small print: I use this email account for mailing lists only.
> >
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Received on Mon 18 Aug 2008 - 15:04:06 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 18 Aug 2008 - 15:33:52 GMT.