From: hadley wickham <h.wickham_at_gmail.com>

Date: Wed, 09 May 2007 08:21:11 +0200

R-help_at_stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Wed 09 May 2007 - 06:25:28 GMT

Thanks John,

That's just the explanation I was looking for. I had hoped that there would be a built-in way of dealing with them in R, but obviously not.

Given that explanation, it still seems to me that the way R calculates n is suboptimal, as demonstrated by my second example:

summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))

The weights are only very slightly different, but the estimates of residual standard error are quite different (20 vs 14 in my run).

Hadley
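
The difference described above can be reproduced with a short sketch. It assumes the same simulated data as the original post; the seed, the object names (fit_zero, fit_small) and the helper rse_by_hand() are illustrative additions, not part of the thread. It shows that lm() drops zero-weight rows from the residual degrees of freedom, so moving a weight from 0 to 0.01 changes the denominator of the residual variance from 48 to 98 while the weighted residual sum of squares barely moves.

```r
# Seed is an assumption, added only so the sketch is reproducible.
set.seed(1)
df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd = 15)

fit_zero  <- lm(y ~ x, data = df, weights = rep(c(0, 2), each = 50))
fit_small <- lm(y ~ x, data = df, weights = rep(c(0.01, 2), each = 50))

# Residual degrees of freedom: number of non-zero weights minus 2 coefficients.
df.residual(fit_zero)   # 48
df.residual(fit_small)  # 98

# Hypothetical helper: residual standard error as
# sqrt(weighted RSS / residual df), to compare with summary()'s value.
rse_by_hand <- function(fit) {
  w <- weights(fit)
  sqrt(sum(w * residuals(fit)^2) / df.residual(fit))
}

c(zero = rse_by_hand(fit_zero), small = rse_by_hand(fit_small))
c(zero = summary(fit_zero)$sigma, small = summary(fit_small)$sigma)
```

The two vectors printed at the end should agree, which suggests the gap between the two fits comes from the degrees of freedom rather than from the residuals themselves.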

On 5/8/07, John Fox <jfox_at_mcmaster.ca> wrote:

> Dear Hadley,
>
> I think that the problem is that the term "weights" has different meanings,
> which, although they are related, are not quite the same.
>
> The weights used by lm() are (inverse-)"variance weights," reflecting the
> variances of the errors, with observations that have low-variance errors
> therefore being accorded greater weight in the resulting WLS regression.
> What you have are sometimes called "case weights," and I'm unaware of a
> general way of handling them in R, although you could regenerate the
> unaggregated data. As you discovered, you get the same coefficients with
> case weights as with variance weights, but different standard errors.
> Finally, there are "sampling weights," which are inversely proportional to
> the probability of selection; these are accommodated by the survey package.
>
> To complicate matters, this terminology isn't entirely standard.
>
> I hope this helps,
> John
>
> --------------------------------
> John Fox, Professor
> Department of Sociology
> McMaster University
> Hamilton, Ontario
> Canada L8S 4M4
> 905-525-9140x23604
> http://socserv.mcmaster.ca/jfox
> --------------------------------
>
> > -----Original Message-----
> > From: r-help-bounces_at_stat.math.ethz.ch
> > [mailto:r-help-bounces_at_stat.math.ethz.ch] On Behalf Of hadley wickham
> > Sent: Tuesday, May 08, 2007 5:09 AM
> > To: R Help
> > Subject: [R] Weighted least squares
> >
> > Dear all,
> >
> > I'm struggling with weighted least squares, where something
> > that I had assumed to be true appears not to be the case.
> > Take the following data set as an example:
> >
> > df <- data.frame(x = runif(100, 0, 100))
> > df$y <- df$x + 1 + rnorm(100, sd=15)
> >
> > I had expected that:
> >
> > summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> > summary(lm(y ~ x, data=rbind(df,df)))
> >
> > would be equivalent, but they are not. I suspect the
> > difference is how the degrees of freedom is calculated - I
> > had expected it to be sum(weights), but seems to be
> > sum(weights > 0). This seems unintuitive to me:
> >
> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
> >
> > What am I missing? And what is the usual way to do a linear
> > regression when you have aggregated data?
> >
> > Thanks,
> >
> > Hadley
> >
> > ______________________________________________
> > R-help_at_stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
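
The distinction drawn in the quoted reply (same coefficients, different standard errors) can be checked with a minimal sketch. It assumes positive integer case weights and the simulated data from the original post; the seed and the names w, fit_w, fit_dup, fit_expanded, df_case and se_case are illustrative, not from the thread. It regenerates the unaggregated data by repeating rows, and also shows one possible shortcut: because both fits share the same weighted residual sum of squares and the same X'WX matrix, the standard errors differ only through the residual degrees of freedom, so they can be rescaled by hand.

```r
# Seed is an assumption, added only so the sketch is reproducible.
set.seed(1)
df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd = 15)

w <- rep(2, 100)  # treated here as case (frequency) weights

# lm() treats w as variance weights.
fit_w <- lm(y ~ x, data = df, weights = w)

# Regenerating the unaggregated data: repeat each row w[i] times.
expanded <- df[rep(seq_len(nrow(df)), times = w), ]
fit_expanded <- lm(y ~ x, data = expanded)

# Equivalent to the rbind(df, df) fit from the original post.
fit_dup <- lm(y ~ x, data = rbind(df, df))

# Coefficients agree; standard errors do not.
all.equal(coef(fit_w), coef(fit_dup))
se_w   <- sqrt(diag(vcov(fit_w)))
se_dup <- sqrt(diag(vcov(fit_dup)))

# Rescale the WLS standard errors to the case-weight degrees of freedom,
# sum(w) minus the number of coefficients (valid only under the
# integer case-weight assumption stated above).
df_case <- sum(w) - length(coef(fit_w))
se_case <- se_w * sqrt(df.residual(fit_w) / df_case)
all.equal(se_case, se_dup)
```

For sampling weights, neither approach applies; the quoted reply points to the survey package for that case.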


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Wed 09 May 2007 - 12:31:31 GMT.
