From: Adaikalavan Ramasamy <ramasamy_at_cancer.org.uk>

Date: Tue, 08 May 2007 15:12:44 +0100


Sorry, your original post did not explain that your weights correspond to frequencies. I assumed they were repeated measurements with within-group variation.

I was merely responding to your query about why the following two calls give different results.

summary(lm(y ~ x, data=df, weights=rep(2, 100)))
summary(lm(y ~ x, data=rbind(df,df)))

Let me also clarify my statement about "artificial". If one treats repeated observations as independent, one obtains estimates with inflated precision. I was not calling your data artificial in any way.
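A minimal sketch of the comparison above, regenerating data the way the original post did. The seed and the names n, se.w and se.u are assumptions added for illustration. It suggests that the two fits share coefficient estimates and differ only in their residual degrees of freedom, n - 2 versus 2n - 2, so the standard errors differ by a factor of sqrt((2n - 2)/(n - 2)), about 1.42 for n = 100.

```r
# Illustrative sketch; the seed and helper names are assumptions,
# not part of the original thread.
set.seed(1)
n  <- 100
df <- data.frame(x = runif(n, 0, 100))
df$y <- df$x + 1 + rnorm(n, sd = 15)

f.w <- lm(y ~ x, data = df, weights = rep(2, n))  # constant weight of 2
f.u <- lm(y ~ x, data = rbind(df, df))            # every row duplicated

# Both fits give the same coefficient estimates
all.equal(coef(f.w), coef(f.u))

# Residual degrees of freedom: n - 2 versus 2n - 2
c(weighted = df.residual(f.w), duplicated = df.residual(f.u))

# Ratio of standard errors, compared with sqrt((2n - 2) / (n - 2))
se.w <- coef(summary(f.w))[, "Std. Error"]
se.u <- coef(summary(f.u))[, "Std. Error"]
se.w / se.u
sqrt((2 * n - 2) / (n - 2))
```

The same ratio appears between the two sets of standard errors in the output quoted further down in this message.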

Using frequencies as weights may be valid. However, your data points appear to arise from a discrete distribution, so I am not entirely sure that you can use the linear model, which assumes that the errors are normally distributed.
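If the weights really are frequencies, one possible approach is to keep the weighted fit and rescale its covariance matrix so that the residual degrees of freedom are based on sum(weights) rather than on the number of rows. The sketch below is only an illustration under that assumption: the data frame agg, its freq column and the other helper names are hypothetical, and the rescaling treats every repeat as an independent observation. It does not address the normality concern raised above.

```r
# Hypothetical aggregated data: one row per distinct (x, y) pair,
# with `freq` counting how often that pair occurred.
set.seed(1)
agg <- data.frame(x = runif(50, 0, 100))
agg$y    <- agg$x + 1 + rnorm(50, sd = 15)
agg$freq <- sample(1:5, 50, replace = TRUE)

fit <- lm(y ~ x, data = agg, weights = freq)

p      <- length(coef(fit))
df.lm  <- df.residual(fit)       # number of rows - p
df.frq <- sum(agg$freq) - p      # number of underlying observations - p

# Rescale the covariance matrix as if the rows had been expanded
vc.frq <- vcov(fit) * df.lm / df.frq
se.frq <- sqrt(diag(vc.frq))

# Check against a fit on the explicitly expanded data
expanded <- agg[rep(seq_len(nrow(agg)), agg$freq), ]
fit.exp  <- lm(y ~ x, data = expanded)
all.equal(se.frq, coef(summary(fit.exp))[, "Std. Error"])
```

Expanding the rows, as in the final check, gives the same standard errors and may be simpler when the expanded data set is small enough to hold in memory.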

Regards, Adai

hadley wickham wrote:

> On 5/8/07, Adaikalavan Ramasamy <ramasamy@cancer.org.uk> wrote:

>> See below.
>>
>> hadley wickham wrote:
>> > Dear all,
>> >
>> > I'm struggling with weighted least squares, where something that I had
>> > assumed to be true appears not to be the case. Take the following
>> > data set as an example:
>> >
>> > df <- data.frame(x = runif(100, 0, 100))
>> > df$y <- df$x + 1 + rnorm(100, sd=15)
>> >
>> > I had expected that:
>> >
>> > summary(lm(y ~ x, data=df, weights=rep(2, 100)))
>> > summary(lm(y ~ x, data=rbind(df,df)))
>>
>> You assign weights to different points according to some external
>> quality or reliability measure not number of times the data point was
>> measured.

> That is one type of weighting - but what if I have already aggregated
> data? That is a perfectly valid type of weighting too.

>> Look at the estimates and standard error of the two models below:
>>
>> coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
>>             Estimate Std. Error   t value     Pr(>|t|)
>> (Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
>> x           0.982610 0.05893262 16.673448 2.264258e-30
>>
>> coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) )
>>             Estimate Std. Error    t value     Pr(>|t|)
>> (Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
>> x           0.982610 0.04146066 23.6998165 1.012067e-59
>>
>> You can see that they have the same coefficient estimates but the second
>> one has smaller variances.
>>
>> The repeated values artificially deflate the variance and thus inflate
>> the precision. This is why you cannot treat replicate data as
>> independent observations.

> Hardly artificially - I have repeated observations.

>> > would be equivalent, but they are not. I suspect the difference is
>> > how the degrees of freedom is calculated - I had expected it to be
>> > sum(weights), but seems to be sum(weights > 0). This seems
>> > unintuitive to me:
>> >
>> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
>> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
>> >
>> > What am I missing? And what is the usual way to do a linear
>> > regression when you have aggregated data?
>>
>> It would be best to use the individual data points instead of aggregated
>> data as it allows you to estimate the within-group variations as well.

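> There is no within group variation - these are observations that occur
> with same values many times in the dataset, so have been aggregated
> into a contingency table-like format.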

>> If you had individual data points, you could try something as follows.
>> Please check the codes as I am no expert in the area of repeated
>> measures.
>>
>> x <- runif(100, 0, 100)
>> y1 <- x + rnorm(100, mean=1, sd=15)
>> y2 <- y1 + rnorm(100, sd=5)
>>
>> df <- data.frame( y=c(y1, y2),
>>                   x=c(x,x),
>>                   subject=factor(rep( paste("p", 1:100, sep=""), 2 ) ))
>>
>> library(nlme)
>> summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )
>>
>> Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related
>> material for more information. Hope this helps.

> I'm not interested in a mixed model, and I don't have individual data
> points.
>
> Hadley

>

R-help_at_stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Tue 08 May 2007 - 14:31:07 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Tue 08 May 2007 - 15:01:31 GMT.
