# Re: [R] Weighted least squares

Date: Tue, 08 May 2007 15:12:44 +0100

Sorry, you did not explain in the original post that your weights correspond to frequencies. I assumed they were repeated measurements with within-group variation.

I was merely responding to your query about why the following two fits differed.

    summary(lm(y ~ x, data=df, weights=rep(2, 100)))
    summary(lm(y ~ x, data=rbind(df,df)))
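A minimal sketch of where the difference comes from, assuming the simulated `df` from the original post (the seed is arbitrary and only there for reproducibility). Both fits minimise the same weighted sum of squares, so the coefficients agree, but `lm` counts 100 - 2 residual degrees of freedom for the weighted fit and 200 - 2 for the duplicated data, and the standard errors differ by the square root of that ratio.

```r
set.seed(1)  # arbitrary seed, not part of the original post
df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd = 15)

f.w <- lm(y ~ x, data = df, weights = rep(2, 100))  # constant weight of 2
f.u <- lm(y ~ x, data = rbind(df, df))              # every row duplicated

# The coefficient estimates are identical
all.equal(coef(f.w), coef(f.u))

# The residual degrees of freedom are not
c(weighted = df.residual(f.w), duplicated = df.residual(f.u))

# The ratio of the standard errors matches sqrt(df.w / df.u)
se.w <- coef(summary(f.w))[, "Std. Error"]
se.u <- coef(summary(f.u))[, "Std. Error"]
se.u / se.w
sqrt(df.residual(f.w) / df.residual(f.u))
```

The same ratio can be read off the output quoted further down: 2.32408609 / 3.30348066 is about 0.7035, which is sqrt(98 / 198).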

Let me also clarify my statement about "artificial". If one treats repeated observations as independent, one obtains estimates with inflated precision. I was not calling your data artificial in any way.

Using frequencies as weights may be valid. Your data points appear to arise from a discrete distribution, so I am not entirely sure whether you can use the linear model, which assumes that the errors are normally distributed.
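If the weights really are frequencies, one possible way to recover the standard errors of the expanded data from the aggregated fit is to rescale them by the residual degrees of freedom. The sketch below uses a made-up aggregated table (`agg` and its `freq` column are hypothetical, not the poster's data) and checks the rescaled values against a fit on the expanded rows. It does not address whether normal errors are appropriate for discrete data.

```r
# Hypothetical aggregated data: each (x, y) pair occurs `freq` times
agg <- data.frame(x    = c(1, 2, 3, 4, 5, 6),
                  y    = c(2.1, 2.9, 4.2, 4.8, 6.1, 6.8),
                  freq = c(3, 1, 4, 2, 5, 2))

# Fit on the aggregated table, using the counts as weights
fit.w <- lm(y ~ x, data = agg, weights = freq)

# Reference fit: expand the table back to one row per observation
expanded <- agg[rep(seq_len(nrow(agg)), times = agg$freq), c("x", "y")]
fit.e <- lm(y ~ x, data = expanded)

p <- length(coef(fit.w))  # number of estimated coefficients
N <- sum(agg$freq)        # number of underlying observations

# lm() divides the weighted residual sum of squares by (rows - p);
# for frequency weights the divisor should be (N - p) instead
se.w    <- coef(summary(fit.w))[, "Std. Error"]
se.freq <- se.w * sqrt(df.residual(fit.w) / (N - p))

# The rescaled standard errors agree with the expanded-data fit
all.equal(se.freq, coef(summary(fit.e))[, "Std. Error"])
```

The t values and p-values would need the same adjustment, with N - p as the reference degrees of freedom.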

> On 5/8/07, Adaikalavan Ramasamy <ramasamy@cancer.org.uk> wrote:

```
>> See below.
>>
>> > Dear all,
>> >
>> > I'm struggling with weighted least squares, where something that I had
>> > assumed to be true appears not to be the case.  Take the following
>> > data set as an example:
>> >
>> > df <- data.frame(x = runif(100, 0, 100))
>> > df$y <- df$x + 1 + rnorm(100, sd=15)
>> >
>> > I had expected that:
>> >
>> > summary(lm(y ~ x, data=df, weights=rep(2, 100)))
>> > summary(lm(y ~ x, data=rbind(df,df)))
>>
>> You assign weights to different points according to some external
>> quality or reliability measure, not the number of times the data point
>> was measured.
```

>
> That is one type of weighting - but what if I have already aggregated
> data? That is a perfectly valid type of weighting too.
>
```
>> Look at the estimates and standard error of the two models below:
>>
>>   coefficients( summary(f.w <- lm(y ~ x, data=df, weights=rep(2, 100))) )
>>               Estimate Std. Error   t value     Pr(>|t|)
>>   (Intercept) 1.940765 3.30348066  0.587491 5.582252e-01
>>   x           0.982610 0.05893262 16.673448 2.264258e-30
>>
>>   coefficients( summary( f.u <- lm(y ~ x, data=rbind(df,df) ) ) )
>>               Estimate Std. Error    t value     Pr(>|t|)
>>   (Intercept) 1.940765 2.32408609  0.8350659 4.046871e-01
>>   x           0.982610 0.04146066 23.6998165 1.012067e-59
>>
>> You can see that they have the same coefficient estimates but the second
>> one has smaller standard errors.
>>
>> The repeated values artificially deflate the variance and thus inflate
>> the precision. This is why you cannot treat replicate data as
>> independent observations.
```

>
> Hardly artificially - I have repeated observations.
>
```
>> > would be equivalent, but they are not.  I suspect the difference is
>> > how the degrees of freedom is calculated - I had expected it to be
>> > sum(weights), but seems to be sum(weights > 0).  This seems
>> > unintuitive to me:
>> >
>> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
>> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
>> >
>> > What am I missing?  And what is the usual way to do a linear
>> > regression when you have aggregated data?
>>
>> It would be best to use the individual data points instead of aggregated
>> data, as that allows you to estimate the within-group variation as well.
```
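The degrees-of-freedom behaviour described in the quoted question can be checked directly. In this sketch (the seed and the names `f.zero` and `f.small` are illustrative), observations with zero weight are left out of the residual degrees of freedom, while a tiny positive weight keeps them in.

```r
set.seed(1)  # arbitrary seed, not part of the original post
df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd = 15)

# First 50 rows get weight 0 (or 0.01), last 50 rows get weight 2
f.zero  <- lm(y ~ x, data = df, weights = rep(c(0, 2), each = 50))
f.small <- lm(y ~ x, data = df, weights = rep(c(0.01, 2), each = 50))

df.residual(f.zero)   # 50 - 2: zero-weight rows are not counted
df.residual(f.small)  # 100 - 2: all rows are counted
```

So the residual degrees of freedom depend on how many rows have non-zero weight, not on the sum of the weights.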

>
> There is no within-group variation - these are observations that occur
> with the same values many times in the dataset, so they have been
> aggregated into a contingency table-like format.
>
```
>> If you had individual data points, you could try something like the
>> following. Please check the code, as I am no expert in the area of
>> repeated measures.
>>
>>   x  <- runif(100, 0, 100)
>>   y1 <- x + rnorm(100, mean=1, sd=15)
>>   y2 <- y1 + rnorm(100, sd=5)
>>
>>   df <- data.frame( y=c(y1, y2),
>>                     x=c(x,x),
>>                     subject=factor(rep( paste("p", 1:100, sep=""), 2 ) ))
>>
>>   library(nlme)
>>   summary( lme( y ~ x, random = ~ 1 | subject, data=df ) )
>>
>> Try reading Pinheiro and Bates (http://tinyurl.com/yvvrr7) or related
```

>
> I'm not interested in a mixed model, and I don't have individual data
> points.
>