# Re: [R] glm or transformation of the response?

From: joris meys <jorismeys_at_gmail.com>
Date: Tue, 25 Nov 2008 21:10:16 +0100

I reread the question, and I don't actually see the problem too much. If you plot the transformed dataset from most of your models, the problem is quite obvious : if you transform your predictor to the logscale, the result of a a linear regression on those outcomes will be naturally an exponential function. Exact the same thing happens if you use the poisson family in a glm.

Now, there is no reason whatsoever that any exponential model would fit the data, in contrary. The way your data is constructed, is essentially linear. Indeed, in the model, the E(Y) for predictors 1:40 is 1:40, and you would expect a model with intercept 0 and slope 1, whatever your variance does. Actually, the first model fits the data far better than the second one, contrary to your beliefs. On the second one, there is a clear trend in the residuals, indicating an obvious deviation from the true model.

Why model 7 fits equally well, is because it is essentially exactly the same as the classic linear model. The slight deviation comes from the fact that lm uses least squares estimates, and glm uses maximum likelihood.

The problem you create with heteroscedasticity is not one of fitting, but one of confidence intervals for your estimated parameters. So my suggestion to any student faced with this question would be to take a look at the estimates for the variance, and the derived confidence intervals, and try to find a more robust way of determining the confidence intervals on the estimated parameters. Robust regression might be an outcome here.

Kind regards
Joris

On Tue, Nov 25, 2008 at 3:52 PM, Christoph Scherber <Christoph.Scherber_at_agr.uni-goettingen.de> wrote:
> Dear all,
>
> For an introductory course on glm?s I would like to create an example to
> show the difference between glm and transformation of the response. For
> this, I tried to create a dataset where the variance increases with the mean
> (as is the case in many ecological datasets):
>
> poissondata=data.frame(
> response=rpois(40,1:40),
> explanatory=1:40)
>
> attach(poissondata)
>
> However, I have run into a problem because it looks like the lm model (with
> sqrt-transformation) fits the data best:
>
> ##
>
> model1=lm(response~explanatory,poissondata)
> model2=lm(sqrt(response+0.5)~explanatory,poissondata)
> model3=lm(log(response+1)~explanatory,poissondata)
> model4=glm(response~explanatory,poissondata,family=poisson)
> model5=glm(response~explanatory,poissondata,family=quasipoisson)
> model6=glm.nb(response~explanatory,poissondata)
>
>
> plot(explanatory,response,pch=16)
> lines(explanatory,predict(model1,explanatory=explanatory))
> lines(explanatory,(predict(model2,explanatory=explanatory))^2-0.5,lty=2)
> lines(explanatory,exp(predict(model3,explanatory=explanatory))-1,lty=3, col="green")
> lines(explanatory,exp(predict(model5,explanatory=explanatory)),lty=1,col="red")
> lines(explanatory,predict(model6,explanatory=explanatory,type="response"),lty=1,col="blue")
> lines(explanatory,predict(model7,explanatory=explanatory,type="response"),lty=1,col="green")
>
> ##
>
> The only model that performs equally well is model7.
>
> How would you deal with this kind of analysis? What would be your
> recommendation to the students, given the fact that most of the standard glm
> models obviously don?t seem to produce good fits here?
>
> Many thanks and best wishes
> Christoph
>
> (using R 2.8.0 on Windows XP)
>
>
>
>
>
> --
> Dr. rer.nat. Christoph Scherber
> University of Goettingen
> DNPW, Agroecology
> Waldweg 26
> D-37073 Goettingen
> Germany
>
> phone +49 (0)551 39 8807
> fax +49 (0)551 39 8806
>
> Homepage http://www.gwdg.de/~cscherb1
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help