Re: [R] {Spam?} Re: mgcv, testing gamm vs lme, which degrees of freedom?

From: Joris Meys <jorismeys_at_gmail.com>
Date: Mon, 21 Jun 2010 21:08:48 +0200

Hi Carlo,

You should get the book of Simon Wood and read it thoroughly. It's all explained in there, but it will lead me too far to copy it all in a mail. In short : random effects are part of the error structure of the model, not of the model itself. They're added to correct the error on the parameters of the fixed model, and are inherent to the data structure and not to the hypotheses. Hence, you rarely test their significance.

Cheers
Joris

On Mon, Jun 21, 2010 at 6:54 PM, Carlo Fezzi <c.fezzi_at_uea.ac.uk> wrote:
> Hi Joris (CC Simon),
>
> Thanks for your kind replies and for being so responsive.
>
> I think this post boils down to two main questions (which I feel are very
> important for gams modelling):
>
> 1- Is it appropriate to use LR tests in "gamm" to test model reduction?
> 2- If yes, which degrees of freedom should be used?
>
> I do not think we should always use the df from "model$lme". For example,
> compare the two models (again my first example for the data generation):
>
> f1 <- gamm(y ~ s(x, k=2, bs="cr"), random = list(id=~1), method="ML" )
> f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML" )
>
> The difference between the two models is in the random effects. Model "f2"
> has, if I interpreet correctly the output, 7 random effects more than the
> model "f1", but the fixed effects are the same. So the H0 = "the 7 random
> effect are not significant". In this case the (app.) likelihood ratio test
> should have 7 df... is my interpretation correct?
>
> On the other hand, to compare the following models:
>
> f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML" )
> f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML" )
>
> Model "f3" has 1 more fixed effect than model "f2", but model "f2" has 7
> more random effects... again, if I understand correctly the output. In
> this case I don't know if we can do a LR test, the model are not strictly
> nested I think...
>
> What do you think?
>
> Again many thanks,
>
> Carlo
>
>
>
>> I don't use an LR test for non-nested models, as I fail to formulate a
>> sensible null hypothesis for such tests. Again, everything I write is
>> a personal opinion, and inference in the case of these models is still
>> subject of discussion to date. If you find a plausible way for
>> explaining the result, by all means use the LR test.
>>
>> Personally, I'd go for the AIC / BIC, but these are based on the
>> likelihood themselves. So in the case where the effective complexity
>> of the model appears the same, they're completely equivalent to the
>> likelihood. It's just the inference (i.e. the p-value) I don't trust.
>> But then again, I'm a cautious statistician. If I'm not sure about a
>> method, I'd rather don't use it and go with what I know. In my view,
>> there is not one correct method for a particular problem and/or
>> dataset. Every method makes assumptions and has shortcomings. Only if
>> I know which ones, I can take them into account when interpreting the
>> results.
>>
>> It also depends on the focus as well. If the focus is prediction, you
>> might even want to consider testing whether the variance of the
>> residuals differs significantly with a simple F-test. This indicates
>> whether the predictive power differs significantly between the models.
>> But these tests tend to get very sensitive when you have many
>> datapoints, rendering them practically useless again.
>>
>> So in the end, it always boils down to interpretation.
>>
>> Cheers
>> Joris
>>
>> On Fri, Jun 18, 2010 at 10:29 PM, Carlo Fezzi <c.fezzi_at_uea.ac.uk> wrote:
>>> Thanks Joris,
>>>
>>> I understand your point regarding the need for the two models to be
>>> nested. So, according to your in the example case the LR test is not
>>> appropriate and the two model should be compared with other criteria
>>> such
>>> as AIC or BIC for example.
>>>
>>> On the other hand, Simon Wood indicated that such a LR test is
>>> (approximately) correct in his previous reply... a am bit confused,
>>> which
>>> is the correct approach to test the two models? Is the LR test correct
>>> only if the parametric model is linear in the x variables maybe? In this
>>> case, which is the best appraoch to compare a "gamm" vs a "lme" with
>>> quadratic specification?
>>>
>>> Best wishes,
>>>
>>> Carlo
>>>
>>>> Just realized something: You should take into account that the LR test
>>>> is actually only valid for _nested_ models. Your models are not
>>>> nested. Hence, you shouldn't use the anova function to compare them,
>>>> and you shouldn't compare the df. In fact, if you're interested in the
>>>> contribution of a term, then using anova to compare the model with
>>>> that term and without that term gives you an answer on the hypothesis
>>>> whether that term with spline contributes significantly to the model.
>>>>
>>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")
>>>>
>>>>> f3 <- gamm(y ~ x, random = list(id=~1), method="ML" )
>>>>
>>>>> f4 <- gamm(y ~ 1, random = list(id=~1), method="ML" )
>>>>
>>>>> anova(f3$lme,f2$lme)
>>>>        Model df AIC BIC logLik   Test L.Ratio p-value
>>>> f3$lme     1  4 760 770   -376
>>>> f2$lme     2  5 381 394   -186 1 vs 2     380  <.0001
>>>>
>>>>> anova(f4$lme,f2$lme)
>>>>        Model df AIC BIC logLik   Test L.Ratio p-value
>>>> f4$lme     1  3 945 953   -470
>>>> f2$lme     2  5 381 394   -186 1 vs 2     568  <.0001
>>>>
>>>>> anova(f3$lme,f4$lme)
>>>>        Model df AIC BIC logLik   Test L.Ratio p-value
>>>> f3$lme     1  4 760 770   -376
>>>> f4$lme     2  3 945 953   -470 1 vs 2     188  <.0001
>>>>
>>>> This is the correct application of a likelihood ratio test. You see
>>>> that adding the spline increases the df with 1 compared to the linear
>>>> model, as part of the spline gets into the random component. Notice as
>>>> well that the interpretation of a test in case of a random component
>>>> is not the same as in case of a fixed component. If I understood
>>>> correctly, this LR test specifically says something over the effect of
>>>> X, without being interested in the shape of the spline. The
>>>> "significance of a spline" is a difficult concept anyway, as a spline
>>>> can be seen as a form of local regression. It's exactly the use of the
>>>> randomization that allows for a general hypothesis about the added
>>>> value of the spline, without focusing on its actual shape. Hence the
>>>> "freedom" connected to that actual shape should not be used in the df
>>>> used to test the general hypothesis.
>>>>
>>>> Hope this makes sense someway...
>>>>
>>>> Cheers
>>>> Joris
>>>>
>>>>
>>>> On Fri, Jun 18, 2010 at 6:27 PM, Carlo Fezzi <c.fezzi_at_uea.ac.uk> wrote:
>>>>> Dear Simon,
>>>>>
>>>>> thanks a lot for your prompt reply.
>>>>>
>>>>> Unfortunately I am still confused about which is the correct way to
>>>>> test
>>>>> the two models... as you point out: why in my example the two models
>>>>> have
>>>>> the same degrees of freedom?
>>>>>
>>>>> Intuitively it seems to me the gamm model is more flexible since, as I
>>>>> understand also from you response, it should contain more random
>>>>> effects
>>>>> than the other model because some of the smooth function parameters
>>>>> are
>>>>> represented as such. This should not be taken into account when
>>>>> testing
>>>>> one model vs the other?
>>>>>
>>>>> Continuing with my example, the two models:
>>>>>
>>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")
>>>>> f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML" )
>>>>>
>>>>> Can be tested with:
>>>>>
>>>>> anova(f3$lme,f2$lme)
>>>>>
>>>>> But why are the df the same? Model f2 appears to be more flexible and,
>>>>> as
>>>>> such, should have more (random) parameters. Should not a test of one
>>>>> model
>>>>> vs the other take this into account?
>>>>>
>>>>> Sorry if this may sound dull, many thanks for your help,
>>>>>
>>>>> Carlo
>>>>>
>>>>>
>>>>>
>>>>>> On Wednesday 16 June 2010 20:33, Carlo Fezzi wrote:
>>>>>>> Dear all,
>>>>>>>
>>>>>>> I am using the "mgcv" package by Simon Wood to estimate an additive
>>>>>>> mixed
>>>>>>> model in which I assume normal distribution for the residuals. I
>>>>>>> would
>>>>>>> like to test this model vs a standard parametric mixed model, such
>>>>>>> as
>>>>>>> the
>>>>>>> ones which are possible to estimate with "lme".
>>>>>>>
>>>>>>> Since the smoothing splines can be written as random effects, is it
>>>>>>> correct to use an (approximate) likelihood ratio test for this?
>>>>>> -- yes this is ok (subject to the usual caveats about testing on the
>>>>>> boundary
>>>>>> of the parameter space) but your 2 example models below will have
>>>>>>  the
>>>>>> same
>>>>>> number of degrees of freedom!
>>>>>>
>>>>>>> If so,
>>>>>>> which is the correct number of degrees of freedom?
>>>>>> --- The edf from the lme object, if you are testing using the log
>>>>>> likelihood
>>>>>> returned by the  lme representation of the model.
>>>>>>
>>>>>>> Sometime the function
>>>>>>> LogLik() seems to provide strange results regarding the number of
>>>>>>> degrees
>>>>>>> of freedom (df) for the gam, for instance in the example I copied
>>>>>>> below
>>>>>>> the df for the "gamm" are equal to the ones for the "lme", but the
>>>>>>> summary(model.gam) seems to indicate a much higher edf for the gamm.
>>>>>> --- the edf for the lme representation of the model counts only the
>>>>>> fixed
>>>>>> effects + the variance parameters (which includes smoothing
>>>>>> parameters).
>>>>>> Each
>>>>>> smooth typically contributes only one or two fixed effect parameters,
>>>>>> with
>>>>>> the rest of the coefficients for the smooth treated as random
>>>>>> effects.
>>>>>>
>>>>>> --- the edf for the gam representation of the same model differs in
>>>>>> that
>>>>>> it
>>>>>> also counts the *effective* number of parameters used to represent
>>>>>> each
>>>>>> smooth: this includes contributions from all those coefficients that
>>>>>> the
>>>>>> lme
>>>>>> representation treated as strictly random.
>>>>>>
>>>>>> best,
>>>>>> Simon
>>>>>>
>>>>>>
>>>>>>> I would be very grateful to anybody who could point out a solution,
>>>>>>>
>>>>>>> Best wishes,
>>>>>>>
>>>>>>> Carlo
>>>>>>>
>>>>>>> Example below:
>>>>>>>
>>>>>>> ----
>>>>>>>
>>>>>>> rm(list = ls())
>>>>>>> library(mgcv)
>>>>>>> library(nlme)
>>>>>>>
>>>>>>> set.seed(123)
>>>>>>>
>>>>>>> x  <- runif(100,1,10)                                # regressor
>>>>>>> b0 <- rep(rnorm(10,mean=1,sd=2),each=10)     # random intercept
>>>>>>> id <- rep(1:10, each=10)                     # identifier
>>>>>>>
>>>>>>> y <- b0 + x - 0.1 * x^3 + rnorm(100,0,1)  # dependent variable
>>>>>>>
>>>>>>> f1 <- lme(y ~ x + I(x^2), random = list(id=~1) , method="ML" )  #
>>>>>>> lme
>>>>>>> model
>>>>>>>
>>>>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML" )    # gamm
>>>>>>>
>>>>>>> ## same number of "df" according to logLik:
>>>>>>> logLik(f1)
>>>>>>> logLik(f2$lme)
>>>>>>>
>>>>>>> ## much higher edf according to summary:
>>>>>>> summary(f2$gam)
>>>>>>>
>>>>>>> -----------
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help_at_r-project.org mailing list
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide
>>>>>>> http://www.R-project.org/posting-guide.html and provide commented,
>>>>>>> minimal,
>>>>>>> self-contained, reproducible code.
>>>>>>
>>>>>> --
>>>>>>> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY
>>>>>>> UK
>>>>>>> +44 1225 386603  www.maths.bath.ac.uk/~sw283
>>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-help_at_r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Joris Meys
>>>> Statistical consultant
>>>>
>>>> Ghent University
>>>> Faculty of Bioscience Engineering
>>>> Department of Applied mathematics, biometrics and process control
>>>>
>>>> tel : +32 9 264 59 87
>>>> Joris.Meys_at_Ugent.be
>>>> -------------------------------
>>>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Applied mathematics, biometrics and process control
>>
>> tel : +32 9 264 59 87
>> Joris.Meys_at_Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>>
>
>
>

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys_at_Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Mon 21 Jun 2010 - 19:11:11 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 21 Jun 2010 - 20:32:06 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive