From: Denis Chabot <chabotd_at_globetrotter.net>

Date: Fri 30 Sep 2005 - 05:55:35 EST

Date: Fri 30 Sep 2005 - 05:55:35 EST

OK, I think I understand better but still have two points to clarify.

My second point is the difference between models fitted by packages gam and mgcv. Sure, some of you have said "different algorithms". And when I specify dfs, shouldn't P-values be very similar for the 2 packages? If not, what does it say of the confidence we can have in the models?

> library(gam)

Loading required package: splines

> data(kyphosis) > kyp1 <- gam(Kyphosis ~ s(Number, 3), family=binomial, data=kyphosis) > summary.gam(kyp1)

Call: gam(formula = Kyphosis ~ s(Number, 3), family = binomial, data
= kyphosis)

Deviance Residuals:

Min 1Q Median 3Q Max -1.3646 -0.6233 -0.4853 -0.3133 2.0965

(Dispersion Parameter for binomial family taken to be 1)

Null Deviance: 83.2345 on 80 degrees of freedom
Residual Deviance: 71.9973 on 76.9999 degrees of freedom
**AIC: 79.9976
**
Number of Local Scoring Iterations: 7

DF for Terms and Chi-squares for Nonparametric Effects

Df Npar Df Npar Chisq P(Chi)
(Intercept) 1

s(Number, 3) 1 2 1.37149 0.50375

> detach(package:gam)

> library(mgcv)

This is mgcv 1.3-7

> kyp2 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial,
data=kyphosis)

> summary.gam(kyp2)

Family: binomial

Link function: logit

Formula:

Kyphosis ~ s(Number, k = 4, fx = T)

Parametric coefficients:

Estimate Std. Error z value Pr(>|z|) (Intercept) -1.5504 0.3342 -4.64 3.49e-06 ***

--- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Approximate significance of smooth terms: edf Est.rank Chi.sq p-value s(Number) 3 3 8.898 0.0307 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 R-sq.(adj) = 0.101 Deviance explained = 12.5% UBRE score = 0.075202 Scale est. = 1 n = 81 > kyp2$deviance [1] 72.85893 > kyp2$null.deviance [1] 83.23447 > kyp2$df.null [1] 80 > kyp2$df.residual [1] 77 How can we explain this huge difference? DenisReceived on Fri Sep 30 06:07:20 2005

> Le 05-09-29 à 06:00, r-help-request@stat.math.ethz.ch a écrit :>> De : Henric Nilsson <henric.nilsson@statisticon.se>> Date : 29 septembre 2005 03:55:19 HAE> À : ym@climpact.com> Cc : r-help@stat.math.ethz.ch> Objet : Rép : [R] p-level in packages mgcv and gam> Répondre à : Henric Nilsson <henric.nilsson@statisticon.se>>>

> Yves Magliulo said the following on 2005-09-28 17:05:

>>> hi,>> i'll try to help you, i send a mail about this subject last>> week... and>> i did not have any response...>> I'm using gam from package mgcv. 1)>> How to interpret the significance of smooth terms is hard for me to>> understand perfectly : using UBRE, you fix df. p-value are>> estimated by chi-sq distribution using GCV, the best df are>> estimated by GAM. (that's what i want) and>> p-values>>> This is not correct. The df are estimated in both cases (i.e. UBRE> and GCV), but the scale parameter is fixed in the UBRE case. Hence,> by default UBRE is used for family = binomial or poisson since the> scale parameter is assumed to be 1. Similarly, GCV is the default> for family = gaussian since we most often want the scale (usually> denoted sigma^2) to be estimated.>>>> are estimated by an F distribution But in that case they said "use at>> your own risk" in ?summary.gam>>> The warning applies in both cases. The p-values are conditional on> the smoothing parameters, and the uncertainty of the smooths is not> taken into account when computing the p-values.>>>> so you can also look at the chi.sq : but i don't know how to choose a>>> No...>>>> criterion like for p-values... for me, chi.sq show the best>> predictor in>> a model, but it's hard to reject one with it.>>> Which version of mgcv do you use? The confusion probably stems from> earlier versions of mgcv (< 1.3-5): the summary and anova methods> used to have a column denoted Chi.sq even when the displayed> statistic was computed as F. Recent versions of mgcv has>> > summary(b)>> Family: gaussian> Link function: identity>> Formula:> y ~ s(x0) + s(x1) + s(x2) + s(x3)>> Parametric coefficients:> Estimate Std. Error t value Pr(>|t|)> (Intercept) 7.9150 0.1049 75.44 <2e-16 ***> ---> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1>> Approximate significance of smooth terms:> edf Est.rank F p-value> s(x0) 5.173 9.000 3.785 0.000137 ***> s(x1) 2.357 9.000 34.631 < 2e-16 ***> s(x2) 8.517 9.000 84.694 < 2e-16 ***> s(x3) 1.000 1.000 0.444 0.505797> ---> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1>> R-sq.(adj) = 0.726 Deviance explained = 73.7%> GCV score = 4.611 Scale est. = 4.4029 n = 400>>> If we assume that the scale is known and fixed at 4.4029, we get>> > summary(b, dispersion = 4.4029)>> Family: gaussian> Link function: identity>> Formula:> y ~ s(x0) + s(x1) + s(x2) + s(x3)>> Parametric coefficients:> Estimate Std. Error z value Pr(>|z|)> (Intercept) 7.9150 0.1049 75.44 <2e-16 ***> ---> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1>> Approximate significance of smooth terms:> edf Est.rank Chi.sq p-value> s(x0) 5.173 9.000 34.067 8.7e-05 ***> s(x1) 2.357 9.000 311.679 < 2e-16 ***> s(x2) 8.517 9.000 762.255 < 2e-16 ***> s(x3) 1.000 1.000 0.444 0.505> ---> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1>> R-sq.(adj) = 0.726 Deviance explained = 73.7%> GCV score = 4.611 Scale est. = 4.4029 n = 400>> Note that t/F changed into z/Chi.sq.>>>> so as far as i m concerned, i use GCV methods, and fix a 5% on the>> null>> hypothesis (pvalue) to select significant predictor. after, i look>> at my>> smooth, and if the parametrization look fine to me, i validate.>> generaly, for p-values smaller than 0.001, you can be confident. over>> 0.001, you have to check. 2)>> for difference between package gam and mgcv, i sent a mail about this>>> The underlying algorithms are very different.>>> HTH,> Henric>>> De : "Liaw, Andy" <andy_liaw@merck.com>> Date : 28 septembre 2005 14:01:25 HAE> À : 'Denis Chabot' <chabotd@globetrotter.net>, Peter Dalgaard> <p.dalgaard@biostat.ku.dk>> Cc : Thomas Lumley <tlumley@u.washington.edu>, R list <r-> help@stat.math.ethz.ch>> Objet : RE: [R] p-level in packages mgcv and gam>> Just change the df in what Thomas described to the degree of> polynomial, and> everything he said still applies. Any good book on regression that> covers> polynomial regression ought to point this out.>> Andy>>>> From: Denis Chabot>>>> But what about another analogy, that of polynomials? You may not be>> sure what degree polynomial to use, and you have not decided before>> analysing your data. You fit different polynomials to your data,>> checking if added degrees increase r2 sufficiently by doing F-tests.>>>> I thought it was the same thing with GAMs. You can fit a>> model with 4>> df, and in some cases it is of interest to see if this is a better>> fit than a linear fit. But why can't you also check if 7df is better>> than 4df? And if you used mgcv first and it tells you that 7df is>> better than 4df, why bother repeating the comparison 7df>> against 4df,>> why not just take the p-value for the model with 7df (fixed)?>>>> Denis>

> De : Peter Dalgaard <p.dalgaard@biostat.ku.dk>> Date : 28 septembre 2005 12:04:58 HAE> À : Thomas Lumley <tlumley@u.washington.edu>> Cc : Denis Chabot <chabotd@globetrotter.net>, R list <r-> help@stat.math.ethz.ch>> Objet : Rép : [R] p-level in packages mgcv and gam>

> Thomas Lumley <tlumley@u.washington.edu> writes:

>>>> Bob, on the other hand, chooses the amount of smoothing depending on>> the data. When a 4 df smooth fits best he ends up with the same model>> as Alice and the same p-value. When some other df fits best he ends>> up with a different model and a *smaller* p-value than Alice.>> This doesn't actually follow, unless the p-value (directly or> indirectly) found its way into the definition of "best fit". It does> show the danger, though.>> --> O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B> c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K> (*) \(*) -- University of Copenhagen Denmark Ph: (+45)> 35327918> ~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk) FAX: (+45)> 35327907>

> De : Thomas Lumley <tlumley@u.washington.edu>> Date : 26 septembre 2005 12:54:43 HAE> À : Denis Chabot <chabotd@globetrotter.net>> Cc : r-help@stat.math.ethz.ch> Objet : Rép : [R] p-level in packages mgcv and gam>

> On Mon, 26 Sep 2005, Denis Chabot wrote:

>> But the mgcv manual warns that p-level for the smooth can be> underestimated when df are estimated by the model. Most of the time> my p-levels are so small that even doubling them would not result in> a value close to the P=0.05 threshold, but I have one case with> P=0.033.>> I thought, probably naively, that running a second model with fixed> df, using the value of df found in the first model. I could not> achieve this with mgcv: its gam function does not seem to accept> fractional values of df (in my case 8.377).>> No, this won't work. The problem is the usual one with model> selection: the p-value is calculated as if the df had been fixed,> when really it was estimated.>> It is likely to be quite hard to get an honest p-value out of> something that does adaptive smoothing.>> -thomas> De : Thomas Lumley <tlumley@u.washington.edu>> Date : 28 septembre 2005 11:33:27 HAE> À : Denis Chabot <chabotd@globetrotter.net>> Cc : R list <r-help@stat.math.ethz.ch>> Objet : Rép : [R] p-level in packages mgcv and gam>> On Wed, 28 Sep 2005, Denis Chabot wrote:>> I only got one reply to my message:>> No, this won't work. The problem is the usual one with model> selection: the p-value is calculated as if the df had been fixed,> when really it was estimated.>> It is likely to be quite hard to get an honest p-value out of> something that does adaptive smoothing.>> -thomas>> I do not understand this: it seems that a lot of people chose df=4> for no particular reason, but p-levels are correct. If instead I> choose df=8 because a previous model has estimated this to be an> optimal df, P-levels are no good because df are estimated?>> Yes. I know this sounds strange initially, but it really does make> sense if you think about it carefully.>> Suppose that Alice and Bob are kyphosis researchers, and that Alice> always chooses 4df for smoothing Age. We would all agree that her> p-values are correct [in fact we wouldn't, but that is a separate> issue]>> Bob, on the other hand, chooses the amount of smoothing depending> on the data. When a 4 df smooth fits best he ends up with the same> model as Alice and the same p-value. When some other df fits best> he ends up with a different model and a *smaller* p-value than Alice.>> In particular, this is still true under the null hypothesis that> Age has no effect [If Alice and Bob are interested in p-values, the> null hypothesis must be plausible.]>> If Bob's p-values are always less than or equal to Alice's p-values> under the null hypothesis, and Alice's p-values are less than 0.05> 5% of the time, then Bob's p-values are less than 0.05 more than 5%> of the time.>>> -thomas>>> Furthermore, shouldn't packages gam and mgcv give similar results> when the same data and df are used? I tried this:>> library(gam)> data(kyphosis)> kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)> kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)> kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)> anova.gam(kyp1)> anova.gam(kyp2)> anova.gam(kyp3)>> detach(package:gam)> library(mgcv)> kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=T), family=binomial,> data=kyphosis)> kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial,> data=kyphosis)> kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=T), family=binomial,> data=kyphosis)> anova.gam(kyp4)> anova.gam(kyp5)> anova.gam(kyp6)>>> P levels for these models, by pair>> kyp1 vs kyp4: p= 0.083 and 0.068 respectively (not too bad)> kyp2 vs kyp5: p= 0.445 and 0.03 (wow!)> kyp3 vs kyp6: p= 0.053 and 0.008 (wow again)>> Also if you plot all these you find that the mgcv plots are smoother> than the gam plots, even the same df are used all the time.>> I am really confused now!>> Denis>

>

> Thomas Lumley Assoc. Professor, Biostatistics

> tlumley@u.washington.edu University of Washington, SeattleDe :> Thomas Lumley <tlumley@u.washington.edu>> Date : 28 septembre 2005 14:35:26 HAE> À : Denis Chabot <chabotd@globetrotter.net>> Cc : Peter Dalgaard <p.dalgaard@biostat.ku.dk>, R list <r-> help@stat.math.ethz.ch>> Objet : Rép : [R] p-level in packages mgcv and gam>> On Wed, 28 Sep 2005, Denis Chabot wrote:>> But what about another analogy, that of polynomials? You may not be> sure what degree polynomial to use, and you have not decided before> analysing your data. You fit different polynomials to your data,> checking if added degrees increase r2 sufficiently by doing F-tests.>> Yes, you can. And this procedure gives you incorrect p-values.>> They may not be very incorrect -- it depends on how much model> selection you do, and how strongly the feature you are selecting on> is related to the one you are testing.>> For example, using step() to choose a polynomial in x even when x> is unrelated to y and z inflates the Type I error rate by giving a> biased estimate of the residual mean squared error:>> once<-function(){> y<-rnorm(50);x<-runif(50);z<-rep(0:1,25)> summary(step(lm(y~z),> scope=list(lower=~z,upper=~z+x+I(x^2)+I(x^3)+I(x^4)),> trace=0))$coef["z",4]> }> p<-replicate(1000,once())> mean(p<0.05)> [1] 0.072>> which is significantly higher than you would expect for an honest> level 0.05 test.>> -thomas

______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

*
This archive was generated by hypermail 2.1.8
: Fri 03 Mar 2006 - 03:40:34 EST
*