From: Henric Nilsson <henric.nilsson_at_statisticon.se>

Date: Mon 03 Oct 2005 - 01:47:25 EST


Denis Chabot said the following on 2005-09-29 21:55:

> OK, I think I understand better but still have two points to clarify.

> The first one is about the number of df. I think those who replied on
> this objected to the way I chose df, not the fact that I would run a
> model with 7.4 df per se. If I read you correctly, I artificially
> reduce my p-value by using the estimated df found in a mgcv gam model
> into another model where I fix df. This is fine, I am quite willing not
> to run a second model with a fixed df and instead tell my readers that
> my model is "marginally significant" with a p-value of 0.03.
>
> This being said, do you know of guidelines for choosing df? A colleague
> told me he does not go above 10% of the number of points. Should I be
> concerned when mgcv estimates 7.4 df for 34 points? Note that for this
> particular model P < 1e-16, and P is also very small if I fix df to
> either 4 or 7.
>
> My second point is the difference between models fitted by packages gam
> and mgcv. Sure, some of you have said "different algorithms". And when

Maybe that wasn't well put. Think of it as different implementations. The packages `mgcv' and `gam' are by no means the only implementations of GAM; see e.g. the `gss' package.
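For instance, a binomial smooth can also be fitted with gss. The following is only a rough, untested sketch: the exact gssanova() arguments and the 0/1 coding of the response are my assumptions, so check ?gssanova before relying on it.

library(gss)                     # smoothing-spline ANOVA models, another GAM-like implementation
library(gam)                     # only for the kyphosis data used below
data(kyphosis)

## assumed 0/1 coding of the response; the gssanova() call itself is untested
kyphosis$kyp01 <- as.numeric(kyphosis$Kyphosis == "present")
kyp.gss <- gssanova(kyp01 ~ Number, family = "binomial", data = kyphosis)
summary(kyp.gss)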

> I specify dfs, shouldn't P-values be very similar for the 2 packages?

> If not, what does it say of the confidence we can have in the models?

Since there's no universally accepted definition of what a GAM is, you shouldn't expect the results to be the same.

> I draw your attention to this example: I obtained P-values of 0.50 and
> 0.03 with packages gam and mgcv respectively:

In this case, however, you're trying to compare apples to oranges...

> > library(gam)
> Loading required package: splines
> > data(kyphosis)
> > kyp1 <- gam(Kyphosis ~ s(Number, 3), family=binomial, data=kyphosis)
> > summary.gam(kyp1)
>
> Call: gam(formula = Kyphosis ~ s(Number, 3), family = binomial, data = kyphosis)
> Deviance Residuals:
>      Min      1Q  Median      3Q    Max
> -1.3646 -0.6233 -0.4853 -0.3133 2.0965
>
> (Dispersion Parameter for binomial family taken to be 1)
>
> Null Deviance: 83.2345 on 80 degrees of freedom
> Residual Deviance: 71.9973 on 76.9999 degrees of freedom
> AIC: 79.9976
>
> Number of Local Scoring Iterations: 7
>
> DF for Terms and Chi-squares for Nonparametric Effects
>
>              Df Npar Df Npar Chisq  P(Chi)
> (Intercept)   1
> s(Number, 3)  1       2    1.37149 0.50375

This test concerns only the non-linear part of the term s(Number, 3). In order to test the linear and non-linear parts simultaneously, as mgcv::summary.gam does, you'd do something like this:

> kyp1.1 <- gam(Kyphosis ~ 1, family=binomial, data=kyphosis)
> anova(kyp1.1, kyp1, test = "Chisq")

Analysis of Deviance Table

Model 1: Kyphosis ~ 1

Model 2: Kyphosis ~ s(Number, 3)

  Resid. Df Resid. Dev      Df Deviance P(>|Chi|)
1   80.0000     83.234
2   76.9999     71.997  3.0001   11.237     0.011
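(Purely as an untested sketch, and not something I have run: the analogous null-versus-full deviance comparison could also be set up for your mgcv fit kyp2, quoted further down. The model name kyp2.0 is hypothetical; mgcv's anova.gam accepts several fitted models.

## with mgcv attached and kyp2 fitted as in your quoted code below
kyp2.0 <- gam(Kyphosis ~ 1, family = binomial, data = kyphosis)  # intercept-only null model
anova(kyp2.0, kyp2, test = "Chisq")                              # whole-term deviance test

This tests the fixed-df smooth of Number against the intercept-only model.)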

HTH,

Henric

> > detach(package:gam)
> > library(mgcv)
> This is mgcv 1.3-7
> > kyp2 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial,
> data=kyphosis)
> > summary.gam(kyp2)
>
> Family: binomial
> Link function: logit
>
> Formula:
> Kyphosis ~ s(Number, k = 4, fx = T)
>
> Parametric coefficients:
>             Estimate Std. Error z value Pr(>|z|)
> (Intercept)  -1.5504     0.3342   -4.64 3.49e-06 ***
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> Approximate significance of smooth terms:
>           edf Est.rank Chi.sq p-value
> s(Number)   3        3  8.898  0.0307 *
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>
> R-sq.(adj) = 0.101   Deviance explained = 12.5%
> UBRE score = 0.075202   Scale est. = 1   n = 81
> > kyp2$deviance
> [1] 72.85893
> > kyp2$null.deviance
> [1] 83.23447
> > kyp2$df.null
> [1] 80
> > kyp2$df.residual
> [1] 77
>
> How can we explain this huge difference?
>
> Denis

>> On 2005-09-29 at 06:00, r-help-request@stat.math.ethz.ch wrote:
>>
>> From: Henric Nilsson <henric.nilsson@statisticon.se>
>> Date: 29 September 2005 03:55:19 EDT
>> To: ym@climpact.com
>> Cc: r-help@stat.math.ethz.ch
>> Subject: Re: [R] p-level in packages mgcv and gam
>> Reply-To: Henric Nilsson <henric.nilsson@statisticon.se>
>>
>> Yves Magliulo said the following on 2005-09-28 17:05:
>>
>>> hi, i'll try to help you, i send a mail about this subject last week... and i did not have any response... I'm using gam from package mgcv.
>>>
>>> 1) How to interpret the significance of smooth terms is hard for me to understand perfectly: using UBRE, you fix df. p-value are estimated by chi-sq distribution using GCV, the best df are estimated by GAM. (that's what i want) and p-values
>>
>> This is not correct. The df are estimated in both cases (i.e. UBRE and GCV), but the scale parameter is fixed in the UBRE case. Hence, by default UBRE is used for family = binomial or poisson since the scale parameter is assumed to be 1. Similarly, GCV is the default for family = gaussian since we most often want the scale (usually denoted sigma^2) to be estimated.
>>
>>> are estimated by an F distribution. But in that case they said "use at your own risk" in ?summary.gam
>>
>> The warning applies in both cases. The p-values are conditional on the smoothing parameters, and the uncertainty of the smooths is not taken into account when computing the p-values.
>>
>>> so you can also look at the chi.sq : but i don't know how to choose a
>>
>> No...
>>
>>> criterion like for p-values... for me, chi.sq show the best predictor in a model, but it's hard to reject one with it.
>>
>> Which version of mgcv do you use? The confusion probably stems from earlier versions of mgcv (< 1.3-5): the summary and anova methods used to have a column denoted Chi.sq even when the displayed statistic was computed as F. Recent versions of mgcv have
>>
>> > summary(b)
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> y ~ s(x0) + s(x1) + s(x2) + s(x3)
>>
>> Parametric coefficients:
>>             Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   7.9150     0.1049   75.44   <2e-16 ***
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Approximate significance of smooth terms:
>>         edf Est.rank      F  p-value
>> s(x0) 5.173    9.000  3.785 0.000137 ***
>> s(x1) 2.357    9.000 34.631  < 2e-16 ***
>> s(x2) 8.517    9.000 84.694  < 2e-16 ***
>> s(x3) 1.000    1.000  0.444 0.505797
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> R-sq.(adj) = 0.726   Deviance explained = 73.7%
>> GCV score = 4.611   Scale est. = 4.4029   n = 400
>>
>> If we assume that the scale is known and fixed at 4.4029, we get
>>
>> > summary(b, dispersion = 4.4029)
>>
>> Family: gaussian
>> Link function: identity
>>
>> Formula:
>> y ~ s(x0) + s(x1) + s(x2) + s(x3)
>>
>> Parametric coefficients:
>>             Estimate Std. Error z value Pr(>|z|)
>> (Intercept)   7.9150     0.1049   75.44   <2e-16 ***
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Approximate significance of smooth terms:
>>         edf Est.rank  Chi.sq p-value
>> s(x0) 5.173    9.000  34.067 8.7e-05 ***
>> s(x1) 2.357    9.000 311.679 < 2e-16 ***
>> s(x2) 8.517    9.000 762.255 < 2e-16 ***
>> s(x3) 1.000    1.000   0.444   0.505
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> R-sq.(adj) = 0.726   Deviance explained = 73.7%
>> GCV score = 4.611   Scale est. = 4.4029   n = 400
>>
>> Note that t/F changed into z/Chi.sq.
>>
>>> so as far as i m concerned, i use GCV methods, and fix a 5% on the null hypothesis (pvalue) to select significant predictor. after, i look at my smooth, and if the parametrization look fine to me, i validate. generaly, for p-values smaller than 0.001, you can be confident. over 0.001, you have to check.
>>>
>>> 2) for difference between package gam and mgcv, i sent a mail about this
>>
>> The underlying algorithms are very different.
>>
>> HTH,
>> Henric
>>
>> From: "Liaw, Andy" <andy_liaw@merck.com>
>> Date: 28 September 2005 14:01:25 EDT
>> To: 'Denis Chabot' <chabotd@globetrotter.net>, Peter Dalgaard <p.dalgaard@biostat.ku.dk>
>> Cc: Thomas Lumley <tlumley@u.washington.edu>, R list <r-help@stat.math.ethz.ch>
>> Subject: RE: [R] p-level in packages mgcv and gam
>>
>> Just change the df in what Thomas described to the degree of polynomial, and everything he said still applies. Any good book on regression that covers polynomial regression ought to point this out.
>>
>> Andy
>>
>>> From: Denis Chabot
>>>
>>> But what about another analogy, that of polynomials? You may not be sure what degree polynomial to use, and you have not decided before analysing your data. You fit different polynomials to your data, checking if added degrees increase r2 sufficiently by doing F-tests.
>>>
>>> I thought it was the same thing with GAMs. You can fit a model with 4 df, and in some cases it is of interest to see if this is a better fit than a linear fit. But why can't you also check if 7df is better than 4df? And if you used mgcv first and it tells you that 7df is better than 4df, why bother repeating the comparison 7df against 4df, why not just take the p-value for the model with 7df (fixed)?
>>>
>>> Denis

>> From: Peter Dalgaard <p.dalgaard@biostat.ku.dk>
>> Date: 28 September 2005 12:04:58 EDT
>> To: Thomas Lumley <tlumley@u.washington.edu>
>> Cc: Denis Chabot <chabotd@globetrotter.net>, R list <r-help@stat.math.ethz.ch>
>> Subject: Re: [R] p-level in packages mgcv and gam
>>
>> Thomas Lumley <tlumley@u.washington.edu> writes:
>>
>>> Bob, on the other hand, chooses the amount of smoothing depending on the data. When a 4 df smooth fits best he ends up with the same model as Alice and the same p-value. When some other df fits best he ends up with a different model and a *smaller* p-value than Alice.
>>
>> This doesn't actually follow, unless the p-value (directly or indirectly) found its way into the definition of "best fit". It does show the danger, though.
>>
>> --
>>    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
>> ~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)              FAX: (+45) 35327907

>> From: Thomas Lumley <tlumley@u.washington.edu>
>> Date: 26 September 2005 12:54:43 EDT
>> To: Denis Chabot <chabotd@globetrotter.net>
>> Cc: r-help@stat.math.ethz.ch
>> Subject: Re: [R] p-level in packages mgcv and gam
>>
>> On Mon, 26 Sep 2005, Denis Chabot wrote:
>>
>> But the mgcv manual warns that p-level for the smooth can be underestimated when df are estimated by the model. Most of the time my p-levels are so small that even doubling them would not result in a value close to the P=0.05 threshold, but I have one case with P=0.033.
>>
>> I thought, probably naively, that running a second model with fixed df, using the value of df found in the first model. I could not achieve this with mgcv: its gam function does not seem to accept fractional values of df (in my case 8.377).
>>
>> No, this won't work. The problem is the usual one with model selection: the p-value is calculated as if the df had been fixed, when really it was estimated.
>>
>> It is likely to be quite hard to get an honest p-value out of something that does adaptive smoothing.
>>
>> -thomas
>>
>> From: Thomas Lumley <tlumley@u.washington.edu>
>> Date: 28 September 2005 11:33:27 EDT
>> To: Denis Chabot <chabotd@globetrotter.net>
>> Cc: R list <r-help@stat.math.ethz.ch>
>> Subject: Re: [R] p-level in packages mgcv and gam
>>
>> On Wed, 28 Sep 2005, Denis Chabot wrote:
>>
>> I only got one reply to my message:
>>
>> No, this won't work. The problem is the usual one with model selection: the p-value is calculated as if the df had been fixed, when really it was estimated.
>>
>> It is likely to be quite hard to get an honest p-value out of something that does adaptive smoothing.
>>
>> -thomas
>>
>> I do not understand this: it seems that a lot of people chose df=4 for no particular reason, but p-levels are correct. If instead I choose df=8 because a previous model has estimated this to be an optimal df, P-levels are no good because df are estimated?
>>
>> Yes. I know this sounds strange initially, but it really does make sense if you think about it carefully.
>>
>> Suppose that Alice and Bob are kyphosis researchers, and that Alice always chooses 4df for smoothing Age. We would all agree that her p-values are correct [in fact we wouldn't, but that is a separate issue]
>>
>> Bob, on the other hand, chooses the amount of smoothing depending on the data. When a 4 df smooth fits best he ends up with the same model as Alice and the same p-value. When some other df fits best he ends up with a different model and a *smaller* p-value than Alice.
>>
>> In particular, this is still true under the null hypothesis that Age has no effect [If Alice and Bob are interested in p-values, the null hypothesis must be plausible.]
>>
>> If Bob's p-values are always less than or equal to Alice's p-values under the null hypothesis, and Alice's p-values are less than 0.05 5% of the time, then Bob's p-values are less than 0.05 more than 5% of the time.
>>
>> -thomas
>>
>> Furthermore, shouldn't packages gam and mgcv give similar results when the same data and df are used? I tried this:
>>
>> library(gam)
>> data(kyphosis)
>> kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)
>> kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)
>> kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)
>> anova.gam(kyp1)
>> anova.gam(kyp2)
>> anova.gam(kyp3)
>>
>> detach(package:gam)
>> library(mgcv)
>> kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=T), family=binomial, data=kyphosis)
>> kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial, data=kyphosis)
>> kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=T), family=binomial, data=kyphosis)
>> anova.gam(kyp4)
>> anova.gam(kyp5)
>> anova.gam(kyp6)
>>
>> P levels for these models, by pair:
>>
>> kyp1 vs kyp4: p= 0.083 and 0.068 respectively (not too bad)
>> kyp2 vs kyp5: p= 0.445 and 0.03 (wow!)
>> kyp3 vs kyp6: p= 0.053 and 0.008 (wow again)
>>
>> Also if you plot all these you find that the mgcv plots are smoother than the gam plots, even the same df are used all the time.
>>
>> I am really confused now!
>>
>> Denis

>> Thomas Lumley            Assoc. Professor, Biostatistics
>> tlumley@u.washington.edu University of Washington, Seattle
>>
>> From: Thomas Lumley <tlumley@u.washington.edu>
>> Date: 28 September 2005 14:35:26 EDT
>> To: Denis Chabot <chabotd@globetrotter.net>
>> Cc: Peter Dalgaard <p.dalgaard@biostat.ku.dk>, R list <r-help@stat.math.ethz.ch>
>> Subject: Re: [R] p-level in packages mgcv and gam
>>
>> On Wed, 28 Sep 2005, Denis Chabot wrote:
>>
>> But what about another analogy, that of polynomials? You may not be sure what degree polynomial to use, and you have not decided before analysing your data. You fit different polynomials to your data, checking if added degrees increase r2 sufficiently by doing F-tests.
>>
>> Yes, you can. And this procedure gives you incorrect p-values.
>>
>> They may not be very incorrect -- it depends on how much model selection you do, and how strongly the feature you are selecting on is related to the one you are testing.
>>
>> For example, using step() to choose a polynomial in x even when x is unrelated to y and z inflates the Type I error rate by giving a biased estimate of the residual mean squared error:
>>
>> once <- function(){
>>   y <- rnorm(50); x <- runif(50); z <- rep(0:1, 25)
>>   summary(step(lm(y ~ z),
>>                scope = list(lower = ~z, upper = ~z+x+I(x^2)+I(x^3)+I(x^4)),
>>                trace = 0))$coef["z", 4]
>> }
>> p <- replicate(1000, once())
>> mean(p < 0.05)
>> [1] 0.072
>>
>> which is significantly higher than you would expect for an honest level 0.05 test.
>>
>> -thomas


R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Mon Oct 03 01:55:24 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:40:34 EST