From: <khosoda_at_med.kobe-u.ac.jp>

Date: Tue, 26 Apr 2011 23:53:10 +0900

Date: Tue, 26 Apr 2011 23:53:10 +0900

Thank you for you quick reply, Prof. Harrell. According to your advice, I ran pentrace using a very wide range.

> pentrace.x6factor <- pentrace(x6factor.lrm, seq(0, 100, by=0.5)) > plot(pentrace.x6factor)

> pentrace.x6factor <- pentrace(x6factor.lrm, seq(0, 10, by=0.05))

It seems reasonable that the best penalty is 2.55.

[,1] [,2] [,3] Intercept -4.32434556 -3.86816460 0.456180958 stenosis -0.01496757 -0.01091755 0.004050025 T1 3.04248257 2.42443034 0.618052225 T2 -0.75335619 -0.57194342 0.181412767 procedure -1.20847252 -0.82589263 0.382579892ClinicalScore 0.37623189 0.30524628 0.070985611

> validate(x6factor.lrm, bw=F, B=200)

index.orig training test optimism index.corrected n Dxy 0.6324 0.6849 0.5955 0.0894 0.5430 200 R2 0.3668 0.4220 0.3231 0.0989 0.2679 200 Intercept 0.0000 0.0000 -0.1924 0.1924 -0.1924 200 Slope 1.0000 1.0000 0.7796 0.2204 0.7796 200 Emax 0.0000 0.0000 0.0915 0.0915 0.0915 200 D 0.2716 0.3229 0.2339 0.0890 0.1826 200 U -0.0192 -0.0192 0.0243 -0.0436 0.0243 200 Q 0.2908 0.3422 0.2096 0.1325 0.1582 200 B 0.1272 0.1171 0.1357 -0.0186 0.1457 200 g 1.6328 1.9879 1.4940 0.4939 1.1389 200 gp 0.2367 0.2502 0.2216 0.0286 0.2080 200 > validate(x6factor.lrm.pen, bw=F, B=200) index.orig training test optimism index.corrected n Dxy 0.6375 0.6857 0.6024 0.0833 0.5542 200 R2 0.3145 0.3488 0.3267 0.0221 0.2924 200 Intercept 0.0000 0.0000 0.0882 -0.0882 0.0882 200 Slope 1.0000 1.0000 1.0923 -0.0923 1.0923 200 Emax 0.0000 0.0000 0.0340 0.0340 0.0340 200 D 0.2612 0.2571 0.2370 0.0201 0.2411 200 U -0.0192 -0.0192 -0.0047 -0.0145 -0.0047 200 Q 0.2805 0.2763 0.2417 0.0346 0.2458 200 B 0.1292 0.1224 0.1355 -0.0132 0.1423 200 g 1.2704 1.3917 1.5019 -0.1102 1.3805 200 gp 0.2020 0.2091 0.2229 -0.0138 0.2158 200

In the penalized model (x6factor.lrm.pen), the apparent Dxy is 0.64, and bias-corrected Dxy is 0.55. The maximum absolute error is estimated to be 0.034, smaller than non-penalized model (0.0915 in x6factor.lrm) The changes in slope and intercept are substantially reduced in penalized model. I think overfitting is improved at least to some extent. Should I select this as a final model?

I have one more question. The "procedure" variable was defined as 0/1 value in the previous mail. For some graphical reason, I redefined it as treat1/treat2 value. Then, the best penalty value was changed from 3.05 to 2.55. I guess change from numeric to factorial caused this reduction in penalty. Which set up should I select?

I appreciate your help in advance.

-- KH (11/04/26 0:21), Frank Harrell wrote:Received on Tue 26 Apr 2011 - 16:13:36 GMT

> You've done a lot of good work on this. Yes I would say you have moderate

> overfitting with the first model. The only thing that saved you from having> severe overfitting is that there seems to be a signal present [I am assume> this model is truly pre-specified and was not developed at all by looking at> patterns of responses Y.]>> The use of backwards stepdown demonstrated much worse overfitting. This is> in line with what we know about the damage of stepwise selection methods> that do not incorporate shrinkage. I would throw away the stepwise> regression model. You'll find that the model selected is entirely> arbitrary. And you can't use the "selected" variables in any re-fit of the> model, i.e., you can't use lrm pretending that the two remaining variables> were pre-specified. Stepwise regression methods only seem to help. When> assessed properly we see that is an illusion.>> You are using penalizing properly but you did not print the full table of> penalties vs. effective AIC. We don't have faith that you penalized enough.> I tend to run pentrace using a very wide range of possible penalties to make> sure I've found the global optimum.>> Penalization somewhat solves the EPV problem but there is no substitute for> getting more data.>> You can run validate specifying your final penalty as an argument.>> Frank>>>> 細田弘吉 wrote:>>>> According to the advice, I tried rms package.>> Just to make sure, I have data of 104 patients (x6.df), which consists>> of 5 explanatory variables and one binary outcome (poor/good) (previous>> model 2 strategy). The outcome consists of 25 poor results and 79 good>> results. Therefore, My events per variable (EPV) is only 5 (much less>> than the rule of thumb of 10).>>>> My questions are about validate and pentrace in rms package.>> I present some codes and results.>> I appreciate anybody's help in advance.>>>> > x6.lrm<- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore,>> data=x6.df, x=T, y=T)>>>> > x6.lrm>> ...>> Obs 104 LR chi2 29.24 R2 0.367 C 0.816>> negative 79 d.f. 5 g 1.633 Dxy 0.632>> positive 25 Pr(> chi2)<0.0001 gr 5.118 gamma 0.632>> max |deriv| 1e-08 gp 0.237 tau-a 0.233>> Brier 0.127>>>> Coef S.E. Wald Z Pr(>|Z|)>> Intercept -5.5328 2.6287 -2.10 0.0353>> stenosis -0.0150 0.0284 -0.53 0.5979>> x1 3.0425 0.9100 3.34 0.0008>> x2 -0.7534 0.4519 -1.67 0.0955>> procedure 1.2085 0.5717 2.11 0.0345>> ClinicalScore 0.3762 0.2287 1.65 0.0999>>>> It seems not too bad. Next, validation by bootstrap ...>>>> > validate(x6.lrm, B=200, bw=F)>> index.orig training test optimism index.corrected n>> Dxy 0.6324 0.6960 0.5870 0.1091 0.5233 200>> R2 0.3668 0.4370 0.3154 0.1216 0.2453 200>> Intercept 0.0000 0.0000 -0.2007 0.2007 -0.2007 200>> Slope 1.0000 1.0000 0.7565 0.2435 0.7565 200>> Emax 0.0000 0.0000 0.0999 0.0999 0.0999 200>> D 0.2716 0.3368 0.2275 0.1093 0.1623 200>> U -0.0192 -0.0192 0.0369 -0.0561 0.0369 200>> Q 0.2908 0.3560 0.1906 0.1654 0.1254 200>> B 0.1272 0.1155 0.1384 -0.0229 0.1501 200>> g 1.6328 2.0740 1.4647 0.6093 1.0235 200>> gp 0.2367 0.2529 0.2189 0.0341 0.2026 200>>>> The apparent Dxy is 0.63, and bias-corrected Dxy is 0.52. The maximum>> absolute error is estimated to be 0.099. The changes in slope and>> intercept are also more substantial. In all, there is evidence that I am>> somewhat overfitting the data, right?.>>>> Furthermore, using step-down variable selection ...>>>> > validate(x6.lrm, B=200, bw=T)>>>> Backwards Step-down - Original Model>>>> Deleted Chi-Sq d.f. P Residual d.f. P AIC>> stenosis 0.28 1 0.5979 0.28 1 0.5979 -1.72>> ClinicalScore 2.60 1 0.1068 2.88 2 0.2370 -1.12>> x2 2.86 1 0.0910 5.74 3 0.1252 -0.26>>>> Approximate Estimates after Deleting Factors>>>> Coef S.E. Wald Z P>> Intercept -5.865 1.4136 -4.149 3.336e-05>> x1 2.915 0.8685 3.357 7.889e-04>> procedure 1.072 0.5590 1.918 5.508e-02>>>> Factors in Final Model>>>> [1] x1 procedure>> index.orig training test optimism index.corrected n>> Dxy 0.5661 0.6755 0.5559 0.1196 0.4464 200>> R2 0.2876 0.4085 0.2784 0.1301 0.1575 200>> Intercept 0.0000 0.0000 -0.2459 0.2459 -0.2459 200>> Slope 1.0000 1.0000 0.7300 0.2700 0.7300 200>> Emax 0.0000 0.0000 0.1173 0.1173 0.1173 200>> D 0.2038 0.3130 0.1970 0.1160 0.0877 200>> U -0.0192 -0.0192 0.0382 -0.0574 0.0382 200>> Q 0.2230 0.3323 0.1589 0.1734 0.0496 200>> B 0.1441 0.1192 0.1452 -0.0261 0.1702 200>> g 1.2628 1.9524 1.3222 0.6302 0.6326 199>> gp 0.2041 0.2430 0.2043 0.0387 0.1654 199>>>> If I select only two variables (x1 and procedure), bias-corrected Dxy>> goes down to 0.45.>>>> [Question 1]>> I have EPV problem. Even so, should I keep the full model (5-variable>> model)? or can I use the 2-variable (x1 and procedure) model which the>> validate() with step-down provides?>>>> [Question 2]>> If I use 2-variable model, should I do>> x2.lrm<- lrm(postopDWI_HI ~ T1+procedure2, data=x6.df, x=T, y=T)?>> or keep the value showed above by validate function?>>>> Next, shrinkage ...>>>> > pentrace(x6.lrm, seq(0, 5.0, by=0.05))>> Best penalty:>> penalty df>> 3.05 4.015378>>>> The best penalty is 3.05. So, I update it with this penalty to obtain>> the corresponding penalized model:>>>> > x6.lrm.pen<- update(x6.lrm, penalty=3.05, x=T, y=T)>> > x6.lrm.pen>> .....>> Penalty factors>>>> simple nonlinear interaction nonlinear.interaction>> 3.05 3.05 3.05 3.05>> Final penalty on -2 log L>> [,1]>> [1,] 3.8>>>> Obs 104 LR chi2 28.18 R2 0.313 C 0.818>> negative 79 d.f. 4.015 g 1.264 Dxy 0.635>> positive 25 Pr(> chi2)<0.0001 gr 3.538 gamma 0.637>> max |deriv| 3e-05 gp 0.201 tau-a 0.234>> Brier 0.129>>>> Coef S.E. Wald Z Pr(>|Z|) Penalty Scale>> Intercept -4.7246 2.2429 -2.11 0.0352 0.0000>> stenosis -0.0105 0.0240 -0.44 0.6621 17.8021>> x1 2.3605 0.7254 3.25 0.0011 0.6054>> x2 -0.5385 0.3653 -1.47 0.1404 1.2851>> procedure 0.9247 0.4844 1.91 0.0563 0.8576>> ClinicalScore 0.3046 0.1874 1.63 0.1041 2.4779>>>> Arrange the coefficients of the two models side by side, and also list>> the difference between the two:>>>> > cbind(coef(x6.lrm), coef(x6.lrm.pen),>> abs(coef(x6.lrm)-coef(x6.lrm.pen)))>> [,1] [,2] [,3]>> Intercept -5.53281808 -4.72464766 0.808170417>> stenosis -0.01496757 -0.01050797 0.004459599>> x1 3.04248257 2.36051833 0.681964238>> x2 -0.75335619 -0.53854750 0.214808685>> procedure 1.20847252 0.92474708 0.283725441>> ClinicalScore 0.37623189 0.30457557 0.071656322>>>> [Question 3]>> Is this penalized model the one I should present for my colleagues?>> I still have EPV problem. Or is EPV problem O.K. if I use penalization?>>>> I am still wondering about what I can do to avoid EPV problem.>> Collecting new data would be a long-time and huge work...>>>>>> (11/04/22 1:46), khosoda_at_med.kobe-u.ac.jp wrote:>>> Thank you for your comment.>>> I forgot to mention that varclus and pvclust showed similar results for>>> my data.>>>>>> BTW, I did not realize rms is a replacement for the Design package.>>> I appreciate your suggestion.>>> -->>> KH>>>>>> (11/04/21 8:00), Frank Harrell wrote:>>>> I think it's OK. You can also use the Hmisc package's varclus function.>>>> Frank>>>>>>>>>>>> 細田弘吉 wrote:>>>>>>>>>> Dear Prof. Harrel,>>>>>>>>>> Thank you very much for your quick advice.>>>>> I will try rms package.>>>>>>>>>> Regarding model reduction, is my model 2 method (clustering and>>>>> recoding>>>>> that are blinded to the outcome) permissible?>>>>>>>>>> Sincerely,>>>>>>>>>> -->>>>> KH>>>>>>>>>> (11/04/20 22:01), Frank Harrell wrote:>>>>>> Deleting variables is a bad idea unless you make that a formal part of>>>>>> the>>>>>> BMA so that the attempt to delete variables is penalized for.>>>>>> Instead of>>>>>> BMA I recommend simple penalized maximum likelihood estimation (see>>>>>> the>>>>>> lrm>>>>>> function in the rms package) or pre-modeling data reduction that is>>>>>> blinded>>>>>> to the outcome variable.>>>>>> Frank>>>>>>>>>>>>>>>>>> 細田弘吉 wrote:>>>>>>>>>>>>>> Hi everybody,>>>>>>> I apologize for long mail in advance.>>>>>>>>>>>>>> I have data of 104 patients, which consists of 15 explanatory>>>>>>> variables>>>>>>> and one binary outcome (poor/good). The outcome consists of 25 poor>>>>>>> results and 79 good results. I tried to analyze the data with>>>>>>> logistic>>>>>>> regression. However, the 15 variables and 25 events means events per>>>>>>> variable (EPV) is much less than 10 (rule of thumb). Therefore, I>>>>>>> used R>>>>>>> package, "BMA" to perform logistic regression with BMA to avoid this>>>>>>> problem.>>>>>>>>>>>>>> model 1 (full model):>>>>>>> x1, x2, x3, x4 are continuous variables and others are binary data.>>>>>>>>>>>>>>> x16.bic.glm<- bic.glm(outcome ~ ., data=x16.df,>>>>>>> glm.family="binomial", OR20, strict=FALSE)>>>>>>>> summary(x16.bic.glm)>>>>>>> (The output below has been cut off at the right edge to save space)>>>>>>>>>>>>>> 62 models were selected>>>>>>> Best 5 models (cumulative posterior probability = 0.3606 ):>>>>>>>>>>>>>> p!=0 EV SD model 1 model2>>>>>>> Intercept 100 -5.1348545 1.652424 -4.4688 -5.15>>>>>>> -5.1536>>>>>>> age 3.3 0.0001634 0.007258 .>>>>>>> sex 4.0>>>>>>> .M -0.0243145 0.220314 .>>>>>>> side 10.8>>>>>>> .R 0.0811227 0.301233 .>>>>>>> procedure 46.9 -0.5356894 0.685148 . -1.163>>>>>>> symptom 3.8 -0.0099438 0.129690 . .>>>>>>> stenosis 3.4 -0.0003343 0.005254 .>>>>>>> x1 3.7 -0.0061451 0.144084 .>>>>>>> x2 100.0 3.1707661 0.892034 3.2221 3.11>>>>>>> x3 51.3 -0.4577885 0.551466 -0.9154 .>>>>>>> HT 4.6>>>>>>> .positive 0.0199299 0.161769 . .>>>>>>> DM 3.3>>>>>>> .positive -0.0019986 0.105910 . .>>>>>>> IHD 3.5>>>>>>> .positive 0.0077626 0.122593 . .>>>>>>> smoking 9.1>>>>>>> .positive 0.0611779 0.258402 . .>>>>>>> hyperlipidemia 16.0>>>>>>> .positive 0.1784293 0.512058 . .>>>>>>> x4 8.2 0.0607398 0.267501 . .>>>>>>>>>>>>>>>>>>>>> nVar 2 2>>>>>>> 1 3 3>>>>>>> BIC -376.9082>>>>>>> -376.5588 -376.3094 -375.8468 -374.5582>>>>>>> post prob 0.104>>>>>>> 0.087 0.077 0.061 0.032>>>>>>>>>>>>>> [Question 1]>>>>>>> Is it O.K to calculate odds ratio and its 95% confidence interval>>>>>>> from>>>>>>> "EV" (posterior distribution mean) and“SD”(posterior distribution>>>>>>> standard deviation)?>>>>>>> For example, 95%CI of EV of x2 can be calculated as;>>>>>>>> exp(3.1707661)>>>>>>> [1] 23.82573 -----> odds ratio>>>>>>>> exp(3.1707661+1.96*0.892034)>>>>>>> [1] 136.8866>>>>>>>> exp(3.1707661-1.96*0.892034)>>>>>>> [1] 4.146976>>>>>>> ------------------> 95%CI (4.1 to 136.9)>>>>>>> Is this O.K.?>>>>>>>>>>>>>> [Question 2]>>>>>>> Is it permissible to delete variables with small value of "p!=0" and>>>>>>> "EV", such as age (3.3% and 0.0001634) to reduce the number of>>>>>>> explanatory variables and reconstruct new model without those>>>>>>> variables>>>>>>> for new session of BMA?>>>>>>>>>>>>>> model 2 (reduced model):>>>>>>> I used R package, "pvclust", to reduce the model. The result>>>>>>> suggested>>>>>>> x1, x2 and x4 belonged to the same cluster, so I picked up only x2.>>>>>>> Based on the subject knowledge, I made a simple unweighted sum, by>>>>>>> counting the number of clinical features. For 9 features (sex, side,>>>>>>> HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the sum ranges>>>>>>> from 0 to 9. This score was defined as ClinicalScore. Consequently, I>>>>>>> made up new data set (x6.df), which consists of 5 variables>>>>>>> (stenosis,>>>>>>> x2, x3, procedure, and ClinicalScore) and one binary outcome>>>>>>> (poor/good). Then, for alternative BMA session...>>>>>>>>>>>>>>> BMAx6.glm<- bic.glm(postopDWI_HI ~ ., data=x6.df,>>>>>>> glm.family="binomial", OR=20, strict=FALSE)>>>>>>>> summary(BMAx6.glm)>>>>>>> (The output below has been cut off at the right edge to save space)>>>>>>> Call:>>>>>>> bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family =>>>>>>> "binomial", strict = FALSE, OR = 20)>>>>>>>>>>>>>>>>>>>>> 13 models were selected>>>>>>> Best 5 models (cumulative posterior probability = 0.7626 ):>>>>>>>>>>>>>> p!=0 EV SD model 1 model 2>>>>>>> Intercept 100 -5.6918362 1.81220 -4.4688 -6.3166>>>>>>> stenosis 8.1 -0.0008417 0.00815 . .>>>>>>> x2 100.0 3.0606165 0.87765 3.2221 3.1154>>>>>>> x3 46.5 -0.3998864 0.52688 -0.9154 .>>>>>>> procedure 49.3 0.5747013 0.70164 . 1.1631>>>>>>> ClinicalScore 27.1 0.0966633 0.19645 . .>>>>>>>>>>>>>>>>>>>>> nVar 2 2 1>>>>>>> 3 3>>>>>>> BIC -376.9082 -376.5588>>>>>>> -376.3094 -375.8468 -375.5025>>>>>>> post prob 0.208 0.175>>>>>>> 0.154 0.122 0.103>>>>>>>>>>>>>> [Question 3]>>>>>>> Am I doing it correctly or not?>>>>>>> I mean this kind of model reduction is permissible for BMA?>>>>>>>>>>>>>> [Question 4]>>>>>>> I still have 5 variables, which violates the rule of thumb, "EPV>>>>>>>> 10".>>>>>>> Is it permissible to delete "stenosis" variable because of small>>>>>>> value>>>>>>> of "EV"? Or is it O.K. because this is BMA?>>>>>>>>>>>>>> Sorry for long post.>>>>>>>>>>>>>> I appreciate your help very much in advance.>>>>>>>>>>>>>> -->>>>>>> KH>>>>>>>>>>>>>> ______________________________________________>>>>>>> R-help_at_r-project.org mailing list>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help>>>>>>> PLEASE do read the posting guide>>>>>>> http://www.R-project.org/posting-guide.html>>>>>>> and provide commented, minimal, self-contained, reproducible code.>>>>>>>>>>>>>>>>>>>>>>>>> ----->>>>>> Frank Harrell>>>>>> Department of Biostatistics, Vanderbilt University>>>>>> -->>>>>> View this message in context:>>>>>> http://r.789695.n4.nabble.com/BMA-logistic-regression-odds-ratio-model-reduction-etc-tp3462416p3462919.html>>>>>>>>>>>> Sent from the R help mailing list archive at Nabble.com.>>>>>>>>>> ______________________________________________>> R-help_at_r-project.org mailing list>> https://stat.ethz.ch/mailman/listinfo/r-help>> PLEASE do read the posting guide>> http://www.R-project.org/posting-guide.html>> and provide commented, minimal, self-contained, reproducible code.>>>>> -----> Frank Harrell> Department of Biostatistics, Vanderbilt University> --> View this message in context: http://r.789695.n4.nabble.com/BMA-logistic-regression-odds-ratio-model-reduction-etc-tp3462416p3473354.html> Sent from the R help mailing list archive at Nabble.com.>> ______________________________________________> R-help_at_r-project.org mailing list> https://stat.ethz.ch/mailman/listinfo/r-help> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html> and provide commented, minimal, self-contained, reproducible code.

-- ************************************************* 神戸大学大学院医学研究科 脳神経外科学分野 細田 弘吉 〒650-0017 神戸市中央区楠町7丁目5-1 Phone: 078-382-5966 Fax : 078-382-5979 E-mail address Office: khosoda_at_med.kobe-u.ac.jp Home : khosoda_at_venus.dti.ne.jp *************************************************

______________________________________________ R-help_at_r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

- application/pdf attachment: x6factor.pentrace.pdf

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Fri 29 Apr 2011 - 13:50:33 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*