From: Frank Harrell <f.harrell_at_vanderbilt.edu>

Date: Mon, 25 Apr 2011 08:21:33 -0700 (PDT)

*>
*

> ______________________________________________

*> R-help_at_r-project.org mailing list
*

*> https://stat.ethz.ch/mailman/listinfo/r-help
*

*> PLEASE do read the posting guide
*

*> http://www.R-project.org/posting-guide.html
*

*> and provide commented, minimal, self-contained, reproducible code.
*

*>
*

Frank Harrell

Department of Biostatistics, Vanderbilt University

Date: Mon, 25 Apr 2011 08:21:33 -0700 (PDT)

You've done a lot of good work on this. Yes I would say you have moderate overfitting with the first model. The only thing that saved you from having severe overfitting is that there seems to be a signal present [I am assume this model is truly pre-specified and was not developed at all by looking at patterns of responses Y.]

Penalization somewhat solves the EPV problem but there is no substitute for getting more data.

You can run validate specifying your final penalty as an argument.

細田弘吉 wrote:

*>
*

> According to the advice, I tried rms package.

*> Just to make sure, I have data of 104 patients (x6.df), which consists
**> of 5 explanatory variables and one binary outcome (poor/good) (previous
**> model 2 strategy). The outcome consists of 25 poor results and 79 good
**> results. Therefore, My events per variable (EPV) is only 5 (much less
**> than the rule of thumb of 10).
**>
**> My questions are about validate and pentrace in rms package.
**> I present some codes and results.
**> I appreciate anybody's help in advance.
**>
**> > x6.lrm <- lrm(outcome ~ stenosis+x1+x2+procedure+ClinicalScore,
**> data=x6.df, x=T, y=T)
**>
**> > x6.lrm
**> ...
**> Obs 104 LR chi2 29.24 R2 0.367 C 0.816
**> negative 79 d.f. 5 g 1.633 Dxy 0.632
**> positive 25 Pr(> chi2) <0.0001 gr 5.118 gamma 0.632
**> max |deriv| 1e-08 gp 0.237 tau-a 0.233
**> Brier 0.127
**>
**> Coef S.E. Wald Z Pr(>|Z|)
**> Intercept -5.5328 2.6287 -2.10 0.0353
**> stenosis -0.0150 0.0284 -0.53 0.5979
**> x1 3.0425 0.9100 3.34 0.0008
**> x2 -0.7534 0.4519 -1.67 0.0955
**> procedure 1.2085 0.5717 2.11 0.0345
**> ClinicalScore 0.3762 0.2287 1.65 0.0999
**>
**> It seems not too bad. Next, validation by bootstrap ...
**>
**> > validate(x6.lrm, B=200, bw=F)
**> index.orig training test optimism index.corrected n
**> Dxy 0.6324 0.6960 0.5870 0.1091 0.5233 200
**> R2 0.3668 0.4370 0.3154 0.1216 0.2453 200
**> Intercept 0.0000 0.0000 -0.2007 0.2007 -0.2007 200
**> Slope 1.0000 1.0000 0.7565 0.2435 0.7565 200
**> Emax 0.0000 0.0000 0.0999 0.0999 0.0999 200
**> D 0.2716 0.3368 0.2275 0.1093 0.1623 200
**> U -0.0192 -0.0192 0.0369 -0.0561 0.0369 200
**> Q 0.2908 0.3560 0.1906 0.1654 0.1254 200
**> B 0.1272 0.1155 0.1384 -0.0229 0.1501 200
**> g 1.6328 2.0740 1.4647 0.6093 1.0235 200
**> gp 0.2367 0.2529 0.2189 0.0341 0.2026 200
**>
**> The apparent Dxy is 0.63, and bias-corrected Dxy is 0.52. The maximum
**> absolute error is estimated to be 0.099. The changes in slope and
**> intercept are also more substantial. In all, there is evidence that I am
**> somewhat overfitting the data, right?.
**>
**> Furthermore, using step-down variable selection ...
**>
**> > validate(x6.lrm, B=200, bw=T)
**>
**> Backwards Step-down - Original Model
**>
**> Deleted Chi-Sq d.f. P Residual d.f. P AIC
**> stenosis 0.28 1 0.5979 0.28 1 0.5979 -1.72
**> ClinicalScore 2.60 1 0.1068 2.88 2 0.2370 -1.12
**> x2 2.86 1 0.0910 5.74 3 0.1252 -0.26
**>
**> Approximate Estimates after Deleting Factors
**>
**> Coef S.E. Wald Z P
**> Intercept -5.865 1.4136 -4.149 3.336e-05
**> x1 2.915 0.8685 3.357 7.889e-04
**> procedure 1.072 0.5590 1.918 5.508e-02
**>
**> Factors in Final Model
**>
**> [1] x1 procedure
**> index.orig training test optimism index.corrected n
**> Dxy 0.5661 0.6755 0.5559 0.1196 0.4464 200
**> R2 0.2876 0.4085 0.2784 0.1301 0.1575 200
**> Intercept 0.0000 0.0000 -0.2459 0.2459 -0.2459 200
**> Slope 1.0000 1.0000 0.7300 0.2700 0.7300 200
**> Emax 0.0000 0.0000 0.1173 0.1173 0.1173 200
**> D 0.2038 0.3130 0.1970 0.1160 0.0877 200
**> U -0.0192 -0.0192 0.0382 -0.0574 0.0382 200
**> Q 0.2230 0.3323 0.1589 0.1734 0.0496 200
**> B 0.1441 0.1192 0.1452 -0.0261 0.1702 200
**> g 1.2628 1.9524 1.3222 0.6302 0.6326 199
**> gp 0.2041 0.2430 0.2043 0.0387 0.1654 199
**>
**> If I select only two variables (x1 and procedure), bias-corrected Dxy
**> goes down to 0.45.
**>
**> [Question 1]
**> I have EPV problem. Even so, should I keep the full model (5-variable
**> model)? or can I use the 2-variable (x1 and procedure) model which the
**> validate() with step-down provides?
**>
**> [Question 2]
**> If I use 2-variable model, should I do
**> x2.lrm <- lrm(postopDWI_HI ~ T1+procedure2, data=x6.df, x=T, y=T)?
**> or keep the value showed above by validate function?
**>
**> Next, shrinkage ...
**>
**> > pentrace(x6.lrm, seq(0, 5.0, by=0.05))
**> Best penalty:
**> penalty df
**> 3.05 4.015378
**>
**> The best penalty is 3.05. So, I update it with this penalty to obtain
**> the corresponding penalized model:
**>
**> > x6.lrm.pen <- update(x6.lrm, penalty=3.05, x=T, y=T)
**> > x6.lrm.pen
**> .....
**> Penalty factors
**>
**> simple nonlinear interaction nonlinear.interaction
**> 3.05 3.05 3.05 3.05
**> Final penalty on -2 log L
**> [,1]
**> [1,] 3.8
**>
**> Obs 104 LR chi2 28.18 R2 0.313 C 0.818
**> negative 79 d.f. 4.015 g 1.264 Dxy 0.635
**> positive 25 Pr(> chi2) <0.0001 gr 3.538 gamma 0.637
**> max |deriv| 3e-05 gp 0.201 tau-a 0.234
**> Brier 0.129
**>
**> Coef S.E. Wald Z Pr(>|Z|) Penalty Scale
**> Intercept -4.7246 2.2429 -2.11 0.0352 0.0000
**> stenosis -0.0105 0.0240 -0.44 0.6621 17.8021
**> x1 2.3605 0.7254 3.25 0.0011 0.6054
**> x2 -0.5385 0.3653 -1.47 0.1404 1.2851
**> procedure 0.9247 0.4844 1.91 0.0563 0.8576
**> ClinicalScore 0.3046 0.1874 1.63 0.1041 2.4779
**>
**> Arrange the coefficients of the two models side by side, and also list
**> the difference between the two:
**>
**> > cbind(coef(x6.lrm), coef(x6.lrm.pen),
**> abs(coef(x6.lrm)-coef(x6.lrm.pen)))
**> [,1] [,2] [,3]
**> Intercept -5.53281808 -4.72464766 0.808170417
**> stenosis -0.01496757 -0.01050797 0.004459599
**> x1 3.04248257 2.36051833 0.681964238
**> x2 -0.75335619 -0.53854750 0.214808685
**> procedure 1.20847252 0.92474708 0.283725441
**> ClinicalScore 0.37623189 0.30457557 0.071656322
**>
**> [Question 3]
**> Is this penalized model the one I should present for my colleagues?
**> I still have EPV problem. Or is EPV problem O.K. if I use penalization?
**>
**> I am still wondering about what I can do to avoid EPV problem.
**> Collecting new data would be a long-time and huge work...
**>
**>
**> (11/04/22 1:46), khosoda_at_med.kobe-u.ac.jp wrote:
*

>> Thank you for your comment. >> I forgot to mention that varclus and pvclust showed similar results for >> my data. >> >> BTW, I did not realize rms is a replacement for the Design package. >> I appreciate your suggestion. >> -- >> KH >> >> (11/04/21 8:00), Frank Harrell wrote: >>> I think it's OK. You can also use the Hmisc package's varclus function. >>> Frank >>> >>> >>> 細田弘吉 wrote: >>>> >>>> Dear Prof. Harrel, >>>> >>>> Thank you very much for your quick advice. >>>> I will try rms package. >>>> >>>> Regarding model reduction, is my model 2 method (clustering and >>>> recoding >>>> that are blinded to the outcome) permissible? >>>> >>>> Sincerely, >>>> >>>> -- >>>> KH >>>> >>>> (11/04/20 22:01), Frank Harrell wrote: >>>>> Deleting variables is a bad idea unless you make that a formal part of >>>>> the >>>>> BMA so that the attempt to delete variables is penalized for. >>>>> Instead of >>>>> BMA I recommend simple penalized maximum likelihood estimation (see >>>>> the >>>>> lrm >>>>> function in the rms package) or pre-modeling data reduction that is >>>>> blinded >>>>> to the outcome variable. >>>>> Frank >>>>> >>>>> >>>>> 細田弘吉 wrote: >>>>>> >>>>>> Hi everybody, >>>>>> I apologize for long mail in advance. >>>>>> >>>>>> I have data of 104 patients, which consists of 15 explanatory >>>>>> variables >>>>>> and one binary outcome (poor/good). The outcome consists of 25 poor >>>>>> results and 79 good results. I tried to analyze the data with >>>>>> logistic >>>>>> regression. However, the 15 variables and 25 events means events per >>>>>> variable (EPV) is much less than 10 (rule of thumb). Therefore, I >>>>>> used R >>>>>> package, "BMA" to perform logistic regression with BMA to avoid this >>>>>> problem. >>>>>> >>>>>> model 1 (full model): >>>>>> x1, x2, x3, x4 are continuous variables and others are binary data. >>>>>> >>>>>>> x16.bic.glm<- bic.glm(outcome ~ ., data=x16.df, >>>>>> glm.family="binomial", OR20, strict=FALSE) >>>>>>> summary(x16.bic.glm) >>>>>> (The output below has been cut off at the right edge to save space) >>>>>> >>>>>> 62 models were selected >>>>>> Best 5 models (cumulative posterior probability = 0.3606 ): >>>>>> >>>>>> p!=0 EV SD model 1 model2 >>>>>> Intercept 100 -5.1348545 1.652424 -4.4688 -5.15 >>>>>> -5.1536 >>>>>> age 3.3 0.0001634 0.007258 . >>>>>> sex 4.0 >>>>>> .M -0.0243145 0.220314 . >>>>>> side 10.8 >>>>>> .R 0.0811227 0.301233 . >>>>>> procedure 46.9 -0.5356894 0.685148 . -1.163 >>>>>> symptom 3.8 -0.0099438 0.129690 . . >>>>>> stenosis 3.4 -0.0003343 0.005254 . >>>>>> x1 3.7 -0.0061451 0.144084 . >>>>>> x2 100.0 3.1707661 0.892034 3.2221 3.11 >>>>>> x3 51.3 -0.4577885 0.551466 -0.9154 . >>>>>> HT 4.6 >>>>>> .positive 0.0199299 0.161769 . . >>>>>> DM 3.3 >>>>>> .positive -0.0019986 0.105910 . . >>>>>> IHD 3.5 >>>>>> .positive 0.0077626 0.122593 . . >>>>>> smoking 9.1 >>>>>> .positive 0.0611779 0.258402 . . >>>>>> hyperlipidemia 16.0 >>>>>> .positive 0.1784293 0.512058 . . >>>>>> x4 8.2 0.0607398 0.267501 . . >>>>>> >>>>>> >>>>>> nVar 2 2 >>>>>> 1 3 3 >>>>>> BIC -376.9082 >>>>>> -376.5588 -376.3094 -375.8468 -374.5582 >>>>>> post prob 0.104 >>>>>> 0.087 0.077 0.061 0.032 >>>>>> >>>>>> [Question 1] >>>>>> Is it O.K to calculate odds ratio and its 95% confidence interval >>>>>> from >>>>>> "EV" (posterior distribution mean) and“SD”(posterior distribution >>>>>> standard deviation)? >>>>>> For example, 95%CI of EV of x2 can be calculated as; >>>>>>> exp(3.1707661) >>>>>> [1] 23.82573 -----> odds ratio >>>>>>> exp(3.1707661+1.96*0.892034) >>>>>> [1] 136.8866 >>>>>>> exp(3.1707661-1.96*0.892034) >>>>>> [1] 4.146976 >>>>>> ------------------> 95%CI (4.1 to 136.9) >>>>>> Is this O.K.? >>>>>> >>>>>> [Question 2] >>>>>> Is it permissible to delete variables with small value of "p!=0" and >>>>>> "EV", such as age (3.3% and 0.0001634) to reduce the number of >>>>>> explanatory variables and reconstruct new model without those >>>>>> variables >>>>>> for new session of BMA? >>>>>> >>>>>> model 2 (reduced model): >>>>>> I used R package, "pvclust", to reduce the model. The result >>>>>> suggested >>>>>> x1, x2 and x4 belonged to the same cluster, so I picked up only x2. >>>>>> Based on the subject knowledge, I made a simple unweighted sum, by >>>>>> counting the number of clinical features. For 9 features (sex, side, >>>>>> HT2, hyperlipidemia, DM, IHD, smoking, symptom, age), the sum ranges >>>>>> from 0 to 9. This score was defined as ClinicalScore. Consequently, I >>>>>> made up new data set (x6.df), which consists of 5 variables >>>>>> (stenosis, >>>>>> x2, x3, procedure, and ClinicalScore) and one binary outcome >>>>>> (poor/good). Then, for alternative BMA session... >>>>>> >>>>>>> BMAx6.glm<- bic.glm(postopDWI_HI ~ ., data=x6.df, >>>>>> glm.family="binomial", OR=20, strict=FALSE) >>>>>>> summary(BMAx6.glm) >>>>>> (The output below has been cut off at the right edge to save space) >>>>>> Call: >>>>>> bic.glm.formula(f = postopDWI_HI ~ ., data = x6.df, glm.family = >>>>>> "binomial", strict = FALSE, OR = 20) >>>>>> >>>>>> >>>>>> 13 models were selected >>>>>> Best 5 models (cumulative posterior probability = 0.7626 ): >>>>>> >>>>>> p!=0 EV SD model 1 model 2 >>>>>> Intercept 100 -5.6918362 1.81220 -4.4688 -6.3166 >>>>>> stenosis 8.1 -0.0008417 0.00815 . . >>>>>> x2 100.0 3.0606165 0.87765 3.2221 3.1154 >>>>>> x3 46.5 -0.3998864 0.52688 -0.9154 . >>>>>> procedure 49.3 0.5747013 0.70164 . 1.1631 >>>>>> ClinicalScore 27.1 0.0966633 0.19645 . . >>>>>> >>>>>> >>>>>> nVar 2 2 1 >>>>>> 3 3 >>>>>> BIC -376.9082 -376.5588 >>>>>> -376.3094 -375.8468 -375.5025 >>>>>> post prob 0.208 0.175 >>>>>> 0.154 0.122 0.103 >>>>>> >>>>>> [Question 3] >>>>>> Am I doing it correctly or not? >>>>>> I mean this kind of model reduction is permissible for BMA? >>>>>> >>>>>> [Question 4] >>>>>> I still have 5 variables, which violates the rule of thumb, "EPV> >>>>>> 10". >>>>>> Is it permissible to delete "stenosis" variable because of small >>>>>> value >>>>>> of "EV"? Or is it O.K. because this is BMA? >>>>>> >>>>>> Sorry for long post. >>>>>> >>>>>> I appreciate your help very much in advance. >>>>>> >>>>>> -- >>>>>> KH >>>>>> >>>>>> ______________________________________________ >>>>>> R-help_at_r-project.org mailing list >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>> PLEASE do read the posting guide >>>>>> http://www.R-project.org/posting-guide.html >>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>>> >>>>> >>>>> >>>>> ----- >>>>> Frank Harrell >>>>> Department of Biostatistics, Vanderbilt University >>>>> -- >>>>> View this message in context: >>>>> http://r.789695.n4.nabble.com/BMA-logistic-regression-odds-ratio-model-reduction-etc-tp3462416p3462919.html >>>>> >>>>> Sent from the R help mailing list archive at Nabble.com. >>>>>

> ______________________________________________

Frank Harrell

Department of Biostatistics, Vanderbilt University

-- View this message in context: http://r.789695.n4.nabble.com/BMA-logistic-regression-odds-ratio-model-reduction-etc-tp3462416p3473354.html Sent from the R help mailing list archive at Nabble.com. ______________________________________________ R-help_at_r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.Received on Mon 25 Apr 2011 - 16:04:49 GMT

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Tue 26 Apr 2011 - 16:20:35 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*