From: Michal Figurski <figurski_at_mail.med.upenn.edu>

Date: Fri, 01 Aug 2008 13:52:40 -0400

Dear all,

All this discussion about what the bootstrap is and is not suitable for finally made me verify the findings of the Pawinski et al. paper.

Here is the procedure and the findings:

- First of all, I took the raw data (posted earlier on this list) and estimated the AUC values using the equation coefficients of their recommended model (#10). However, I was _unable to reproduce_ either the r^2 or the predictive performance values. My results are 0.74 and 44%, respectively, while the reported figures were 0.862 and 82% (41 profiles out of 50). My scatterplot also looks different from the Fig. 2 model 10 scatterplot. Weird...
- Then I fit the multiple linear model to the whole dataset (no bootstrap), using the time-points of model #10. I obtained an r^2 of 0.74 (agreement), a mean prediction error of 7.4% +-28.3%, and a predictive performance of 44%. The reported mean prediction error (PE) was 7.6% +-26.7%, and the reported predictive performance 56% (page 1502, second column, 2nd sentence from the top)! I think the difference in PE may be attributed to numerical differences between SPSS and R, though I can't explain the difference in predictive performance.
- Finally, I used Gustaf's bootstrap code to fit a linear regression with the model #10 time-points on the resampled dataset. The r^2 of the model with median coefficients was identical to that of the model fit to the entire data, and the predictive performance was better by only one profile: 46%. As you can see, these figures are very far from the numbers reported in the paper. I will discuss with the authors how they obtained their numbers, but I now have doubts whether this paper is valid at all...
- Later I tested this on my own dataset (paper to appear in August) and found that the MLR model fit on the entire data has an r^2 and predictive performance identical to those of the median-coefficient model from the bootstrap.
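For anyone who wants to repeat the check: the actual profiles and the model #10 time-points are in the earlier posts, so the sketch below uses made-up column names (C0, C1, C2), synthetic concentrations, and an assumed +/-15% acceptance window; only the mechanics (MLR fit, percent prediction error, share of profiles within the window of the trapezoidal AUC) match what I did.

```r
## Minimal sketch with synthetic data: the real profiles, the time-points
## of model #10, and the +/-15% window are assumptions here.
set.seed(1)
n  <- 50
df <- data.frame(C0 = rlnorm(n), C1 = rlnorm(n), C2 = rlnorm(n))
## stand-in for the full trapezoidal AUC0-12h
df$AUC <- 3 + 2 * df$C0 + 4 * df$C1 + 5 * df$C2 + rnorm(n)

fit  <- lm(AUC ~ C0 + C1 + C2, data = df)   # MLR on all 50 profiles
pred <- fitted(fit)

r2      <- summary(fit)$r.squared           # "agreement"
pe      <- 100 * (pred - df$AUC) / df$AUC   # prediction error, %
mean_pe <- mean(pe)
sd_pe   <- sd(pe)

## "predictive performance": share of profiles predicted within +/-15%
pp <- mean(abs(pe) <= 15)
```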

I must admit, guys, *that I was wrong and you were right: this bootstrap-like procedure does not improve predictions* - at least not to the extent reported in the Pawinski et al. paper.

I blindly believed in this paper, and I am somewhat embarrassed that I didn't verify its findings, even though the dataset had been available to me from the beginning. Maybe it was too much trust in the printed word and in the authority of the PhD biostatistician who devised the procedure...

Nevertheless, I am happy that at least this procedure is harmless, and that I can reproduce the figures reported in /my/ paper.

Best regards, and apologies for being such a hard student. I am being converted to orthodox statistics.

--
Michal J. Figurski
HUP, Pathology & Laboratory Medicine
Xenobiotics Toxicokinetics Research Laboratory
3400 Spruce St. 7 Maloney
Philadelphia, PA 19104
tel. (215) 662-3413

Received on Fri 01 Aug 2008 - 18:09:11 GMT

Gustaf Rydevik wrote:

> On Thu, Jul 31, 2008 at 4:30 PM, Michal Figurski

> <figurski_at_mail.med.upenn.edu> wrote:

>> Frank and all,
>>
>> The point you were looking for was in a page that was linked from the
>> referenced page - I apologize for the confusion. Please take a look at
>> the two last paragraphs here:
>> http://people.revoledu.com/kardi/tutorial/Bootstrap/examples.htm
>>
>> Though, possibly it's my ignorance, maybe it's yours, but you actually
>> missed the important point again. It is that you just don't estimate mean,
>> or CI, or variance on PK profile data! It is as if you were trying to
>> estimate mean, CI and variance of a "Toccata_&_Fugue_in_D_minor.wav" file.
>> What for? The point is in the music! Would the mean or CI or variance tell
>> you anything about that? Besides, everybody knows the variance (or
>> variability?) is there and can estimate it without spending time on
>> calculations.
>> What I am trying to do is comparable to compressing a wave into mp3 - to
>> predict the wave using as few data points as possible. I have a bunch of
>> similar waves and I'm trying to find a common equation to predict them all.
>> I am *not* looking for the variance of the mean!
>>
>> I could be wrong (though it seems less and less likely), but you keep
>> talking about the same irrelevant parameters (CI, variance) on and on.
>> Well, yes - we are at a standstill, but not because of Davison & Hinkley's
>> book. I can try reading it, though as I stated above, it is not even
>> "remotely related" to what I am trying to do. I'll skip it then - life is
>> too short.
>>
>> Nevertheless I thank you (all) for relevant criticism of the procedure (in
>> the points where it was relevant). I plan to use this methodology further,
>> and it was good to find out that it withstood your criticism. I will look
>> into the penalized methods, though.
>>
>> --
>> Michal J. Figurski

>

> I take it you mean the sentence:

>> "For example, in here, the statistical estimator is the sample mean.
>> Using bootstrap sampling, you can do beyond your statistical
>> estimators. You can now get even the distribution of your estimator
>> and the statistics (such as confidence interval, variance) of your
>> estimator."
>
> Again you are misinterpreting the text. The phrase about "doing beyond
> your statistical estimators" is explained in the next sentence, where
> he says that using the bootstrap gives you information about the mean
> *estimator* (and not more information about the population mean).
> And since you're not interested in this information, in your case
> bootstrap/resampling is not useful at all.
>
> As another example of misinterpretation: in your email from a week
> ago, it sounds like you believe that the authors of the original paper
> are trying to improve on a fixed model.
>
> Figurski:
> "Regarding the 'multiple stepwise regression' - according to the cited
> SPSS manual, there are 5 options to select from. I don't think they used
> the 'stepwise selection' option, because their models were already
> pre-defined. Variables were pre-selected based on knowledge of the
> pharmacokinetics of this drug and other factors. I think this part I
> understand pretty well."
>
> This paragraph is wrong. Sorry, no way around it.
>
> Quoting from the Pawinski et al. paper:
> "*Twenty-six* (!) 1-, 2-, or 3-sample estimation
> models were fit (r2 0.341-0.862) to a randomly
> selected subset of the profiles using linear regression
> and were used to estimate AUC0-12h for the profiles not
> included in the regression fit, comparing those estimates
> with the corresponding AUC0-12h values, calculated
> with the linear trapezoidal rule, including all 12
> timed MPA concentrations. The 3-sample models were
> constrained to include no samples past 2 h."
> (emphasis mine)
>
> They clearly state that they are choosing among 26 different models by
> using their bootstrap-like procedure, not improving on a single,
> predefined model. This procedure is statistically sound (more or less,
> at least), and not controversial.
>
> However, (again) what you want to do is *not* what they did in their
> paper! Resampling cannot improve on the performance of a pre-specified
> model. This is intuitively obvious, but moreover it's mathematically
> provable! That's why we're so certain of our standpoint. If you really
> wish, I (or someone else) could write out a proof, but I'm unsure if
> you would be able to follow it.
>
> In the end, it doesn't really matter. What you are doing amounts to
> doing a regression 50 times, when once would suffice. No big harm
> done, just a bit of unnecessary work - and proof to a statistically
> competent reviewer that you don't really understand what you're doing.
> The better option would be to either study some more statistics
> yourself, or find a statistician who can do your analysis for you,
> and trust him to do it right.
>
> Anyhow, good luck with your research.
>
> Best regards,
>
> Gustaf
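Gustaf's claim - that refitting a pre-specified model on bootstrap resamples and taking the median of the coefficients just reproduces the ordinary full-data fit - is easy to check. A minimal R sketch on made-up data (the model and numbers below are purely illustrative, not the Pawinski data):

```r
## Sketch on synthetic data: refit the same pre-specified model on
## bootstrap resamples, take the median of each coefficient, and
## compare with the ordinary full-data fit.
set.seed(42)
n  <- 50
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 3 * df$x2 + rnorm(n)

full_fit <- coef(lm(y ~ x1 + x2, data = df))

B <- 500
boot_coefs <- replicate(B, {
  idx <- sample(n, replace = TRUE)          # resample profiles with replacement
  coef(lm(y ~ x1 + x2, data = df[idx, ]))   # refit the same model
})
median_fit <- apply(boot_coefs, 1, median)  # median of each coefficient

## the two coefficient vectors come out nearly identical
round(rbind(full_fit, median_fit), 2)
```

In other words, the median-coefficient model is essentially the full-data model with extra computation, which matches what I found above.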

> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

