[R] PLS component selection for GPLS question

From: Torsten Schindler <Torsten.Schindler_at_chello.at>
Date: Fri 29 Jul 2005 - 21:21:05 EST

How to select the number of PLS components for GPLS for data sets with few samples?

Concrete problem:
My data set: 9 samples of class A and 37 of class B with 254 descriptors.

In the paper: "Classification Using Generalized Partial Least Squares", Beiying Ding, Robert Gentleman, Bioconductor Project Working Papers, year 2004, paper 5

Section 2.6, "Assessing Prediction", states:
"The optimal number of PLS components is selected by choosing that value of K which minimizes LOOCV error rate for the training set."

Section 3.1.3, "Colon data", subsection "Random splitting", states:
"Due to the instability of LOOCV error rates for data with few samples and many covariates, comparison of various classifiers based solely on LOOCV classification errors may not be reliable."

There the authors use random splitting to determine the number of PLS components in GPLS, but I am still not sure how to choose the right number of PLS components for my data set.

I used the function errorest() from the ipred package to estimate the error rates, and gpls() with the Firth procedure switched on. The attached PDF graphic illustrates the problem for my data set.

S_n is the model sensitivity and S_p the model specificity. With 4 components I get the best cross-validation error rate, 17%, and with 5 components the best bootstrap error rate, 9%, but the sensitivity of the model is only 11%! If one chooses 13 components, one gets 100% sensitivity and 100% specificity, but the CV error is 34% and the bootstrap error is 40%, and the risk that the model is overtrained is higher.
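
As a quick arithmetic check on how these measures relate for class sizes of 9 and 37, here is a minimal sketch. It is not part of the original analysis. It assumes class A is the positive class, so that S_n is the fraction of A samples classified correctly. The confusion counts are hypothetical, chosen only to match the class sizes above, and the function name rates is illustrative.

```python
def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity, overall error, balanced error)
    from the four cells of a 2x2 confusion table."""
    sens = tp / (tp + fn)                  # fraction of positives found
    spec = tn / (tn + fp)                  # fraction of negatives found
    err = (fn + fp) / (tp + fn + tn + fp)  # overall misclassification rate
    bal_err = 1 - (sens + spec) / 2        # mean of the per-class error rates
    return sens, spec, err, bal_err

N_A, N_B = 9, 37  # class sizes from the post; A assumed to be "positive"

# Trivial rule: call every sample B.
baseline = rates(tp=0, fn=N_A, tn=N_B, fp=0)

# Hypothetical model: 1 of 9 A samples correct, all B samples correct.
one_hit = rates(tp=1, fn=N_A - 1, tn=N_B, fp=0)

for name, r in [("always B", baseline), ("1 of 9 A correct", one_hit)]:
    print(f"{name}: S_n={r[0]:.2f} S_p={r[1]:.2f} "
          f"error={r[2]:.2f} balanced error={r[3]:.2f}")
```

Under these assumptions, a model with 11% sensitivity and perfect specificity has an overall error of 8/46, about 17%, while the trivial rule "always predict B" has 9/46, about 20%. With this class imbalance the overall error rate says little about class A, so a balanced error may be a more informative criterion when comparing numbers of components.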

How many components should I choose now to get the best GPLS model?

Received on Fri Jul 29 21:25:04 2005