From: Alison Callahan <alison.callahan_at_gmail.com>

Date: Wed, 27 Apr 2011 10:16:01 -0400

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Wed 27 Apr 2011 - 14:21:53 GMT

Hi Dennis,

My replies are in-line.

On Tue, Apr 26, 2011 at 9:15 PM, Dennis Murphy <djmuser_at_gmail.com> wrote:

> Hi:
>
> My view, which may well be narrow, is that techniques like PLS and PCR
> are useful fit procedures, but I would be very leery about using them
> as prediction machines. With new data, why should a similar set of
> principal components emerge? Why should the ordering be (close to) the
> same? Why should features present in the training data necessarily be
> present in test data? And if the PCs vary considerably from one set of
> data to another, what's the point of prediction, since the covariate
> set is variable from one iteration to the next? Thinking a little more
> mathematically, why should I believe that the same set of basis
> functions (covariates + PCs) would reasonably apply to future data?
> One problem, as I see it, is that the principal components, when used
> as basis functions, are functions of the training data; in that
> context, why is it believable that they would well predict future
> data? [If this is Greek to you (or 'Kling-on', as one of my friends
> says), the basis functions in regression are the columns of the model
> matrix X, which map to the terms in the 'linear predictor'.] One of
> the potential problems is that the effective dimension of the reduced
> PC space may well change from one data set to the next. If all PCs are
> retained, then there is a serious danger of overfitting, which is a
> serious problem in prediction.
>
> If you're going to contemplate using such models for prediction, I
> would seriously consider looking into model validation procedures;
> they should provide some clue about how well a fitted model predicts
> to new cases. One of the best treatments of the subject I know is
> Frank Harrell's Regression Modeling Strategies book (which I believe
> will have a new edition out within the next couple of months). There
> is a current thread about this topic re logistic regression validation
> where the OP has done a nice job of working through the process -
> Prof. Harrell has chimed in a few times with some nice comments and
> observations. Most of the code to do this kind of thing in R resides
> in the rms package; see ?validate and its related functions. I don't
> know if it can be applied to PLS/PCR models (rather doubtful) but the
> methodology is what is important; e.g., the estimation of optimism in
> various figures of merit (e.g., R^2, MSE) when applied over a number
> of test sets, which provides an indication of how much bias is present
> in the fitted model due to potential overfitting. The process relies
> heavily on bootstrapping, so is in some sense vulnerable to the issues
> that arise with the bootstrap (e.g., population undercoverage), but in
> very large training sets this becomes less of a problem. If you can
> validate a PCR model and provide evidence to back it up, then most
> people (present company included) would have less ammunition to attack
> your prediction model.
>
Thank you for these suggestions. The pls package I am using does include
methods for cross-validation to evaluate the quality of PCR/PLSR models, as
well as for selecting the optimal number of components to use when predicting
with a given model, to avoid overfitting. I will also have a look at the
rms package.
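
To make the idea of choosing the number of components by cross-validation concrete, here is a minimal base-R sketch. It is not the pls package's own implementation: the data are simulated, and every name in it (X, y, fold, press, best_ncomp) is made up for illustration. The PCA is refitted inside each fold so that the held-out rows never influence the components.

```r
# Hand-rolled k-fold cross-validation for principal components regression.
# Simulated data and hypothetical names; a sketch, not pls internals.
set.seed(1)
n <- 100
p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", seq_len(p))
y <- drop(X %*% c(2, -1, 0, 0, 0)) + rnorm(n)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
press <- numeric(p)  # summed squared prediction error per number of components

for (f in seq_len(k)) {
  test <- fold == f
  # Fit the PCA on the training rows only, then project the held-out rows.
  pca <- prcomp(X[!test, , drop = FALSE], center = TRUE, scale. = FALSE)
  scores_train <- pca$x
  scores_test <- predict(pca, newdata = X[test, , drop = FALSE])
  for (m in seq_len(p)) {
    train_df <- data.frame(y = y[!test], scores_train[, 1:m, drop = FALSE])
    fit <- lm(y ~ ., data = train_df)
    test_df <- data.frame(scores_test[, 1:m, drop = FALSE])
    pred <- predict(fit, newdata = test_df)
    press[m] <- press[m] + sum((y[test] - pred)^2)
  }
}

rmsep <- sqrt(press / n)         # cross-validated RMSE for 1..p components
best_ncomp <- which.min(rmsep)   # number of components with the lowest error
print(rmsep)
print(best_ncomp)
```

Picking the minimum is the simplest rule; a more conservative choice would take the smallest number of components whose error is close to that minimum.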

> On Tue, Apr 26, 2011 at 11:26 AM, Alison Callahan
> <alison.callahan@gmail.com> wrote:

> > Hello again all,
> >
> > I am responding to my own earlier post about a "non-conformable arguments"
> > error with the predict() function of the pls package (
> > http://cran.r-project.org/web/packages/pls/) in R 2.13.0 (running in Ubuntu
> > 10.10).
> >
> > I believe I have narrowed down the cause of the error. My new understanding
> > is that if the test data to be predicted using a regression model (where the
> > test data is passed in as 'newdata' to the predict() function) does not
> > contain all possible levels of factors in the training data then the
> > predict() function returns a "non-conformable arguments" error.
> >
> > However, this seems like an odd behaviour to me. Surely not all new data for
> > which the dependent variable(s) are to be predicted will contain all levels
> > of a factor present in the training data. Can someone shed some light on why
> > the predict() function of the pls package has this behaviour? And how to
> > avoid it if possible in a way that doesn't involve users having to insert
> > dummy values in new data?
>
> I don't find this odd at all; rather, I find it comforting. From an R
> programming perspective, the factors in your newdata should have
> exactly the same defined levels as those in the training data. You
> could do this with something like
>
> newdata$somefactor <- factor(newdata$somefactor,
>                              levels = levels(trainingdata$somefactor))
>
> What happens if, in future data, one or more new levels of a factor
> arise that were not in the training data used to build the prediction
> model?
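
As a concrete illustration of the question about new levels, here is a small base-R sketch with made-up values (train_levels and new_values are hypothetical, not taken from the actual data). It shows that forcing the training levels onto new data silently turns any unseen value into NA, so it is worth checking for unseen values first.

```r
# Hypothetical levels and values, for illustration only.
train_levels <- c("A", "B", "C", "D", "E")
new_values <- c("B", "F", "C")   # "F" never appeared in the training data

# Values present in the new data but unknown to the training data.
unseen <- setdiff(unique(new_values), train_levels)
print(unseen)

# Re-levelling with the training levels maps the unseen value to NA.
relevelled <- factor(new_values, levels = train_levels)
print(is.na(relevelled))
```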

I absolutely agree with you. New levels for factors in future data that
did not exist in the training data would of course be a problem for
prediction. However, in my case, I am trying to use predict() on new data
that contains only a *subset* of the factor levels present in the training
data, and I am getting a "non-conformable arguments" error. For example, my
training data has levels A, B, C, D and E for a given factor, while my test
data contains only levels B, C and D.

Being somewhat new to R, I confused the values of the factor in the new data with the possible levels of that factor. When I specified that the levels of the factor in my test data were to be the same as in the training data, I did not get the "non-conformable arguments" error.
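
A minimal base-R sketch of what seems to be going on, using made-up data frames (train, new) and a stand-in coefficient matrix B rather than anything from the pls package: a factor read from a separate file only gets the levels that occur in that file, so its model matrix has fewer dummy columns than the training one, and the matrix product fails. Re-levelling with the training levels restores the missing columns.

```r
# Hypothetical data; B is a stand-in for fitted coefficients, not pls internals.
train <- data.frame(x = 1:5, g = factor(c("A", "B", "C", "D", "E")))
new <- data.frame(x = c(10, 20, 30), g = factor(c("B", "C", "D")))

X_train <- model.matrix(~ x + g, data = train)
X_new_bad <- model.matrix(~ x + g, data = new)
ncol(X_train)    # 6: intercept, x, and 4 dummy columns for g
ncol(X_new_bad)  # 4: intercept, x, and only 2 dummy columns for g

B <- matrix(1, nrow = ncol(X_train), ncol = 1)

# The column counts disagree, so the product raises an error.
res <- tryCatch(X_new_bad %*% B, error = function(e) conditionMessage(e))
print(res)

# Give the new factor the training levels; the values themselves are unchanged.
new$g <- factor(new$g, levels = levels(train$g))
X_new_ok <- model.matrix(~ x + g, data = new)
ncol(X_new_ok)   # 6: the columns now line up with the training design
pred <- X_new_ok %*% B
```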

Thanks!

Alison

> Dennis
>
> > Thanks,
> >
> > Alison
> >
> > On Mon, Apr 18, 2011 at 6:18 PM, Alison Callahan
> > <alison.callahan_at_gmail.com> wrote:
> >
> >> Hello all,
> >>
> >> I have generated a principal components regression model using the pcr()
> >> function from the PLS package (R version 2.13.0). I am getting a
> >> "non-conformable arguments" error when I try to use the predict() function
> >> on new data, but only when I try to read in the new data from a separate
> >> file.
> >>
> >> More specifically, when my data looks like this
> >>
> >> #########training data #1#################
> >>
> >> var1 var2 var3  response train
> >> 1    2    type1 33       TRUE
> >> 2    23   type2 44       TRUE
> >> .....
> >> .......
> >> 18   11   type1 45       FALSE
> >>
> >>
> >> and I use the predict() function from the PLS package as in the example
> >> from http://rss.acs.unt.edu/Rdoc/library/pls/html/predict.mvr.html, e.g.
> >>
> >> ###################################
> >> library(pls)  # provides pcr() and its predict() method
> >>
> >> mydata <- read.csv("mydata.csv", header=TRUE)
> >>
> >> mydata <- data.frame(mydata)
> >>
> >> # fit on the rows flagged train = TRUE
> >> pcrmodel <- pcr(response ~ var1+var2+var3, data = mydata[mydata$train,])
> >>
> >> # predict the rows flagged train = FALSE
> >> predict(pcrmodel, type = "response", newdata = mydata[!mydata$train,])
> >>
> >> ###################################
> >>
> >> the code works, and the model predicts new values for the "response"
> >> variable rows where train=FALSE.
> >>
> >> However, as soon as I put the rows where train = FALSE into a separate file
> >> and remove the "train" column so that my training data looks like this:
> >>
> >> #########training data #2 ################
> >> var1 var2 var3  response
> >> 1    2    type1 33
> >> 2    23   type2 44
> >> .....
> >>
> >>
> >> and my new test data, saved in a separate file (say "newdata.csv") looks
> >> like this
> >>
> >> ########test data in separate file, newdata.csv ###############
> >> var1 var2 var3  response
> >> 3    5    type1 23
> >> 4    7    type2 30
> >> .....
> >> 18   11   type1 45
> >>
> >> if I train a PCR model using the training data #2 and try to predict with
> >> the resulting model and the data from "newdata.csv", e.g.,
> >>
> >> ##################################
> >> trainingdata <- read.csv("mydata_without_train_column.csv", header=TRUE)
> >>
> >> trainingdata <- data.frame(trainingdata)
> >>
> >> testingdata <- read.csv("newdata.csv", header=TRUE)
> >>
> >> testingdata <- data.frame(testingdata)
> >>
> >> pcrmodel2 <- pcr(response ~ var1+var2+var3, data = trainingdata)
> >>
> >> # predict with the model fitted on training data #2
> >> predict(pcrmodel2, type = "response", newdata = testingdata)
> >> ##############################
> >>
> >> I get the following error:
> >>
> >> "Error in newX %*% B : non-conformable arguments"
> >>
> >> I don't understand why I get this error only when I put the non-training
> >> data into a separate file from the training data and load it as a separate
> >> object. Any help is appreciated,
> >>
> >> Alison

Archive maintained by Robert King, hosted by the discipline of statistics at
the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Wed 27 Apr 2011 - 14:30:34 GMT.
