Re: [R] Predicting with a principal component regression model: "non-conformable arguments" error

From: Alison Callahan <alison.callahan_at_gmail.com>
Date: Wed, 27 Apr 2011 10:16:01 -0400

Hi Dennis,

My replies are in-line.

On Tue, Apr 26, 2011 at 9:15 PM, Dennis Murphy <djmuser_at_gmail.com> wrote:

> Hi:
>
> My view, which may well be narrow, is that techniques like PLS and PCR
> are useful fit procedures, but I would be very leery about using them
> as prediction machines. With new data, why should a similar set of
> principal components emerge? Why should the ordering be (close to) the
> same? Why should features present in the training data necessarily be
> present in test data? And if the PCs vary considerably from one set of
> data to another, what's the point of prediction, since the covariate
> set is variable from one iteration to the next? Thinking a little more
> mathematically, why should I believe that the same set of basis
> functions (covariates + PCs) would reasonably apply to future data?
> One problem, as I see it, is that the principal components, when used
> as basis functions, are functions of the training data; in that
> context, why is it believable that they would well predict future
> data? [If this is Greek to you (or 'Kling-on', as one of my friends
> says), the basis functions in regression are the columns of the model
> matrix X, which map to the terms in the 'linear predictor'.] One of
> the potential problems is that the effective dimension of the reduced
> PC space may well change from one data set to the next. If all PCs are
> retained, then there is a serious danger of overfitting, which is a
> serious problem in prediction.
>
> If you're going to contemplate using such models for prediction, I
> would seriously consider looking into model validation procedures;
> they should provide some clue about how well a fitted model predicts
> to new cases. One of the best treatments of the subject I know is
> Frank Harrell's Regression Modeling Strategies book (which I believe
> will have a new edition out within the next couple of months). There
> is a current thread about this topic re logistic regression validation
> where the OP has done a nice job of working through the process -
> Prof. Harrell has chimed in a few times with some nice comments and
> observations. Most of the code to do this kind of thing in R resides
> in the rms package; see ?validate and its related functions. I don't
> know if it can be applied to PLS/PCR models (rather doubtful) but the
> methodology is what is important; e.g., the estimation of optimism in
> various figures of merit (e.g., R^2, MSE) when applied over a number
> of test sets, which provides an indication of how much bias is present
> in the fitted model due to potential overfitting. The process relies
> heavily on bootstrapping, so is in some sense vulnerable to the issues
> that arise with the bootstrap (e.g., population undercoverage), but in
> very large training sets this becomes less of a problem. If you can
> validate a PCR model and provide evidence to back it up, then most
> people (present company included) would have less ammunition to attack
> your prediction model.
>
Thank you for these suggestions. The pls package I am using does include
methods for cross-validation to evaluate the quality of PCR/PLSR models, as
well as for selecting the optimal number of components to use when predicting
with a given model, to avoid overfitting. I will also have a look at the rms
package.
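
For anyone reading this in the archive, here is a minimal sketch of what choosing the number of components by cross-validation involves. It is hand-rolled in base R (prcomp() plus lm()) so that the mechanics are visible; it is not the pls package's implementation, and the simulated data and all variable names are assumptions made for illustration. With pls itself, the validation argument of pcr() and the RMSEP() function are the places to look.

```r
# Hand-rolled principal component regression with k-fold cross-validation.
# Base R only; simulated data, purely illustrative.
set.seed(1)
n <- 60
p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", seq_len(p))
y <- X[, 1] - 2 * X[, 2] + rnorm(n, sd = 0.5)

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
sse <- matrix(0, k, p)  # squared prediction error per fold and per ncomp

for (f in seq_len(k)) {
  tr <- fold != f
  # Principal components are estimated from the training rows only
  pc <- prcomp(X[tr, , drop = FALSE], center = TRUE, scale. = FALSE)
  scores_tr <- pc$x
  # Held-out rows are projected with the *training* centre and loadings
  scores_te <- predict(pc, newdata = X[!tr, , drop = FALSE])
  for (m in seq_len(p)) {
    d_tr <- data.frame(y = y[tr], scores_tr[, 1:m, drop = FALSE])
    fit <- lm(y ~ ., data = d_tr)
    d_te <- data.frame(scores_te[, 1:m, drop = FALSE])
    pred <- predict(fit, newdata = d_te)
    sse[f, m] <- sum((y[!tr] - pred)^2)
  }
}

# Cross-validated root mean squared error of prediction for 1..p components
rmsep <- sqrt(colSums(sse) / n)
best_ncomp <- which.min(rmsep)
print(round(rmsep, 3))
print(best_ncomp)
```

The point relevant to Dennis's concern is that the components are re-estimated inside each fold and the held-out rows are only ever projected onto them, so the error estimate reflects how the whole procedure behaves on data it has not seen.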

>
> On Tue, Apr 26, 2011 at 11:26 AM, Alison Callahan
> <alison.callahan@gmail.com> wrote:
> > Hello again all,
> >
> > I am responding to my own earlier post about a "non-conformable arguments"
> > error with the predict() function of the pls package
> > (http://cran.r-project.org/web/packages/pls/) in R 2.13.0 (running in
> > Ubuntu 10.10).
> >
> > I believe I have narrowed down the cause of the error. My new understanding
> > is that if the test data to be predicted using a regression model (where
> > the test data is passed in as 'newdata' to the predict() function) does not
> > contain all possible levels of factors in the training data then the
> > predict() function returns a "non-conformable arguments" error.
> >
> > However, this seems like an odd behaviour to me. Surely not all new data
> > for which the dependent variable(s) are to be predicted will contain all
> > levels of a factor present in the training data. Can someone shed some
> > light on why the predict() function of the pls package has this behaviour?
> > And how to avoid it if possible in a way that doesn't involve users having
> > to insert dummy values in new data?
>
> I don't find this odd at all; rather, I find it comforting. From an R
> programming perspective, the factors in your newdata should have
> exactly the same defined levels as those in the training data. You
> could do this with something like
>
> newdata$somefactor <- factor(newdata$somefactor, levels =
> levels(trainingdata$somefactor))
>
> What happens if, in future data, one or more new levels of a factor
> arise that were not in the training data used to build the prediction
> model?
>
>
I absolutely agree with you. New levels for factors in future data that
didn't exist in the training data would of course be a problem for
predicting. However, in my case, I am trying to use predict() on new data
that contains only a *subset* of the factor levels present in the training
data, and I am getting a "non-conformable arguments" error. For example, my
training data has levels A, B, C, D and E for a given factor, while my test
data contains only levels B, C and D.

Being somewhat new to R, I confused the values of the factor in the new data with the possible levels of that factor. When I specified that the levels of the factor in my test data were to be the same as in the training data, I did not get the "non-conformable arguments" error.
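
A minimal sketch of the difference, using base R's model.matrix() rather than pls. The data frames and the coefficient matrix B are made up for illustration, and I am assuming (without having traced the pls source) that predict() builds its design matrix from whatever levels the factor in newdata declares, which would explain the "newX %*% B" in the error message.

```r
# Training data: factor with levels A..E
train <- data.frame(
  var1 = c(1, 2, 3, 4, 5),
  var3 = factor(c("A", "B", "C", "D", "E"))
)

# New data read separately: the factor only knows about B, C and D
new_raw <- data.frame(
  var1 = c(6, 7, 8),
  var3 = factor(c("B", "C", "D"))
)

X_train <- model.matrix(~ var1 + var3, data = train)
X_bad   <- model.matrix(~ var1 + var3, data = new_raw)
ncol(X_train)  # 6: intercept, var1, and dummies for B, C, D, E
ncol(X_bad)    # 4: intercept, var1, and dummies for C, D

# Hypothetical coefficient matrix with one row per training column
B <- matrix(1, nrow = ncol(X_train), ncol = 1)

# The column counts disagree, so the matrix product fails
err <- tryCatch(X_bad %*% B, error = function(e) conditionMessage(e))
print(err)

# Fix: declare the training levels on the new factor (Dennis's suggestion)
new_fixed <- new_raw
new_fixed$var3 <- factor(new_raw$var3, levels = levels(train$var3))
X_ok <- model.matrix(~ var1 + var3, data = new_fixed)
ncol(X_ok)     # 6 again, with all-zero columns for levels not observed
ok <- X_ok %*% B
```

This would also account for why my original single-file version worked: subsetting rows of a data frame keeps every declared level of a factor, so mydata[!mydata$train, ] still carried the full set of levels even when some of them did not occur in those rows.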

Thanks!

Alison

> Dennis
> >
> > Thanks,
> >
> > Alison
> >
> > On Mon, Apr 18, 2011 at 6:18 PM, Alison Callahan
> > <alison.callahan_at_gmail.com> wrote:
> >
> >> Hello all,
> >>
> >> I have generated a principal components regression model using the pcr()
> >> function from the PLS package (R version 2.13.0). I am getting a
> >> "non-conformable arguments" error when I try to use the predict()
> >> function on new data, but only when I try to read in the new data from a
> >> separate file.
> >>
> >> More specifically, when my data looks like this
> >>
> >> #########training data #1#################
> >>
> >> var1 var2 var3  response train
> >> 1    2    type1 33       TRUE
> >> 2    23   type2 44       TRUE
> >> .....
> >> .......
> >> 18   11   type1 45       FALSE
> >>
> >>
> >> and I use the predict() function from the PLS package as in the example
> >> from http://rss.acs.unt.edu/Rdoc/library/pls/html/predict.mvr.html, e.g.
> >>
> >> ###################################
> >> mydata <- read.csv("mydata.csv", header=TRUE)
> >>
> >> mydata <- data.frame(mydata)
> >>
> >> pcrmodel <- pcr(response ~ var1+var2+var3, data = mydata[mydata$train,])
> >>
> >> predict(pcrmodel, type = "response", newdata = mydata[!mydata$train,])
> >>
> >> ###################################
> >>
> >> the code works, and the model predicts new values for the "response"
> >> variable rows where train=FALSE.
> >>
> >> However, as soon as I put the rows where train = FALSE into a separate
> >> file and remove the "train" column so that my training data looks like
> >> this:
> >>
> >> #########training data #2 ################
> >> var1 var2 var3 response
> >> 1 2 type1 33
> >> 2 23 type2 44
> >> .....
> >>
> >>
> >> and my new test data, saved in a separate file (say "newdata.csv") looks
> >> like this
> >>
> >> ########test data in separate file, newdata.csv ###############
> >> var1 var2 var3 response
> >> 3 5 type1 23
> >> 4 7 type2 30
> >> .....
> >> 18 11 type1 45
> >>
> >> if I train a PCR model using the training data #2 and try to predict
> >> with the resulting model and the data from "newdata.csv", e.g.,
> >>
> >> ##################################
> >> trainingdata <- read.csv("mydata_without_train_column.csv", header=TRUE)
> >>
> >> trainingdata <- data.frame(trainingdata)
> >>
> >> testingdata <- read.csv("newdata.csv", header=TRUE)
> >>
> >> testingdata <- data.frame(testingdata)
> >>
> >> pcrmodel2 <- pcr(response ~ var1+var2+var3, data = trainingdata)
> >>
> >> predict(pcrmodel2, type = "response", newdata = testingdata)
> >> ##############################
> >>
> >> I get the following error:
> >>
> >> "Error in newX %*% B : non-conformable arguments"
> >>
> >> I don't understand why I get this error only when I put the non-training
> >> data into a separate file from the training data and load it as a
> >> separate object. Any help is appreciated,
> >>
> >> Alison
> >>
> >
> >
> > ______________________________________________
> > R-help_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>




Received on Wed 27 Apr 2011 - 14:21:53 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 27 Apr 2011 - 14:30:34 GMT.
