Re: [R] Help in using PCR

From: Gavin Simpson <>
Date: Tue, 01 Jul 2008 08:03:31 +0100

On Tue, 2008-07-01 at 10:54 +1000, Jason Lee wrote:
> Hi,
> Currently I have a dataset of 2400*408. And I would like to apply PCR method
> to study the any correlation between the tests.
> My current data is in data.frame and I have formed horizontal(1-407) to be
> the exact data, and (408) to be my results data(Yes and No)
> I have also binarized these Yes and No to 1 and -1s.
> However, when I refer to PCR manual on R, the example of yarn.pcr <-
> pcr(density ~ NIR, 6, data = yarn, validation = "CV"), I
> am not sure how can I adapt the command based line to my sample dataset.

In the yarn data set, NIR is a matrix with columns representing near infra-red spectra at 268 wavelengths (i.e. variables) on 28 yarns (the samples, 7 of which are a test set). Take a look at:



A matrix is allowed on the rhs of a model formula which is why this works.
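
To see this structure for yourself, something along the following lines should do. It is only a small sketch and assumes the pls package is installed; the particular inspection calls are my choice of illustration, not taken from the pcr help page.

```r
# Inspect how the yarn example data are laid out
library(pls)   # assumes pls is installed
data(yarn)

names(yarn)             # components of the data frame
is.matrix(yarn$NIR)     # the spectra are stored as one matrix component
dim(yarn$NIR)           # samples by wavelengths
table(yarn$train)       # FALSE marks the test-set samples
```

So NIR is a single component of the data frame that happens to hold a whole matrix, which is why one name on the rhs of the formula is enough.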

This is a reasonably standard model formula in R, something that you'll come across more and more if you use R for any length of time. These formulae are a symbolic way of describing the model in the form:

response ~ rhs

where response is (are) the response variable(s) or thing you are trying to predict, ~ means "is modelled by", and rhs contains the definition of the model matrix (i.e. the set of predictor or explanatory variables), such as

density ~ var1 + var2 + var3*var4

(which includes main and interaction terms for var3 and var4 via the use of the '*'). This says that density is modelled as a function of var1, var2, var3 and var4, plus an interaction term between var3 and var4.
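
If you want to check how R expands such a formula, you can ask for its terms. This is a minimal sketch using base R only; the variable names are the placeholders from the example above and do not need to exist for this to run.

```r
# Expand the example formula into its individual terms
f <- density ~ var1 + var2 + var3 * var4
labs <- attr(terms(f), "term.labels")
print(labs)
# var3 * var4 expands to var3, var4 and the interaction var3:var4
```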

In the main, you will see that the rhs normally refers directly to named variables as in my last example. This would be tedious with 268 variables, so in the yarn example a matrix containing these 268 predictor variables is stated, rather than having to name all 268 wavelengths.
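
If you wanted to mimic the yarn layout with your own data, a sketch along these lines might work. The data here are simulated stand-ins (20 rows, 5 predictors and a -1/1 response), and the names raw, dat, X and myresp are hypothetical, chosen only for illustration.

```r
# Simulated stand-in for a data frame with predictors first and the
# response in the last column (hypothetical names throughout)
set.seed(1)
raw <- as.data.frame(matrix(rnorm(20 * 5), nrow = 20))
raw$result <- sample(c(-1, 1), 20, replace = TRUE)

# Select the predictor columns with a range, 1:5.
# Note that 1-5 is arithmetic: it evaluates to -4 and drops column 4 only.
dropped_one <- names(raw[, 1 - 5])

# Store the predictors as a single matrix component, as in yarn
X <- as.matrix(raw[, 1:5])
dat <- data.frame(myresp = raw$result)
dat$X <- X

# The formula can now name the matrix once, like density ~ NIR
mm <- model.matrix(myresp ~ X, data = dat)
colnames(mm)
```

The same point about indexing applies to the attempt quoted at the end of this message: cancerv1[,1-407] is read as cancerv1[,-406], not as columns 1 to 407, which would be written 1:407.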

You can do this another way though, one that I feel is more natural. Let's assume that your data frame contains named columns, that one of these is the response variable, and that the remaining columns are the predictors. Further assume that this response is called 'myresp'; then you can proceed as follows:

cancerv1.pcr <- pcr(myresp ~ . , ncomp = 6, data = cancerv1,
                    validation = "CV")

What this means is myresp is modelled by '.' and '.' is shorthand for all variables in 'data' not currently in the model (i.e. myresp is not included on the rhs). So as long as your data frame contains both the response and the explanatory variables this will work.
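
As a rough, self-contained illustration of the '.' shorthand, the following sketch fits the same kind of model to simulated data. The data frame here is a made-up stand-in (60 rows, 10 predictors and a -1/1 response called myresp), not your real data, and it assumes the pls package is installed.

```r
library(pls)   # assumes pls is installed

# Simulated stand-in for the real data frame (hypothetical values)
set.seed(42)
n <- 60
p <- 10
cancerv1 <- as.data.frame(matrix(rnorm(n * p), nrow = n))
cancerv1$myresp <- sample(c(-1, 1), n, replace = TRUE)

# See what '.' expands to: every column of 'data' except the response
labs <- attr(terms(myresp ~ ., data = cancerv1), "term.labels")
print(labs)

# Fit the PCR model with 6 components and cross-validation
cancerv1.pcr <- pcr(myresp ~ ., ncomp = 6, data = cancerv1,
                    validation = "CV")
summary(cancerv1.pcr)
```

With your own data you would skip the simulation step and make sure the response column carries the name used on the lhs of the formula.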

This is a fundamental feature of R's modelling functions. As such you need to become familiar with model formulae, so take a look at ?formula, at the relevant section in An Introduction to R, and at some of the introductory materials in the contributed documentation section of the R website.


> It seems that they label each horizontal (columns) as NIR and followed by
> Density (which is my results data). My doubt is
> do I have to label these data at the first place? If not, what
> variables/command that I should put in place of density?
> cancerv1.pcr<-pcr(cancerv1[,1-407],6,data=cancerv1,validation="CV")?
> Please advise. Thanks.
> [[alternative HTML version deleted]]
> ______________________________________________
> mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.

Received on Tue 01 Jul 2008 - 07:09:03 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 01 Jul 2008 - 09:30:54 GMT.
