From: Myers, Brent <MyersDB_at_missouri.edu>

Date: Sun, 11 May 2008 19:42:54 -0500

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Mon 12 May 2008 - 00:46:19 GMT

Two very good responses to this question, but I wonder: is there more complete documentation on this form of model and dataframe construction? I've been using R for about 5 years now and wasn't aware of it.

Response 1: Insert a matrix as a column of the dataframe using I().

var <- 1:10
mat <- matrix(101:200, 10)
mydf <- data.frame(var, I(mat))
str(mydf)
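As a minimal sketch of how such a dataframe might be constructed and manipulated, the example below names the matrix columns before wrapping the matrix in I(). The column labels w1 to w10 and the explicit argument names var= and mat= are assumptions made for illustration; they are not part of the original response.

```r
var <- 1:10
mat <- matrix(101:200, nrow = 10)
colnames(mat) <- paste0("w", 1:10)   # hypothetical column labels

# I() keeps the matrix as one column instead of splitting it up
mydf <- data.frame(var = var, mat = I(mat))

names(mydf)                   # two names: "var" and "mat"
dim(mydf$mat)                 # the matrix column is still 10 x 10
mydf$mat[1:3, c("w1", "w2")]  # index the matrix column like any matrix

# Row subsetting of the dataframe carries the matrix rows along
sub <- mydf[mydf$var > 5, ]
dim(sub$mat)
```

The column names of the matrix live on the matrix itself (colnames(mydf$mat)), not in names(mydf), which is why names() on such a dataframe reports a single entry for the whole matrix.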

Response 2: An equivalent response, plus a demonstration that this model construction technique generalizes at least to lm(), but which ends with a question:

C1 <- c(1.1,1.2,1.3,1.4)
C2 <- c(2.1,2.2,2.3,2.4)
M <- cbind(M1=c(11.1,11.2,11.3,11.4),
           M2=c(12.1,12.2,12.3,12.4))
DF <- data.frame(C1=C1,C2=C2,M=M)

"Would you have to "spell out" the interaction term[s] in additional columns of M?"

Hmmm, interesting! I hadn't been aware of this aspect of formula and dataframe construction for modelling, until you pointed it out!

This response included a very useful example; it is excerpted below, after the initial question.

Thanks responders,

Brent

> There is a very useful and apparently fundamental feature of R (or of
> the package pls) which I don't understand.
>
> For datasets with many independent (X) variables such as chemometric
> datasets there is a convenient formula and dataframe construction that
> allows one to access the entire X matrix with a single term.
>
> Consider the gasoline dataset available in the pls package. For the
> model statement in the plsr function one can write: Octane ~ NIR
>
> NIR refers to a (wide) matrix which is a portion of a dataframe. The
> naming of the columns is of the form: 'NIR.xxxx nm'
>
> names(gasoline) returns...
>
> $names
> [1] "octane" "NIR"
>
> instead of...
>
> $names
> [1] "octane" "NIR.1000 nm" "NIR.1001 nm" ...
>
> How do I construct and manipulate such dataframes and the column names
> that go with?
>
> Does the use of these types of formulas and dataframes generalize to
> other modeling functions?
>
> Some specific clues on a help search might be enough, I've tried many.
>
> Regards,
> Brent

I don't have the 'gasoline' dataset to hand, but I can produce something to which your description applies as follows:

C1 <- c(1.1,1.2,1.3,1.4)
C2 <- c(2.1,2.2,2.3,2.4)
M <- cbind(M1=c(11.1,11.2,11.3,11.4),
           M2=c(12.1,12.2,12.3,12.4))
DF <- data.frame(C1=C1,C2=C2,M=M)

DF

#    C1  C2 M.M1 M.M2
# 1 1.1 2.1 11.1 12.1
# 2 1.2 2.2 11.2 12.2
# 3 1.3 2.3 11.3 12.3
# 4 1.4 2.4 11.4 12.4

So the two columns C1 and C2 have gone in as named, and the matrix M (with named columns M1 and M2) has gone in as two separate columns, M.M1 and M.M2.
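To make the difference between the two constructions concrete, here is a small sketch that builds the same data both ways. The names DF_split and DF_nested are hypothetical and chosen only for this comparison.

```r
C1 <- c(1.1, 1.2, 1.3, 1.4)
C2 <- c(2.1, 2.2, 2.3, 2.4)
M  <- cbind(M1 = c(11.1, 11.2, 11.3, 11.4),
            M2 = c(12.1, 12.2, 12.3, 12.4))

# Passing the matrix directly splits it into one column per matrix column
DF_split <- data.frame(C1 = C1, C2 = C2, M = M)
names(DF_split)        # "C1" "C2" "M.M1" "M.M2"

# Wrapping it in I() keeps it as a single matrix-valued column named M
DF_nested <- data.frame(C1 = C1, C2 = C2, M = I(M))
names(DF_nested)       # "C1" "C2" "M"
colnames(DF_nested$M)  # "M1" "M2"
```

Since DF_split has no column called M, a formula term M used with data=DF_split is not found in the dataframe and is looked up in the workspace instead. With DF_nested the term M resolves to the matrix column inside the dataframe.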

Now let's fuzz the numbers a bit, so that the lm() fit makes sense:

C1 <- C1 + round(0.1*runif(4),2)
C1 <- C1 + round(0.1*runif(4),2)
M <- cbind(M1=c(11.1,11.2,11.3,11.4),
           M2=c(12.1,12.2,12.3,12.4)) + round(0.1*runif(8),2)

DF <- data.frame(C1=C1,C2=C2,M=M)

DF

#     C1  C2  M.M1  M.M2
# 1 1.21 2.1 11.19 12.13
# 2 1.34 2.2 11.23 12.23
# 3 1.38 2.3 11.36 12.30
# 4 1.50 2.4 11.43 12.48

summary(lm(C1 ~ M),data=DF)

# Call:
# lm(formula = C1 ~ M)
#
# Residuals:
#        1        2        3        4
# -0.02422  0.02448  0.01309 -0.01335
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -8.28435    2.48952  -3.328    0.186
# MM1         -0.05411    0.66909  -0.081    0.949
# MM2          0.83463    0.50687   1.647    0.347
#
# Residual standard error: 0.03919 on 1 degrees of freedom
# Multiple R-Squared: 0.9642, Adjusted R-squared: 0.8925
# F-statistic: 13.46 on 2 and 1 DF, p-value: 0.1893

In other words, a perfectly standard LM fit, equivalent to

summary(lm(C1 ~ M[,1]+M[,2]))

(as you can check). So all that looks straightforward.
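The check mentioned above can be sketched as follows. The seed and the object names fit_matrix and fit_columns are assumptions added for reproducibility, so the numbers will differ from the output quoted in the message.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

C1 <- c(1.1, 1.2, 1.3, 1.4) + round(0.1 * runif(4), 2)
M  <- cbind(M1 = c(11.1, 11.2, 11.3, 11.4),
            M2 = c(12.1, 12.2, 12.3, 12.4)) + round(0.1 * runif(8), 2)

# One matrix term versus one term per matrix column
fit_matrix  <- lm(C1 ~ M)
fit_columns <- lm(C1 ~ M[, 1] + M[, 2])

# The coefficient names differ (MM1 vs. M[, 1]), so compare values only
all.equal(unname(coef(fit_matrix)), unname(coef(fit_columns)))
all.equal(fitted(fit_matrix), fitted(fit_columns))
```

Both comparisons should report TRUE, since the two formulas produce the same design matrix apart from its column labels.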

One thing, however, is not clear to me in this scenario. Suppose, for example, that the columns M1 and M2 of M were factors (and that you had more rows than I've used above, so that the fit is non-trivial).

Then, in the standard specification of an LM, you could write

summary(lm(C1 ~ M[,1]*M[,2]))

and get the main effects and interactions. But how would you do that in the other type of specification:

Where you used

summary(lm(C1 ~ M, data=DF))

to get the equivalent of

summary(lm(C1 ~ M[,1]+M[,2]))

what would you use to get the equivalent of
summary(lm(C1 ~ M[,1]*M[,2]))??

Would you have to "spell out" the interaction term[s] in additional columns of M?
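One possible approach, offered as a sketch rather than a definitive answer, is to index the matrix column inside the formula. This assumes the matrix was stored as a single column with I(), as in Response 1, and it uses numeric columns and simulated data rather than the factors mentioned above. The names y, DF_nested and DF_flat are hypothetical.

```r
set.seed(42)  # arbitrary seed for reproducibility
n <- 20
M <- cbind(M1 = rnorm(n), M2 = rnorm(n))
y <- rnorm(n)

DF_nested <- data.frame(y = y, M = I(M))               # matrix kept as one column
DF_flat   <- data.frame(y = y, M1 = M[, 1], M2 = M[, 2])

# Remove the workspace copies so the formula can only find M and y in the data
rm(M, y)

# Index the matrix column inside the formula to get main effects + interaction
fit_nested <- lm(y ~ M[, "M1"] * M[, "M2"], data = DF_nested)

# Reference fit with ordinary separate columns
fit_flat <- lm(y ~ M1 * M2, data = DF_flat)

all.equal(unname(coef(fit_nested)), unname(coef(fit_flat)))
```

With this form no extra interaction columns need to be added to M, at the cost of naming the matrix columns explicitly in the formula.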

Hmmm, interesting! I hadn't been aware of this aspect of formula and dataframe construction for modelling, until you pointed it out!

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Mon 12 May 2008 - 02:30:40 GMT.
