[R] PCA: eigen/princomp vs. svd/prcomp

From: George W. Gilchrist <gwgilc_at_wm.edu>
Date: Sat 21 Jan 2006 - 10:02:10 EST

I am using R 2.2.1 on OS X 10.4.4. I have a question that is partly about R and partly about differences in the loadings when doing principal components analysis with eigen()/princomp() versus prcomp(). Here is the story:

I have a matrix of mean monthly temperatures for 26 sites in the northern and southern hemispheres (26 x 12). I am using PCA to reduce this to one or two variables that capture most of the annual temperature variation among these sites. I am particularly interested in a single vector that captures the overall annual differences among sites. The southern hemisphere sites are 6 months out of phase with the northern, in terms of seasons. So the first question is whether to rotate the southern hemisphere data by six months, so that Jan=July, Feb=Aug, etc., before PCA. The second question is whether to center and scale the data. My gut feeling is no, as these are all temperatures and the differences in means and variances among months are important.
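The six-month rotation of the southern hemisphere rows can be sketched as a simple column shift. This is only an illustration: the matrix `temps` is simulated, and the names `temps`, `south`, and `shifted` are hypothetical stand-ins rather than the actual data or objects.

```r
# Simulated stand-in for the 26 x 12 matrix of mean monthly temperatures
set.seed(1)
temps <- matrix(rnorm(26 * 12, mean = 15, sd = 5), nrow = 26,
                dimnames = list(paste0("site", 1:26), month.abb))

# Hypothetical indicator: assume the last 10 sites are in the southern hemisphere
south <- rep(c(FALSE, TRUE), times = c(16, 10))

# Shift the southern rows by six months, so that the first column holds
# their July value, the second their August value, and so on
shifted <- temps
shifted[south, ] <- temps[south, c(7:12, 1:6)]
```

The northern rows are left untouched, so any PCA run on `shifted` differs from one run on `temps` only through the southern sites.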

If I do PCA using eigen()/princomp() on the unrotated, unscaled, and uncentered data, the first PC explains about 60% of the variation and represents the difference in phase between the southern and northern hemispheres. The second PC represents mean temperature and explains about 35% of the variation.

If I use prcomp() on the unrotated, unscaled, and uncentered data, the first PC represents mean temperature and explains >90% of the variation, the second represents the seasonal phase difference and explains less than 5% of the variation. This surprised me, as intuitively I had expected the seasonal phase difference to fall out first, as it did using eigen(). If anyone has an explanation for this, I would love to hear it.
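One way to look at the comparison, under the assumption (the exact calls are not shown here) that the eigen() route means eigen(cov(x)): cov() always subtracts the column means, whereas prcomp() with center = FALSE takes the SVD of the raw matrix and reports sdev as the singular values divided by sqrt(n - 1). The sketch below uses simulated data, not the temperature matrix, and the object names are hypothetical.

```r
set.seed(42)
# Simulated stand-in: 26 sites x 12 months with a large overall mean
x <- matrix(rnorm(26 * 12, mean = 15, sd = 5), nrow = 26)
n <- nrow(x)

# Assumed eigen() route: cov() centers the columns before forming the matrix
ev_cov <- eigen(cov(x), symmetric = TRUE)$values

# prcomp() without centering or scaling: SVD of the raw data
pc_raw <- prcomp(x, center = FALSE, scale. = FALSE)

# Eigenvalues of the uncentered cross-product matrix X'X / (n - 1)
ev_raw <- eigen(crossprod(x) / (n - 1), symmetric = TRUE)$values

# The uncentered prcomp() variances match ev_raw, not ev_cov
all.equal(pc_raw$sdev^2, ev_raw)

# Share of the first component under each route
ev_cov[1] / sum(ev_cov)
pc_raw$sdev[1]^2 / sum(pc_raw$sdev^2)
```

If that assumption holds, the two routes are decomposing different matrices: one measures variation about the column means, the other variation about zero, so a large overall mean can end up in the first uncentered component.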

If I center the data, the two methods yield nearly identical results, with the first PC capturing the seasonal phase difference and the second the mean, explaining 60% and 30% of the variance, respectively. My intuition (which often is wrong...) says that this is not the right way to do things in this case.
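The agreement on centered data can be checked numerically. The sketch below uses simulated data and hypothetical object names, and again assumes the eigen() route is eigen(cov(x)). It relies on two documented behaviours: prcomp() centers by default and uses the divisor n - 1, while princomp() always centers and uses the divisor n.

```r
set.seed(7)
# Simulated stand-in for the 26 x 12 temperature matrix
x <- matrix(rnorm(26 * 12, mean = 15, sd = 5), nrow = 26)
n <- nrow(x)

ev  <- eigen(cov(x), symmetric = TRUE)  # covariance route, divisor n - 1
pc  <- prcomp(x)                        # centers by default, divisor n - 1
pcm <- princomp(x)                      # always centers, divisor n

# Variances agree for prcomp(), and up to the factor (n - 1) / n for princomp()
all.equal(pc$sdev^2, ev$values)
all.equal(unname(pcm$sdev^2), ev$values * (n - 1) / n)

# Eigenvectors are only defined up to sign, so compare absolute values
all.equal(abs(unname(pc$rotation)), abs(ev$vectors))
```

The divisor difference rescales every variance by the same factor, so the proportions of variance explained are unchanged by it.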

I love the result from prcomp() using the uncentered, unscaled data, but the loadings are so different from the eigenvectors. I am suspicious that something funky is going on here. Does not centering the data cause a problem with the math? I would appreciate any comments.

If I rotate the southern hemisphere data six months out of phase, then the first PC by either method represents mean temperature and the second captures the seasonal difference but again separates the northern and southern hemispheres. The variance explained by the first PC is about 75% using eigen() and 97% using prcomp(). On the one hand, this seems like a sensible approach; on the other, it is pretty manipulative of the data. March in Santiago probably is NOT the same as September in San Francisco, as is reflected in the second PC. But again the two methods yield very different amounts of variance explained. Why?

Any thoughts would be very much appreciated!

cheers, George

George W. Gilchrist                        Email #1: gwgilc@wm.edu
Department of Biology, Box 8795          Email #2: kitesci@cox.net
College of William & Mary                    Phone: (757) 221-7751
Williamsburg, VA 23187-8795                    Fax: (757) 221-6483

R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Sat Jan 21 10:10:38 2006
