Re: [R] Significance of Principal Coordinates

From: Jari Oksanen <jarioksa_at_sun3.oulu.fi>
Date: Wed 16 Mar 2005 - 03:38:59 EST

On Mon, 2005-03-14 at 18:32 +0100, Christian Kamenik wrote:
> Dear all,
>
> I was looking for methods in R that allow assessing the number of
> significant principal coordinates. Unfortunately I was not very
> successful. I expanded my search to the web and Current Contents,
> however, the information I found is very limited.
> Therefore, I tried to write code for doing a randomization. I would
> highly appreciate if somebody could comment on the following approach. I
> am neither a statistician, nor an R expert... the data matrix I used has
> 72 species (columns) and 167 samples (rows).
>
Earlier this year (Sat, 29 Jan 2005) Jérôme Lemaître asked something similar here under the subject "Bootstrapped eigenvector" (but the code I posted then had one bug I know of and perhaps some I don't!). Some ecologists (Donald Jackson, Peres-Neto) have indeed tried to develop such methods for PCA, and these could easily be modified for PCoA, which is about the same method, in particular with Euclidean distances like you used. So the following two solutions are practically identical (within 2e-15 in the case I tried):

library(vegan)
x <- decostand(x, "norm")  # in vegan: scale each row to unit length
chordis <- dist(x)         # Euclidean is the default, so this is chord distance
pcoa <- cmdscale(chordis)
pca <- prcomp(x)

Verify this with:

procrustes(pcoa, pca, choices=1:2) # in vegan
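
For anyone without vegan at hand, the same comparison can be sketched in base R. This is only an illustration under assumptions: the matrix `x` is simulated rather than real community data, and the row normalization is written out by hand as a stand-in for decostand(x, "norm"). The sign of each axis is arbitrary in both methods, so the scores are compared as absolute values.

```r
set.seed(1)

# Simulated abundance-like data (an assumption for illustration):
# 20 samples (rows) by 6 species (columns); the +1 avoids all-zero rows
x <- matrix(rpois(120, lambda = 3) + 1, nrow = 20, ncol = 6)

# Scale each row to unit Euclidean length
xn <- x / sqrt(rowSums(x^2))

# PCoA on chord distances (Euclidean distances between normalized rows)
pcoa <- cmdscale(dist(xn), k = 2)

# PCA on the same normalized data; prcomp centres the columns by default
pca <- prcomp(xn)$x[, 1:2]

# Largest absolute difference between the two sets of scores
max_diff <- max(abs(abs(pcoa) - abs(pca)))
print(max_diff)
```

If the two ordinations agree, `max_diff` should be of the order of machine precision.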

PCoA with row weights is something different, but I really don't know why you would want to do this. I really don't understand what people mean by "significant" eigenvalues, unless they are doing Factor Analysis. In PCA, you rotate your data, and you can find low-rank approximations of your data, but how these rotations could be "significant" is beyond my imagination.

Further, resampling with replacement seems poorly suited to multivariate analysis: it duplicates some rows, and so makes it easier to find similar rows, which is the ultimate task in PC rotation. It seems that Monte Carlo results are systematically "better" than any original data (this is not disturbing only if the number of rows is much lower than the number of columns). Also, resampling or shuffling species tends to create communities that are fundamentally different from any real community we have: instead of a single or a few abundant species, they may have several or none. With a total abundance constraint you can hide the traces of anarchistic community assembly, but not its fundamental fault.

So I do think that (1) you cannot use resampling in assessing PCA and its kin, (2) you cannot say what being "significant" means in this case, and (3) the number of "significant" axes would only be a function of sample size even here.
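
The point about resampling with replacement can be tried out with a small simulation. What follows is a rough sketch, not anybody's published method: the data are pure noise (so no axis is "real"), and the helper `first_axis_share()`, the dimensions and the number of resamples are all made up for illustration. It compares the share of variance on the first principal component in the original data against ordinary bootstrap resamples of the rows.

```r
set.seed(42)
n <- 50   # samples (rows)
p <- 10   # variables (columns)

# Pure noise: independent columns, so there is no structure to find
x <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Hypothetical helper: proportion of total variance on the first PC
first_axis_share <- function(m) {
  ev <- prcomp(m)$sdev^2
  ev[1] / sum(ev)
}

observed <- first_axis_share(x)

# Resample rows with replacement, which duplicates some rows
boot <- replicate(200, first_axis_share(x[sample(n, replace = TRUE), ]))

cat("observed share:           ", round(observed, 3), "\n")
cat("mean bootstrap share:     ", round(mean(boot), 3), "\n")
cat("resamples above observed: ", mean(boot > observed), "\n")
```

If the resampled first axis tends to carry a larger share than the observed one, the resampled data look "better" than the original even though nothing is there to be found.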

Now my hope is that some guru over there gets so irritated that (s)he chastises me for writing such pieces of stupidity, and sends a correct solution here with accompanying code and references to the literature. Let's hope so.

The old truth is that most data sets have 2.5 dimensions (Kruskal): those two that you can show in a printed plot, and that half a dimension that you must explain away in the text. Wouldn't that be a sufficient solution?

cheers, jari oksanen

-- 
Jari Oksanen <jarioksa@sun3.oulu.fi>

Received on Wed Mar 16 03:44:26 2005
