# [Rd] cmdscale & non-Euclidean dissimilarities

From: Jari Oksanen <jari.oksanen_at_oulu.fi>
Date: Wed, 10 Jun 2009 15:27:27 +0300

Dear R gurus,

I think that cmdscale() function has problems with non-Euclidean distances which have negative eigenvalues. The problems are two-fold:

(1) Wrong eigenvalue is removed: there will be at least one zero
eigenvalue in cmdscale, and the function assumes it is the last one. With non-Euclidean dissimilarities you will have negative eigenvalues, and the zero eigenvalue will be the last positive one before negative eigenvalues. Now the function returns the zero eigenvalue and corresponding zero-eigenvector, but drops the last negative eigenvalue
(which has larger absolute value than any other negative eigenvalue).

(2) Gower (1985) says that with non-Euclidean matrices and negative
eigenvalues you will have imaginary axes, and the distances on imaginary axes (negative eigenvalues) should be subtracted from the distances on real axes (positive eigenvalues). The formulation in the article is like this (Gower 1985, p. 93):

"""
f_{ii} + f_{jj} - 2 f_{ij} = d_{ij}^2 =
\sum_{p=1}^r (l_{pi} - l_{pj})^2 - \sum_{p=r+1}^{r+s} (l_{pi} - l_{pj})^ 2

This is the usual "Pythagorean" representation of squared distances in terms of coordinates $l_{pi} (p = 1, 2 \ldots r+s)$, except that for $p>r$ the coordinates become purely imaginary. """

This also suggests that for GOF (goodness of fit) measure of cmdscale() the negative eigenvalues should be subtracted from the sum of positive eigenvalues. Currently, the function uses two ways: the sum of abs values of eigenvalues (and it should be sum of eigenvalues with their signs), and the sum of above-zero eigenvalues for the total. The latter makes some sense, but the first looks non-Gowerian.

Reference

Gower, J. C. (1985) Properties of Euclidean and non-Euclidean distance matrices. Linear Algebra and its Applications 67, 81--97.

The following change seems to avoid both problems. The change removes only the eigenvalue that is closest to the zero. There may be more than one zero eigenvalue (or of magnitude 1e-17), but this leaves the rest there. It also changes the way the first alternative of GOF is calculated. This changes the code as little as possible, and it still leaves behind some cruft of the old code that assumed that last eigenvalue is the zero eigenvalue.

• R/src/library/stats/R/cmdscale.R (revision 48741) +++ R/src/library/stats/R/cmdscale.R (working copy) @@ -56,6 +56,9 @@ x[non.diag] <- (d[non.diag] + add.c)^2 } e <- eigen(-x/2, symmetric = TRUE) + zeroeig <- which.min(abs(e$values)) + e$values <- e$values[-zeroeig] + e$vectors <- e$vectors[ , -zeroeig, drop = FALSE] ev <- e$values[1L:k] if(any(ev < 0)) warning(gettextf("some of the first %d eigenvalues are < 0", k), @@ -63,9 +66,9 @@ points <- e$vectors[, 1L:k, drop = FALSE] %*% diag(sqrt(ev), k) dimnames(points) <- list(rn, NULL) if (eig || x.ret || add) { - evalus <- e$values[-n] + evalus <- e\$values list(points = points, eig = if(eig) ev, x = if(x.ret) x, ac = if(add) add.c else 0, - GOF = sum(ev)/c(sum(abs(evalus)), sum(evalus[evalus > 0]))) + GOF = sum(ev)/c(sum(evalus), sum(evalus[evalus > 0]))) } else points }

Best wishes, Jari Oksanen

--
Jari Oksanen -- Dept Biology, Univ Oulu, FI-90014 Oulu, Finland
email jari.oksanen@oulu.fi, homepage http://cc.oulu.fi/~jarioksa/

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Wed 10 Jun 2009 - 12:46:02 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 10 Jun 2009 - 13:35:47 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.