**From:** Jari Oksanen (*jari.oksanen@oulu.fi*)

**Date:** Sat 29 May 2004 - 15:53:11 EST

**Next message:**Aurelie.Cohas@univ-lyon1.fr: "[R] multiple nesting levels in GEE"**Previous message:**Zhu Wang: "[R] Re: Problem: creating shared objects using lapack and blas"**Next in thread:**n.bouget@laposte.net: "Re: [R] distance in the function kmeans"**Maybe reply:**n.bouget@laposte.net: "Re: [R] distance in the function kmeans"**Maybe reply:**n.bouget@laposte.net: "Re: [R] distance in the function kmeans"

Message-id: <77E46C91-B134-11D8-B848-000A95C76CA8@oulu.fi>

I am breaking the thread because I am writing this at home, and no new
messages on this subject arrived after I got home. I hope this still
reaches interested parties.

There are several methods that find centroids (means) from distance
data. Centroid clustering methods do so, and so does classical scaling,
a.k.a. metric multidimensional scaling, a.k.a. principal co-ordinates
analysis (in the R function cmdscale the means are found in the C file
dblcen.c in the R sources). Strictly speaking, this way of finding
centroids only works with Euclidean distances, but these methods
willingly handle any other dissimilarities (or distances). Sometimes
this results in anomalies, such as upper levels lying below lower levels
in cluster diagrams, or negative eigenvalues in cmdscale. In principle,
kmeans could do the same if she only wanted.
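To make the negative eigenvalues concrete, here is a minimal sketch in R. The three-object dissimilarity matrix is a made-up example that violates the triangle inequality, and the double centring is written out by hand on the assumption that this mirrors what cmdscale does internally with the squared dissimilarities.

```r
# Made-up dissimilarities among three objects; d(1,3) = 3 exceeds
# d(1,2) + d(2,3) = 2, so the triangle inequality is violated.
d <- matrix(c(0, 1, 3,
              1, 0, 1,
              3, 1, 0), nrow = 3, byrow = TRUE)

# Double centring of the squared dissimilarities:
# B = -1/2 * J %*% D^2 %*% J, where J = I - (1/n) * 11'
n <- nrow(d)
J <- diag(n) - matrix(1 / n, n, n)
B <- -0.5 * J %*% d^2 %*% J

# Eigenvalues of B; no Euclidean configuration exists for d,
# so one of them is negative.
ev <- eigen(B, symmetric = TRUE)$values
print(ev)

# cmdscale() reports the same eigenvalues (all n of them in current R).
ev_cmd <- cmdscale(as.dist(d), k = 1, eig = TRUE)$eig
print(all.equal(ev, ev_cmd))
```

With Euclidean distances in place of d, every eigenvalue would be non-negative up to rounding error.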

Is it correct to use non-Euclidean dissimilarities when Euclidean
distances were assumed? In my field (ecology) we know that Euclidean
distances are often poor and that some other dissimilarities have better
properties, and I think it is OK to break the rules (or 'violate the
assumptions'). Now, we don't know what kind of dissimilarities were used
in the original post (I don't think I ever saw this specified), so we
don't know whether they can be euclidized directly using the ideas of
Petzold or Simpson. They might be semimetric or other sinful
dissimilarities, too. These would be bad in the sense Uwe Ligges wrote
about: you wouldn't get centres of Voronoi polygons in the original
space, not even non-overlapping polygons. Still, they might work better
than the original space (who wants to be in the original space when
there are better spaces floating around?).

The following trick handles the problem by euclidizing the space implied
by any dissimilarity measure (metric or semimetric). Here mdata is your
original (rectangular) data matrix, and dis contains any dissimilarities:

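tmp <- cmdscale(dis, k=min(dim(mdata))-1, eig=TRUE)

# current R returns all n eigenvalues in tmp$eig, so match them to the columns of tmp$points
keep <- tmp$eig[seq_len(ncol(tmp$points))] > 0.01
eucspace <- tmp$points[, keep, drop=FALSE]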

The condition removes axes with negative or almost-zero eigenvalues,
which you will get with semimetric dissimilarities.
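As a hedged illustration of that filtering step with a semimetric, the sketch below uses Bray-Curtis dissimilarities. The data are simulated counts, bray_curtis is a small helper written here for illustration (not a function from any package), and 0.01 is simply the threshold used above.

```r
set.seed(42)

# Simulated community data (an assumption for illustration):
# 20 sites x 6 species, two groups of sites with opposite abundances.
lambda <- rbind(matrix(rep(c(10, 8, 6, 1, 1, 1), each = 10), nrow = 10),
                matrix(rep(c(1, 1, 1, 6, 8, 10), each = 10), nrow = 10))
mdata <- matrix(rpois(length(lambda), lambda), nrow = nrow(lambda))

# Illustrative helper: Bray-Curtis dissimilarity between all pairs of rows,
# sum(|x_i - x_j|) / sum(x_i + x_j). It is a semimetric.
bray_curtis <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
    }
  }
  as.dist(d)
}

dis <- bray_curtis(mdata)
tmp <- cmdscale(dis, k = min(dim(mdata)) - 1, eig = TRUE)

# Keep only axes whose eigenvalue is clearly positive.
axis_eig <- tmp$eig[seq_len(ncol(tmp$points))]
keep <- axis_eig > 0.01
eucspace <- tmp$points[, keep, drop = FALSE]

print(range(tmp$eig))   # negative values here signal a non-Euclidean dis
print(dim(eucspace))    # sites x retained axes
```

Inspecting range(tmp$eig) is a quick way to see how far a given dis departs from a Euclidean configuration before passing eucspace on.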

Then just call kmeans with eucspace as its argument. If your dis is
Euclidean, this is only a rotation, and kmeans of eucspace and of mdata
should give equal results (provided k keeps every axis with a positive
eigenvalue). For other types of dis (even for semimetric
dissimilarities), this maps your dissimilarities onto a Euclidean space,
which in effect is the same as performing kmeans with your original
dissimilarities.
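The Euclidean case can be checked with a short sketch. The data below are simulated, the starting rows are chosen by hand so that both runs are comparable, and k is set to min(nrow(mdata) - 1, ncol(mdata)), which is my assumption rather than the value used above: with more rows than columns it keeps every axis the Euclidean configuration needs.

```r
set.seed(1)

# Simulated data (an assumption): 60 rows in 3 well-separated groups,
# 3 variables each.
mdata <- rbind(matrix(rnorm(60, mean = 0), ncol = 3),
               matrix(rnorm(60, mean = 10), ncol = 3),
               matrix(rnorm(60, mean = 20), ncol = 3))
dis <- dist(mdata)   # Euclidean distances

# Assumed choice of k: enough axes to reproduce the distances exactly.
k <- min(nrow(mdata) - 1, ncol(mdata))
tmp <- cmdscale(dis, k = k, eig = TRUE)
keep <- tmp$eig[seq_len(ncol(tmp$points))] > 0.01
eucspace <- tmp$points[, keep, drop = FALSE]

# The mapping preserves the distances between rows.
print(all.equal(as.vector(dist(eucspace)), as.vector(dis)))

# Start both runs from the same rows, one per simulated group.
start <- c(1, 21, 41)
km_data <- kmeans(mdata, centers = mdata[start, ])
km_euc <- kmeans(eucspace, centers = eucspace[start, ])

# Cross-tabulate the two partitions; they should agree.
print(table(km_data$cluster, km_euc$cluster))
```

With random starts instead of fixed rows the cluster labels may be permuted between runs, so compare the partitions rather than the label values.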

Cheers, jari oksanen

-- Jari Oksanen, Oulu, Finland


*This archive was generated by hypermail 2.1.3: Mon 31 May 2004 - 23:05:14 EST*