From: Moritz Lennert <mlennert_at_club.worldonline.be>

Date: Mon 15 May 2006 - 01:08:04 EST

*>
*

> Sorry, Moritz, I meant to reply to your original post, but deleted it

*> from my emailer accidentally and hadn't had chance to use the archives
*

*> to follow up.
*

*>
*

*> Anyway, Venables and Ripley's Modern Applied Statistics with S (4th Ed)
*

*> [and earlier editions - it is in my 3rd Edition for example] has an
*

*> example of doing what you want to do on page 318 of the 4th Edition.
*

*> They use the centre's of the hclust results as starting points for a
*

*> k-means, so we only need the preliminary bits of their example:
*

*>
*

*> library(MASS)
*

*> swiss.x <- as.matrix(swiss)
*

*> h <- hclust(dist(swiss.x), method = "average")
*

*> initial <- tapply(swiss.x, list(rep(cutree(h, 3), ncol(swiss.x)),
*

*> col(swiss.x)),
*

*> mean)
*

*> dimnames(initial) <- list(NULL, dimnames(swiss.x)[[2]])
*

*> initial
*

*>
*

*> Which gives almost the same output as your function:
*

*>
*

*> fun <- function (data, clust) {
*

*> nvars=length(data[1,])
*

*> ntypes=max(clust)
*

*> centroids<-matrix(0,ncol=nvars,nrow=ntypes)
*

*> for(i in 1:ntypes) {
*

*> c<-rep(0,nvars)
*

*> n<-0
*

*> for(j in names(clust[clust==i])) {
*

*> n<-n+1
*

*> c<-c+data[j,]
*

*> }
*

*> centroids[i,]<-c/n
*

*> }
*

*> rownames(centroids)<-c(1:ntypes)
*

*> colnames(centroids)<-colnames(data)
*

*> centroids
*

*> }
*

*>
*

*> fun(swiss.x, cutree(h, 3))
*

*>
*

*> Wrapping the Venables & Ripley version into a function to give the same
*

*> output as your function:
*

*>
*

*> ##
*

*> ## clust.means - function to find centroids of clusters
*

*> ## based on example by Venables & Ripley, MASS 4thEd, Page 318 [1]
*

*> ##
*

*> ## x = input data as data.frame or matrix
*

*> ## res.clust = object of class "hclust"
*

*> ## groups = number of groups to cut dendrogram into
*

*> ##
*

*> ## References:
*

*> ##
*

*> ## [1] Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics
*

*> ## with S. 4th Edition. Springer.
*

*> clust.means <- function(x, res.clust, groups)
*

*> {
*

*> if(!is.matrix(x))
*

*> x <- as.matrix(x)
*

*> means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)),
*

*> col(x)),
*

*> mean)
*

*> dimnames(means) <- list(NULL, dimnames(x)[[2]])
*

*> return(as.data.frame(means))
*

*> }
*

*>
*

*> clust.means(swiss, h, 3)
*

> [1] 8.917 0.000 9.695 0.000 0.000

> [1] 31.642 0.008 35.348 0.000 0.000

*>
*

*> But I think the example is instructive about using R. Sometimes
*

*> vectorisation can make a big time saving over a loop - here it doesn't.
*

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Mon May 15 01:11:55 2006

Date: Mon 15 May 2006 - 01:08:04 EST

Dear Gavin,

Gavin Simpson wrote:

> On Wed, 2006-05-10 at 18:59 +0200, Moritz Lennert wrote:

>> Replying to myself for the record: >> >> Moritz Lennert wrote: >>> Hello, >>> >>> Can someone point me to documentation or ideas on how to calculate the >>> centroids of clusters identified with hclust ? >>> >>> I would like to be able to chose the number of clusters (in the style of >>> cutree) and then get the centroids of these clusters. >>> >>> This seems like a quite obvious task to me, but I haven't been able to >>> put my hands on a relevant command.

> Sorry, Moritz, I meant to reply to your original post, but deleted it

I have a weird error happening here:

when I run the line

means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)), col(x)), mean)

directly on the command line, it works. But when I try to run the clust.means function, I get:

Error in rep(cutree(res.clust, groups), ncol(x)) :

could not find function "cutree"

> Your function is faster here:

*>
*

>> system.time(for(i in 1:10000) fun(swiss.x, cutree(h, 3)))

> [1] 8.917 0.000 9.695 0.000 0.000

>> system.time(for(i in 1:10000) clust.means(swiss, h, 3))

> [1] 31.642 0.008 35.348 0.000 0.000

Yes, thank you very much. I will have to read up a bit more on tapply to be able to understand it completely, but it is certainly a good learning example.

Thanks,

Moritz

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Mon May 15 01:11:55 2006

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.1.8, at Mon 15 May 2006 - 04:10:07 EST.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*