# Re: [R] finding centroids of clusters created with hclust

From: Moritz Lennert <mlennert_at_club.worldonline.be>
Date: Mon 15 May 2006 - 01:08:04 EST

Dear Gavin,

Gavin Simpson wrote:
> On Wed, 2006-05-10 at 18:59 +0200, Moritz Lennert wrote:

```>> Replying to myself for the record:
>>
>> Moritz Lennert wrote:
>>> Hello,
>>>
>>> Can someone point me to documentation or ideas on how to calculate the
>>> centroids of clusters identified with hclust ?
>>>
>>> I would like to be able to chose the number of clusters (in the style of
>>> cutree) and then get the centroids of these clusters.
>>>
>>> This seems like a quite obvious task to me, but I haven't been able to
>>> put my hands on a relevant command.
```

>
> Sorry, Moritz, I meant to reply to your original post, but deleted it
> from my emailer accidentally and hadn't had chance to use the archives
>
> Anyway, Venables and Ripley's Modern Applied Statistics with S (4th Ed)
> [and earlier editions - it is in my 3rd Edition for example] has an
> example of doing what you want to do on page 318 of the 4th Edition.
> They use the centre's of the hclust results as starting points for a
> k-means, so we only need the preliminary bits of their example:
>
> library(MASS)
> swiss.x <- as.matrix(swiss)
> h <- hclust(dist(swiss.x), method = "average")
> initial <- tapply(swiss.x, list(rep(cutree(h, 3), ncol(swiss.x)),
> col(swiss.x)),
> mean)
> dimnames(initial) <- list(NULL, dimnames(swiss.x)[])
> initial
>
> Which gives almost the same output as your function:
>
> fun <- function (data, clust) {
> nvars=length(data[1,])
> ntypes=max(clust)
> centroids<-matrix(0,ncol=nvars,nrow=ntypes)
> for(i in 1:ntypes) {
> c<-rep(0,nvars)
> n<-0
> for(j in names(clust[clust==i])) {
> n<-n+1
> c<-c+data[j,]
> }
> centroids[i,]<-c/n
> }
> rownames(centroids)<-c(1:ntypes)
> colnames(centroids)<-colnames(data)
> centroids
> }
>
> fun(swiss.x, cutree(h, 3))
>
> Wrapping the Venables & Ripley version into a function to give the same
>
> ##
> ## clust.means - function to find centroids of clusters
> ## based on example by Venables & Ripley, MASS 4thEd, Page 318 
> ##
> ## x = input data as data.frame or matrix
> ## res.clust = object of class "hclust"
> ## groups = number of groups to cut dendrogram into
> ##
> ## References:
> ##
> ##  Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics
> ## with S. 4th Edition. Springer.
> clust.means <- function(x, res.clust, groups)
> {
> if(!is.matrix(x))
> x <- as.matrix(x)
> means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)),
> col(x)),
> mean)
> dimnames(means) <- list(NULL, dimnames(x)[])
> return(as.data.frame(means))
> }
>
> clust.means(swiss, h, 3)

I have a weird error happening here:

when I run the line

means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)), col(x)), mean)

directly on the command line, it works. But when I try to run the clust.means function, I get:

Error in rep(cutree(res.clust, groups), ncol(x)) :

could not find function "cutree"

```>> system.time(for(i in 1:10000) fun(swiss.x, cutree(h, 3)))
```

>  8.917 0.000 9.695 0.000 0.000
```>> system.time(for(i in 1:10000) clust.means(swiss, h, 3))
```

>  31.642 0.008 35.348 0.000 0.000
>
> But I think the example is instructive about using R. Sometimes
> vectorisation can make a big time saving over a loop - here it doesn't.

Yes, thank you very much. I will have to read up a bit more on tapply to be able to understand it completely, but it is certainly a good learning example.

Thanks,

Moritz

R-help@stat.math.ethz.ch mailing list