# Re: [R] finding centroids of clusters created with hclust

From: Gavin Simpson <gavin.simpson_at_ucl.ac.uk>
Date: Thu 11 May 2006 - 04:17:35 EST

On Wed, 2006-05-10 at 18:59 +0200, Moritz Lennert wrote:
> Replying to myself for the record:
>
> Moritz Lennert wrote:
> > Hello,
> >
> > Can someone point me to documentation or ideas on how to calculate the
> > centroids of clusters identified with hclust ?
> >
> > I would like to be able to choose the number of clusters (in the style of
> > cutree) and then get the centroids of these clusters.
> >
> > This seems like a quite obvious task to me, but I haven't been able to
> > put my hands on a relevant command.

Venables and Ripley's Modern Applied Statistics with S (4th Ed) [and earlier editions - it is in my 3rd Edition for example] has an example of doing what you want on page 318 of the 4th Edition. They use the centres of the hclust clusters as starting points for a k-means, so we only need the preliminary bits of their example:

```
library(MASS)
swiss.x <- as.matrix(swiss)
h <- hclust(dist(swiss.x), method = "average")
initial <- tapply(swiss.x,
                  list(rep(cutree(h, 3), ncol(swiss.x)),
                       col(swiss.x)),
                  mean)
dimnames(initial) <- list(NULL, dimnames(swiss.x)[[2]])
initial
```

Which gives almost the same output as your function:

```
fun <- function (data, clust) {
    nvars <- length(data[1, ])
    ntypes <- max(clust)
    centroids <- matrix(0, ncol = nvars, nrow = ntypes)
    for(i in 1:ntypes) {
        c <- rep(0, nvars)
        n <- 0
        for(j in names(clust[clust == i])) {
            n <- n + 1
            c <- c + data[j, ]
        }
        centroids[i, ] <- c / n
    }
    rownames(centroids) <- c(1:ntypes)
    colnames(centroids) <- colnames(data)
    centroids
}

fun(swiss.x, cutree(h, 3))
```
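To see what "almost" means here, a minimal sketch that recomputes both sets of centroids and compares the numbers. The name by.cluster is only for illustration, and colMeans() stands in for the explicit loop; as far as I can tell the only difference between the two results is the row names, which the tapply() version sets to NULL.

```r
# 'swiss' ships with R's datasets package, so nothing else is needed
swiss.x <- as.matrix(swiss)
h <- hclust(dist(swiss.x), method = "average")
cl <- cutree(h, 3)

# Venables & Ripley style: one tapply() over all cells of the matrix
initial <- tapply(swiss.x, list(rep(cl, ncol(swiss.x)), col(swiss.x)), mean)
dimnames(initial) <- list(NULL, dimnames(swiss.x)[[2]])

# Stand-in for the loop version (illustrative name): column means per cluster
by.cluster <- t(sapply(sort(unique(cl)), function(i)
    colMeans(swiss.x[cl == i, , drop = FALSE])))

# Compare the values only, ignoring row and column names
same.values <- isTRUE(all.equal(unname(initial), unname(by.cluster)))
same.values
```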

Wrapping the Venables & Ripley version into a function to give the same output as your function:

```
##
## clust.means - function to find centroids of clusters
## based on example by Venables & Ripley, MASS 4thEd, Page 318 [1]
##
## x            = input data as data.frame or matrix
## res.clust    = object of class "hclust"
## groups       = number of groups to cut dendrogram into
##
## References:
##
## [1] Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics
##     with S. 4th Edition. Springer.

clust.means <- function(x, res.clust, groups) {
    if(!is.matrix(x))
        x <- as.matrix(x)
    means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)),
                            col(x)),
                    mean)
    dimnames(means) <- list(NULL, dimnames(x)[[2]])
    return(as.data.frame(means))
}

clust.means(swiss, h, 3)
```
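In case the tapply() indexing looks opaque, here is a small sketch on made-up data (the matrix x and the labels cl are invented for illustration, not taken from swiss). The idea is that every cell of the matrix gets two labels, its cluster and its column, and tapply() averages the cells sharing both.

```r
# Toy data: 4 observations, 2 variables, hypothetical cluster labels
x <- matrix(c(1, 3, 10, 20,     # column "a"
              2, 4, 30, 50),    # column "b"
            ncol = 2, dimnames = list(NULL, c("a", "b")))
cl <- c(1, 1, 2, 2)

# One cluster label per cell: the labels repeated once for each column
row.index <- rep(cl, ncol(x))
# One column number per cell
col.index <- col(x)

# Mean of the cells in each (cluster, column) combination
means <- tapply(x, list(row.index, col.index), mean)
dimnames(means) <- list(NULL, colnames(x))
means
```

Cluster 1 holds rows 1 and 2, so its centroid is (2, 3); cluster 2 holds rows 3 and 4, giving (15, 40).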

Timing 10000 calls of each version:

> system.time(for(i in 1:10000) fun(swiss.x, cutree(h, 3)))
[1] 8.917 0.000 9.695 0.000 0.000
>
> system.time(for(i in 1:10000) clust.means(swiss, h, 3))
[1] 31.642 0.008 35.348 0.000 0.000
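If speed matters, one possible alternative is to let rowsum() add up the columns within each cluster and then divide by the cluster sizes. This is only a sketch: clust.means2 is a name I made up, it was not part of the timings above, and I have not timed it, so treat any speed benefit as an assumption to check on your own data.

```r
# Hypothetical helper (illustrative name): centroids via rowsum()
clust.means2 <- function(x, res.clust, groups) {
    x <- as.matrix(x)
    cl <- cutree(res.clust, groups)
    # Column sums within each cluster, one row per cluster (sorted by label)
    sums <- rowsum(x, cl)
    # Divide each row by its cluster size; table() uses the same label order
    sums / as.vector(table(cl))
}

h <- hclust(dist(as.matrix(swiss)), method = "average")
centroids2 <- clust.means2(swiss, h, 3)
centroids2
```

It returns a matrix with the cluster numbers as row names rather than a data frame.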

HTH

G

>
> Here's a simple function that does the job for me:
>
> Variables:
>
> data: matrix of original (absolute value) data introduced into hclust or
> HierClust
> clust: result of a 'cutree' call on the results of the hclust or
> HierClust call
>
> Value:
>
> a matrix of relative values of the variables at the centroids of the types
>
>
> function (data, clust) {
>     nvars=length(data[1,])
>     ntypes=max(clust)
>     centroids<-matrix(0,ncol=nvars,nrow=ntypes)
>     for(i in 1:ntypes) {
>         c<-rep(0,nvars)
>         n<-0
>         for(j in names(clust[clust==i])) {
>             n<-n+1
>             c<-c+data[j,]
>         }
>         centroids[i,]<-c/n
>     }
>     rownames(centroids)<-c(1:ntypes)
>     colnames(centroids)<-colnames(data)
>     centroids
> }
>
> Moritz
>

