Re: [R] finding centroids of clusters created with hclust

From: Gavin Simpson <gavin.simpson_at_ucl.ac.uk>
Date: Thu 11 May 2006 - 04:17:35 EST

On Wed, 2006-05-10 at 18:59 +0200, Moritz Lennert wrote:
> Replying to myself for the record:
>
> Moritz Lennert wrote:
> > Hello,
> >
> > Can someone point me to documentation or ideas on how to calculate the
> > centroids of clusters identified with hclust ?
> >
> > I would like to be able to chose the number of clusters (in the style of
> > cutree) and then get the centroids of these clusters.
> >
> > This seems like a quite obvious task to me, but I haven't been able to
> > put my hands on a relevant command.

Sorry, Moritz, I meant to reply to your original post, but deleted it from my emailer accidentally and hadn't had chance to use the archives to follow up.

Anyway, Venables and Ripley's Modern Applied Statistics with S (4th Ed) [and earlier editions - it is in my 3rd Edition for example] has an example of doing what you want to do on page 318 of the 4th Edition. They use the centre's of the hclust results as starting points for a k-means, so we only need the preliminary bits of their example:

library(MASS)
swiss.x <- as.matrix(swiss)
h <- hclust(dist(swiss.x), method = "average") initial <- tapply(swiss.x, list(rep(cutree(h, 3), ncol(swiss.x)),

                                col(swiss.x)),
                  mean)

dimnames(initial) <- list(NULL, dimnames(swiss.x)[[2]]) initial

Which gives almost the same output as your function:

fun <- function (data, clust) {
  nvars=length(data[1,])
  ntypes=max(clust)
  centroids<-matrix(0,ncol=nvars,nrow=ntypes)   for(i in 1:ntypes) {
    c<-rep(0,nvars)
    n<-0
    for(j in names(clust[clust==i])) {

      n<-n+1
      c<-c+data[j,]

    }
    centroids[i,]<-c/n
  }
  rownames(centroids)<-c(1:ntypes)
  colnames(centroids)<-colnames(data)
  centroids
}

fun(swiss.x, cutree(h, 3))

Wrapping the Venables & Ripley version into a function to give the same output as your function:

##
## clust.means - function to find centroids of clusters
## based on example by Venables & Ripley, MASS 4thEd, Page 318 [1]
##
## x            = input data as data.frame or matrix
## res.clust    = object of class "hclust"
## groups       = number of groups to cut dendrogram into
##
## References:
##
## [1] Venables, W.N. and Ripley, B.D. (2002) Modern Applied Statistics 
##     with S. 4th Edition. Springer.

clust.means <- function(x, res.clust, groups)   {
    if(!is.matrix(x))
      x <- as.matrix(x)
    means <- tapply(x, list(rep(cutree(res.clust, groups), ncol(x)),
                                  col(x)),
                    mean)

    dimnames(means) <- list(NULL, dimnames(x)[[2]])     return(as.data.frame(means))
  }

clust.means(swiss, h, 3)

Your function is faster here:

> system.time(for(i in 1:10000) fun(swiss.x, cutree(h, 3)))
[1] 8.917 0.000 9.695 0.000 0.000
>
> system.time(for(i in 1:10000) clust.means(swiss, h, 3))
[1] 31.642 0.008 35.348 0.000 0.000

But I think the example is instructive about using R. Sometimes vectorisation can make a big time saving over a loop - here it doesn't.

HTH G

>
> Here's a simple function that does the job for me:
>
> Variables:
>
> data: matrix of original (absolute value) data introduced into hclust or
> HierClust
> clust: result of a 'cutree' call on the results of the hclust or
> HierClust call
>
> Value:
>
> a matrix of relative values of the variables at the centroids of the types
>
>
> function (data, clust) {
> nvars=length(data[1,])
> ntypes=max(clust)
> centroids<-matrix(0,ncol=nvars,nrow=ntypes)
> for(i in 1:ntypes) {
> c<-rep(0,nvars)
> n<-0
> for(j in names(clust[clust==i])) {
> n<-n+1
> c<-c+data[j,]
> }
> centroids[i,]<-c/n
> }
> rownames(centroids)<-c(1:ntypes)
> colnames(centroids)<-colnames(data)
> centroids
> }
>
> Moritz
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
*  Note new Address, Telephone & Fax numbers from 6th April 2006  *
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson
ECRC & ENSIS                  [t] +44 (0)20 7679 0522
UCL Department of Geography   [f] +44 (0)20 7679 0565
Pearson Building              [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street                  [w] http://www.ucl.ac.uk/~ucfagls/cv/
London, UK.                   [w] http://www.ucl.ac.uk/~ucfagls/
WC1E 6BT.
>%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Thu May 11 04:22:35 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Mon 15 May 2006 - 02:10:05 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.