Re: [R] Optimal initial centroid in kmeans

From: Gavin Simpson <gavin.simpson_at_ucl.ac.uk>
Date: Thu, 03 Jul 2008 09:33:12 +0100

On Thu, 2008-07-03 at 11:35 +0800, Chua Siang Li wrote:
> Hello there. I am using kmeans of base package to cluster my customers. As
> the results of kmeans are dependent on the initial centroids, may I know:
> 1) how can we specify the centroids in the R function? (I don't want a random
> starting point)

You can specify the coordinates, on the variables you are clustering, of the k centroids you wish to start from. Pass these as the 'centers' argument of kmeans(), as a matrix with one row per centroid and one column per variable. So you can start from any centroids you wish.
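As a minimal sketch of passing explicit starting centroids: the toy data and the two starting points below are made up for illustration and stand in for your own customer variables.

```r
# Toy data standing in for the customer variables (an assumption).
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))
colnames(x) <- c("v1", "v2")

# One row per starting centroid, one column per clustering variable.
start <- rbind(c(0, 0),
               c(3, 3))

# kmeans() iterates from 'start' instead of picking random rows of x.
fit <- kmeans(x, centers = start)
fit$centers   # final centroids
```

Because no random start is involved, repeated runs on the same data give the same solution.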

One option here is to do a hierarchical clustering (using, say, average linkage or Ward's method) of your data, select a number of clusters and compute the centroids of those clusters, then use those centroids as the starting points for kmeans().

MASS (the book) by Venables and Ripley (2002, Modern Applied Statistics with S, 4th Ed., Springer) has an example and R scripts to follow. It is in the multivariate chapter (sorry I can't be more specific, my copy of the book is at work). The R scripts come with the MASS package (in the VR bundle) that is part of R, so have a look for them in your installation. On my linux box they are in:

R_HOME/library/MASS/scripts/

where R_HOME is the location where R is installed or running from.
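The hierarchical-clustering start described above could look something like the following. This is my own sketch, not the script from the book; the toy data, the choice of k = 3 and the average-linkage method are assumptions for illustration.

```r
# Toy data with three groups (an assumption; use your own data matrix).
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = c(0, 3), sd = 0.3), ncol = 2, byrow = TRUE))

k <- 3

# Step 1: hierarchical clustering, cut into k groups.
hc <- hclust(dist(x), method = "average")
groups <- cutree(hc, k = k)

# Step 2: centroid of each group = column means within the group.
# Result is a k x ncol(x) matrix, one row per starting centroid.
init <- apply(x, 2, function(col) tapply(col, groups, mean))

# Step 3: refine with kmeans(), starting from those centroids.
fit <- kmeans(x, centers = init)
table(hclust = groups, kmeans = fit$cluster)
```

The cross-tabulation at the end shows how many observations kmeans() moved away from their hierarchical cluster.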

> 2) how to determine the optimal (if not, a good) centroid to start with? (I
> am not after the fixed seed solution as it only ensures that the clustering is
> the same at every run, not necessarily that it is a good clustering.)

For anything other than small problems I suspect that you either can't do it at all or can't do it in a reasonable amount of time. There are a vast number of possible configurations to evaluate. The recommendation is to use several random starts and either compare the resulting solutions or keep the best one, i.e. the one with the smallest total within-cluster sum of squares.

kmeans() has argument nstart to specify how many random starts to try.
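A small sketch of nstart in use, on made-up data (the data and the choice of 25 starts are assumptions for illustration):

```r
# Toy data with three groups (an assumption; use your own data matrix).
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = c(0, 3), sd = 0.5), ncol = 2, byrow = TRUE))

# A single random start versus the best of 25 random starts.
fit1  <- kmeans(x, centers = 3, nstart = 1)
fit25 <- kmeans(x, centers = 3, nstart = 25)

# Total within-cluster sum of squares; smaller is better.
c(single = fit1$tot.withinss, best_of_25 = fit25$tot.withinss)
```

With nstart > 1, kmeans() keeps only the run with the smallest total within-cluster sum of squares, so the individual runs are not returned.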

cascadeKM() in package vegan allows you to do the many random starts and it retains the best solution for k = 2, ..., n, where n is specified by the user. This function has two criteria to evaluate the optimal k (for the k's tried) so can guide you as to how many clusters to retain and then use the best of the random starts for that k. But remember, you haven't tried *all* solutions so these criteria are a guide only.
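Roughly, and assuming vegan is installed, that might be used as below. The toy data, the range k = 2 to 6 and the number of random starts are made up for illustration; check ?cascadeKM for the details of the returned object.

```r
library(vegan)

# Toy data with three groups (an assumption; use your own data matrix).
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = c(0, 3), sd = 0.3), ncol = 2, byrow = TRUE))

# Best of 50 random starts for each k from 2 to 6,
# evaluated with the Calinski-Harabasz criterion.
ckm <- cascadeKM(x, inf.gr = 2, sup.gr = 6, iter = 50, criterion = "calinski")

# First row: SSE for each k; second row: the criterion for each k.
ckm$results

# Column with the largest criterion value, and the partition for that k.
best <- which.max(ckm$results[2, ])
membership <- ckm$partition[, best]
table(membership)
```

plot(ckm) gives a graphical summary of the partitions and the criterion across the k's tried.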

HTH

G

> Many Thanks.
> siangli
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Received on Thu 03 Jul 2008 - 08:45:33 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 03 Jul 2008 - 09:31:57 GMT.
