[R] Cluster analysis, factor variables, large data set

From: Hans Ekbrand <hans_at_sociologi.cjb.net>
Date: Thu, 31 Mar 2011 19:46:27 +0200

Dear R helpers,

I have a large data set with 36 variables and about 50,000 cases. The variables represent labour market status over 36 months; each variable takes one of 8 values (e.g. Full-time Employment, Student, ...).

Only cases with at least one change in labour market status are included in the data set.

To analyse subsets of the data, I have used daisy in the cluster package to create a distance matrix and then used pam (or pamk in the fpc package) to get a k-medoids cluster solution. Now I want to analyse the whole set.
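
For reference, the subset workflow described above could look roughly like the sketch below. The data are simulated stand-ins, not the real data set: the names sub, status_levels and n_cases, the subset size of 200 and the choice k = 4 are all assumptions made for illustration.

```r
library(cluster)  # provides daisy() and pam()

set.seed(1)

# Hypothetical stand-in for a subset of the real data:
# 200 cases, 36 monthly status variables, 8 possible status values each.
status_levels <- paste0("status", 1:8)
n_cases <- 200
sub <- as.data.frame(
  matrix(sample(status_levels, n_cases * 36, replace = TRUE), nrow = n_cases,
         dimnames = list(NULL, paste0("month", 1:36)))
)
# Make every column a factor with the same 8 levels.
sub[] <- lapply(sub, factor, levels = status_levels)

# Gower dissimilarity; for purely nominal variables this is the
# proportion of months in which two cases have a different status.
d <- daisy(sub, metric = "gower")

# k-medoids on the dissimilarity object (k = 4 is an arbitrary choice here).
fit <- pam(d, k = 4)
table(fit$clustering)
```

pamk in the fpc package can be used in place of pam when the number of clusters should be chosen from a range rather than fixed in advance.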

clara is said to cope with large data sets, but the first step in the cluster analysis, the creation of the distance matrix, must be done by another function, since clara only works with numeric data.
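
One possible way around the numeric-only restriction, offered only as a sketch and not as a tested recommendation for this data, is to recode each factor as 0/1 indicator columns and let clara compute Manhattan distances itself. For purely nominal variables, the Manhattan distance on such indicators is 2 times the number of months in which two cases differ, so it is proportional to the Gower dissimilarity from daisy. The data below are simulated, and one_hot, n_cases, k = 4, samples and sampsize are assumptions made for illustration.

```r
library(cluster)  # provides clara() and daisy()

set.seed(1)

# Hypothetical stand-in data: 2000 cases, 36 factors with 8 levels each.
status_levels <- paste0("status", 1:8)
n_cases <- 2000
sub <- as.data.frame(
  matrix(sample(status_levels, n_cases * 36, replace = TRUE), nrow = n_cases,
         dimnames = list(NULL, paste0("month", 1:36)))
)
sub[] <- lapply(sub, factor, levels = status_levels)

# Hypothetical helper: one 0/1 indicator column per level of a factor.
one_hot <- function(f) outer(as.integer(f), seq_along(levels(f)), "==") * 1

# Numeric matrix with 36 * 8 indicator columns.
x <- do.call(cbind, lapply(sub, one_hot))

# Check on a small sample that Manhattan distance on the indicators
# equals 2 * (number of variables) * Gower dissimilarity.
idx <- 1:50
d_manhattan <- as.matrix(dist(x[idx, ], method = "manhattan"))
d_gower <- as.matrix(daisy(sub[idx, ], metric = "gower"))
identity_holds <- isTRUE(all.equal(d_manhattan, 2 * ncol(sub) * d_gower,
                                   check.attributes = FALSE))

# clara never builds the full distance matrix; it runs pam on
# sub-samples of size sampsize and assigns all cases to the best medoids.
fit <- clara(x, k = 4, metric = "manhattan", samples = 20, sampsize = 200)
table(fit$clustering)
```

Since clara only looks at sub-samples, its solution depends on samples and sampsize, and it need not coincide with what pam would give on the full distance matrix.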

Is there an alternative to the daisy -> clara route that does not require as much RAM?
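
To put a number on the RAM issue: a dissimilarity object as returned by daisy stores one double per pair of cases, i.e. n(n-1)/2 values of 8 bytes each. A quick calculation, assuming exactly 50,000 cases and ignoring any temporary copies made along the way (dist_bytes is a hypothetical helper name):

```r
# Storage needed for the lower triangle of an n x n distance matrix,
# assuming 8 bytes per stored distance (double precision).
dist_bytes <- function(n) n * (n - 1) / 2 * 8

n <- 50000
n_pairs <- n * (n - 1) / 2   # 1,249,975,000 pairwise distances
bytes <- dist_bytes(n)       # 9,999,800,000 bytes

bytes / 1e9    # about 10 GB
bytes / 2^30   # about 9.3 GiB
```

Converting the result to a full square matrix with as.matrix would roughly double that figure.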

What functions would you recommend for a cluster analysis of this kind of data on a large data set?


Hans Ekbrand

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 31 Mar 2011 - 18:02:47 GMT

