[R] Clustering for variable reduction

From: Gad Abraham <gabraham_at_csse.unimelb.edu.au>
Date: Sun, 06 Apr 2008 16:06:18 +1000


I have a regression model whose explanatory variables are factors. I want to include interaction terms, but some combinations of levels occur very infrequently in the data.

Hence, I'm using hclust and cutree to cluster the levels hierarchically and obtain new, combined levels to regress on.
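A minimal sketch of that workflow, under stated assumptions: the data are made up, a single factor stands in for the interaction cells (for two factors, `interaction(f1, f2, drop = TRUE)` could play the same role), and the levels are clustered on their mean response purely for illustration, since the post does not say which features the levels are clustered on.

```r
set.seed(42)

# Hypothetical data: one factor with six levels, three of them rare.
f <- factor(rep(c("a", "b", "c", "d", "e", "f"),
                times = c(50, 40, 3, 2, 30, 1)))
y <- rnorm(length(f), mean = as.integer(f))

# Cluster the levels on a per-level summary (here: mean response, an assumption).
level_means <- tapply(y, f, mean)
hc <- hclust(dist(level_means))

# cutree() returns one cluster id per level, named by level.
grp <- cutree(hc, k = 3)

# Assigning duplicated level names merges the corresponding levels.
f_combined <- f
levels(f_combined) <- paste0("g", grp[levels(f)])

# Regress on the combined levels.
fit <- lm(y ~ f_combined)
```

Here `table(f_combined)` shows how many observations each combined level ends up with.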

Ideally, I would like to be able to cut the tree so that every cluster has at least k observations. That is, cut each branch at its own appropriate height, combining nodes only when they have fewer than k observations.
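One possible reading of this, sketched below, is a top-down walk over the `merge` component of the hclust object: a node is split only if both of its children would keep at least k observations; otherwise the whole subtree becomes one cluster. The function name `cut_min_size`, the weight vector `w` (observations per leaf, in the order of the clustered rows), and the toy data are all hypothetical, and this is not an existing R function.

```r
# Cut an hclust tree top-down, splitting a node only if both children
# keep at least k observations. w[i] is the observation count of leaf i.
cut_min_size <- function(hc, w, k) {
  w <- as.numeric(w)
  merge <- hc$merge
  n <- nrow(merge) + 1L

  # For each merge step, record the leaves below it.
  # In hc$merge, a negative entry -i is leaf i; a positive entry is an earlier step.
  leaves <- vector("list", n - 1L)
  members <- function(j) if (j < 0) -j else leaves[[j]]
  for (i in seq_len(n - 1L)) {
    leaves[[i]] <- c(members(merge[i, 1]), members(merge[i, 2]))
  }
  node_size <- function(j) sum(w[members(j)])

  cl <- integer(n)
  id <- 0L
  stack <- n - 1L                       # root = last merge step
  while (length(stack) > 0) {
    j <- stack[1]
    stack <- stack[-1]
    if (j > 0 && node_size(merge[j, 1]) >= k && node_size(merge[j, 2]) >= k) {
      stack <- c(merge[j, ], stack)     # both halves are big enough: split
    } else {
      id <- id + 1L
      cl[members(j)] <- id              # keep this subtree as one cluster
    }
  }
  cl <- match(cl, unique(cl))           # renumber by first appearance
  names(cl) <- hc$labels
  cl
}

# Toy example: six levels with made-up features and observation counts.
x <- c(a = 1, b = 1.2, c = 3, d = 3.1, e = 5.5, f = 5.8)
w <- c(50, 40, 3, 2, 30, 1)
hc <- hclust(dist(x), method = "average")
cl <- cut_min_size(hc, w, k = 10)
sizes <- tapply(w, cl, sum)
```

Because every cluster is a subtree, a rare level that joins the tree only near the root will force a large merge. If the total count is below k, the result is a single cluster.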

AFAIK, cutree cuts at a uniform height and there's no easy way of extracting the number of observations per cluster from hclust (except by assigning the new levels to the data and then counting the occurrences).
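For what it's worth, the cluster sizes can be computed from the per-level counts alone, without relabelling the data. The sketch below assumes the leaves of the tree are the factor levels; the feature vector `x` and the counts are made up for illustration.

```r
# Hypothetical factor with unbalanced levels.
f <- factor(rep(c("a", "b", "c", "d", "e", "f"),
                times = c(50, 40, 3, 2, 30, 1)))

# Made-up per-level feature used to cluster the levels.
x <- c(a = 1, b = 1.2, c = 3, d = 3.1, e = 5.5, f = 5.8)
hc <- hclust(dist(x), method = "average")

# Observations per level, in the same order as the leaves of the tree.
n_level <- table(f)[hc$labels]

# For each candidate number of clusters, sum the level counts per cluster.
sizes <- lapply(2:4, function(k) {
  tapply(as.vector(n_level), cutree(hc, k = k), sum)
})
names(sizes) <- 2:4

# Smallest cluster for each candidate cut.
smallest <- sapply(sizes, min)
```

Scanning `smallest` shows which uniform cuts leave every cluster with at least k observations, although it still does not allow a different cut height per branch.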

Does anyone know of code that does this already?


Gad Abraham
Dept. CSSE and NICTA
The University of Melbourne
Parkville 3010, Victoria, Australia
email: gabraham_at_csse.unimelb.edu.au
web: http://www.csse.unimelb.edu.au/~gabraham

Received on Sun 06 Apr 2008 - 06:08:57 GMT