Re: [R] Cluster analysis with numeric and categorical variables

From: Christian Hennig <chrish_at_stats.ucl.ac.uk>
Date: Tue, 03 Jun 2008 12:58:40 +0100 (BST)

Dear Miha,

a general way to do this is as follows:
Define a distance measure by aggregating the Euclidean distance on the (X,Y)-space and the trivial 0-1 distance (0 if the category is the same, 1 otherwise) on the categorical variable. Perform cluster analysis (whichever method you want) on the resulting distance matrix.
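
A minimal sketch of this recipe in base R, under stated assumptions: the data frame `dat`, its column names, the weight `w` and the choice of average-linkage clustering with two clusters are all made up for illustration and are not prescribed by anything above.

```r
# Hypothetical example data: two coordinates and one unordered category.
set.seed(1)
dat <- data.frame(
  X   = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  Y   = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  cat = factor(rep(c("a", "b"), each = 10))
)

# Euclidean distance on the (X, Y)-space.
d_xy <- dist(dat[, c("X", "Y")])

# 0-1 distance on the category: 0 if equal, 1 otherwise.
lab   <- as.character(dat$cat)
d_cat <- as.dist(outer(lab, lab, FUN = "!=") * 1)

# Aggregate the two distances. The weight w is an assumption;
# it should be chosen from subject-matter knowledge.
w <- 1
d <- d_xy + w * d_cat

# Any clustering method that accepts a distance matrix can be used;
# average-linkage hierarchical clustering is only one possible choice.
hc     <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)
table(groups, dat$cat)
```

Other methods that take a dissimilarity object could be substituted for `hclust` without changing the construction of `d`.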

Note that there is more than one way to do this. The 0-1 distance could be incorporated in the definition of the Euclidean distance (taking the place of the squared difference (x_i-y_i)^2 for that variable), or a weighted average of the distances in X-, Y- and categorical space could be computed. The weights of the variables (including possible rescaling) have to be decided. How to do this precisely should depend on the subject matter and prior information about variable importance etc. In the absence of such information, you may standardise the variable-wise sums of squared pairwise distances to be equal.
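
One possible reading of the standardisation suggestion, sketched in base R. The example data, the helper name `standardise` and the target value 1 for each sum of squares are assumptions for illustration. Treating X and Y as two separate variables is also a choice; for spatial coordinates one may prefer to keep them on a common scale.

```r
# Hypothetical example data: two coordinates and one unordered category.
set.seed(1)
dat <- data.frame(
  X   = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  Y   = c(rnorm(10, mean = 0), rnorm(10, mean = 5)),
  cat = factor(rep(c("a", "b"), each = 10))
)

# Pairwise distances computed separately for each variable.
lab <- as.character(dat$cat)
d_list <- list(
  X   = dist(dat$X),
  Y   = dist(dat$Y),
  cat = as.dist(outer(lab, lab, FUN = "!=") * 1)  # 0-1 distance
)

# Rescale each distance so that its sum of squared pairwise
# distances equals 1 (hypothetical helper, not a library function).
standardise <- function(d) d / sqrt(sum(d^2))
d_std <- lapply(d_list, standardise)

# Combine in Euclidean fashion: the squared 0-1 distance enters the
# sum in the same way as the squared coordinate differences.
d_total <- sqrt(d_std$X^2 + d_std$Y^2 + d_std$cat^2)

# Each variable now contributes the same total squared distance.
sapply(d_std, function(d) sum(d^2))
```

Unequal weights can be introduced by multiplying each squared term by a weight before summing.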

Hope this helps (and you can figure out the relevant R code yourself).

Christian

On Tue, 3 Jun 2008, Miha Staut wrote:

> Dear all,
>
> I would like to perform a clustering analysis on a data frame with two coordinate variables (X and Y) and a categorical variable where only a != b can be established. As far as I understood classification analyses, they are not an option as they partition the training set only in k classes of the test set. By searching through the book "Modern Applied Statistics with S" I did not find a satisfactory solution.
>
> I will be grateful for any suggestions.
>
> Best regards
> Miha
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


Received on Tue 03 Jun 2008 - 15:22:43 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 03 Jun 2008 - 16:30:35 GMT.
