Re: [R] Cluster analysis, factor variables, large data set

From: Peter Langfelder <peter.langfelder_at_gmail.com>
Date: Thu, 31 Mar 2011 12:22:54 -0700

On Thu, Mar 31, 2011 at 11:48 AM, Hans Ekbrand <hans_at_sociologi.cjb.net> wrote:
>
> The variables are unordered factors, stored as integers 1:9, where
>
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employee"
> ...
>
> Does euclidean distances make sense on unordered factors coded as
> integers?

It probably doesn't. You said you have some 36 observations for each case, correct? You can turn these 36 observations into a vector of length 36 * 9 on which Euclidean distance does make some sense: two cases that differ in k observations end up at distance sqrt(2*k). For each observation with value p (p between 1 and 9), create a vector r = c(0,0,1,0,...,0) of length 9 whose only 1 is in the p-th component. Then, if values p1 and p2 are the same, the Euclidean distance between r1 and r2 is zero; if they differ, the distance is sqrt(2).
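A quick way to convince yourself of the sqrt(2*k) claim is the small check below. It is only a sketch: the helper oneHot and the two made-up cases (5 observations each instead of 36) are illustrative and not part of your data.

```r
# Hypothetical helper: indicator vector of length maxVal with a 1 at position p
oneHot = function(p, maxVal) { r = numeric(maxVal); r[p] = 1; r }

# Two made-up cases with 5 observations each, values in 1..9
case1 = c(1, 3, 3, 7, 9)
case2 = c(1, 4, 3, 7, 2)

# Number of observations in which the two cases differ
k = sum(case1 != case2)

# Concatenate the indicator vectors of all observations of each case
v1 = unlist(lapply(case1, oneHot, maxVal = 9))
v2 = unlist(lapply(case2, oneHot, maxVal = 9))

# Euclidean distance between the expanded vectors
d = sqrt(sum((v1 - v2)^2))

# Compare with the claimed value sqrt(2*k)
all.equal(d, sqrt(2 * k))
```

Each differing observation contributes exactly two unit differences (one 1 disappears, another appears), which is where the factor 2 comes from.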

Here's some possible R code:

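# Turn a vector of category codes (1..maxVal) into one long 0/1 indicator vector.
# Note: this name masks base::transform() for the rest of the session.
transform = function(obsVector, maxVal)
{
  templateMat = matrix(0, maxVal, maxVal);
  diag(templateMat) = 1;   # identity matrix: column p is the indicator of p
  return(as.vector(templateMat[, obsVector]));
}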

set.seed(10)
n = 4;
m = 5;
max = 4;
data = matrix(sample(c(1:max), n*m, replace = TRUE), m, n);

> data
     [,1] [,2] [,3] [,4]
[1,]    3    3    1    2
[2,]    1    3    3    2
[3,]    3    3    2    4
[4,]    1    2    4    2
[5,]    4    1    4    1


trafoData = apply(data, 2, transform, maxVal = max);

> trafoData
      [,1] [,2] [,3] [,4]
 [1,]    0    0    1    0
 [2,]    0    0    0    1
 [3,]    1    1    0    0
 [4,]    0    0    0    0
 [5,]    1    0    0    0
 [6,]    0    0    0    1
 [7,]    0    1    1    0
 [8,]    0    0    0    0
 [9,]    0    0    0    0
[10,]    0    0    1    0
[11,]    1    1    0    0
[12,]    0    0    0    1
[13,]    1    0    0    0
[14,]    0    1    0    1
[15,]    0    0    0    0
[16,]    0    0    1    0
[17,]    0    1    0    1
[18,]    0    0    0    0
[19,]    0    0    0    0
[20,]    1    0    1    0

The code assumes that cases are in columns and observations in rows of data. Examine data and trafoData to see how the transformation works. Once you have the transformed data, simply apply your favorite clustering method that uses Euclidean distance.
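As a rough sketch of that last step, here is one way it might look with hierarchical clustering. The choice of hclust with average linkage and of two clusters is only an assumption for illustration, and the function is renamed to oneHotCase here so that base::transform() is not masked. The example data are typed in explicitly so the result does not depend on the random number generator.

```r
# Same one-hot transformation as above, under a different (hypothetical) name
oneHotCase = function(obsVector, maxVal)
{
  templateMat = diag(maxVal)            # identity matrix
  as.vector(templateMat[, obsVector])   # stack the selected columns
}

# The example data printed above; cases in columns, observations in rows
data = matrix(c(3, 1, 3, 1, 4,
                3, 3, 3, 2, 1,
                1, 3, 2, 4, 4,
                2, 2, 4, 2, 1), nrow = 5, ncol = 4)
trafoData = apply(data, 2, oneHotCase, maxVal = 4)

# dist() computes distances between rows, so transpose to put cases in rows
d = dist(t(trafoData), method = "euclidean")

# Cases 1 and 2 differ in this many observations; their distance is sqrt(2 * that)
nMismatch12 = sum(data[, 1] != data[, 2])
d12 = as.matrix(d)[1, 2]

# Illustrative clustering choice: average linkage, cut into 2 groups
tree = hclust(d, method = "average")
groups = cutree(tree, k = 2)
```

With 36 observations per case and many cases, the transformed matrix gets large (36 * 9 rows per case), so a method that does not need the full distance matrix may be more practical.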

HTH, Peter

>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Received on Thu 31 Mar 2011 - 19:27:07 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 31 Mar 2011 - 19:30:26 GMT.
