# Re: [R] Cluster analysis, factor variables, large data set

From: Peter Langfelder <peter.langfelder_at_gmail.com>
Date: Thu, 31 Mar 2011 12:22:54 -0700

On Thu, Mar 31, 2011 at 11:48 AM, Hans Ekbrand <hans_at_sociologi.cjb.net> wrote:
>
> The variables are unordered factors, stored as integers 1:9, where
>
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employee"
> ...
>
> Does euclidean distances make sense on unordered factors coded as
> integers?

It probably doesn't. You said you have some 36 observations for each case, correct? You can turn these 36 observations into a vector of length 36 * 9 on which Euclidean distance does make sense: two cases that differ in k of the 36 observations end up at distance sqrt(2*k). For each observation with value p (p between 1 and 9), create a vector r = c(0,0,1,0,...,0) of length 9 in which the single 1 is in the p-th component, and concatenate these vectors for each case. Hence, if values p1 and p2 are the same, the Euclidean distance between r1 and r2 is zero; if they are not the same, the Euclidean distance is sqrt(2).
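As a quick sanity check of the sqrt(2*k) claim, here is a minimal sketch. The helper name oneHot and the two example cases are made up for illustration and use only 5 observations instead of 36:

```r
# Hypothetical helper: indicator vector of length maxVal with a 1 at position p
oneHot = function(p, maxVal)
{
  r = numeric(maxVal)
  r[p] = 1
  r
}

# Two made-up cases with 5 observations each, values in 1:9
case1 = c(1, 3, 3, 2, 9)
case2 = c(1, 4, 3, 7, 9)   # differs from case1 in 2 positions

# Concatenate the indicator vectors of all observations of a case
v1 = unlist(lapply(case1, oneHot, maxVal = 9))
v2 = unlist(lapply(case2, oneHot, maxVal = 9))

k = sum(case1 != case2)       # number of differing observations
d = sqrt(sum((v1 - v2)^2))    # Euclidean distance of the encoded vectors
all.equal(d, sqrt(2 * k))     # TRUE
```

Each differing observation contributes exactly 2 to the squared distance (one component goes from 1 to 0, another from 0 to 1), which is where the factor 2 comes from.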

Here's some possible R code:

transform = function(obsVector, maxVal)
{
  templateMat = matrix(0, maxVal, maxVal);
  diag(templateMat) = 1;

  return(as.vector(templateMat[, obsVector]));
}

set.seed(10)
n = 4;
m = 5;
max = 4;
data = matrix(sample(c(1:max), n*m, replace = TRUE), m, n);

> data

```
     [,1] [,2] [,3] [,4]
[1,]    3    3    1    2
[2,]    1    3    3    2
[3,]    3    3    2    4
[4,]    1    2    4    2
[5,]    4    1    4    1
```

trafoData = apply(data, 2, transform, maxVal = max);

> trafoData

```
      [,1] [,2] [,3] [,4]
 [1,]    0    0    1    0
 [2,]    0    0    0    1
 [3,]    1    1    0    0
 [4,]    0    0    0    0
 [5,]    1    0    0    0
 [6,]    0    0    0    1
 [7,]    0    1    1    0
 [8,]    0    0    0    0
 [9,]    0    0    0    0
[10,]    0    0    1    0
[11,]    1    1    0    0
[12,]    0    0    0    1
[13,]    1    0    0    0
[14,]    0    1    0    1
[15,]    0    0    0    0
[16,]    0    0    1    0
[17,]    0    1    0    1
[18,]    0    0    0    0
[19,]    0    0    0    0
[20,]    1    0    1    0
```

The code assumes that cases are in columns and observations in rows of data. Examine data and trafoData to see how the transformation works. Once you have the transformed data, simply apply your favorite clustering method that uses Euclidean distance.
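One possible way to carry out that last step is sketched below. It assumes the 5 x 4 example data shown earlier, typed in explicitly so the result does not depend on the random number generator. Hierarchical clustering with average linkage and the cut into 2 clusters are arbitrary choices for illustration, not a recommendation. Note that dist() computes distances between rows, so the transformed matrix has to be transposed first.

```r
# The 5 x 4 example data (cases in columns, observations in rows)
data = matrix(c(3, 3, 1, 2,
                1, 3, 3, 2,
                3, 3, 2, 4,
                1, 2, 4, 2,
                4, 1, 4, 1), nrow = 5, byrow = TRUE)

# Same transformation as before, written with diag() as a shortcut
transform = function(obsVector, maxVal)
{
  templateMat = diag(maxVal)            # maxVal x maxVal identity matrix
  as.vector(templateMat[, obsVector])   # stack one indicator column per observation
}
trafoData = apply(data, 2, transform, maxVal = 4)

# dist() works on rows, so transpose to put the cases in rows
d = dist(t(trafoData), method = "euclidean")

# Squared distance divided by 2 = number of observations in which two cases differ
nDiff = as.matrix(d)^2 / 2

# Average-linkage hierarchical clustering, cut into 2 clusters (illustrative choices)
tree = hclust(d, method = "average")
clusters = cutree(tree, k = 2)
```

Here nDiff[1, 2] is 3, because cases 1 and 2 differ in observations 2, 4 and 5. With the real data there are far more cases than in this toy example, and a full distance matrix may become too large to hold in memory, in which case a method that works on the transformed data directly (for example kmeans) may be more practical.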

HTH, Peter

>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help