Re: [R] Cluster analysis, factor variables, large data set

From: Hans Ekbrand <hans_at_sociologi.cjb.net>
Date: Thu, 31 Mar 2011 21:17:40 +0200

On Thu, Mar 31, 2011 at 08:48:02PM +0200, Hans Ekbrand wrote:
> On Thu, Mar 31, 2011 at 07:06:31PM +0100, Christian Hennig wrote:
> > Dear Hans,
> >
> > clara doesn't require a distance matrix as input (and therefore
> > doesn't require you to run daisy); it will work with the raw data
> > matrix, using Euclidean distances implicitly.
> > I can't tell you whether Euclidean distances are appropriate in this
> > situation (this depends on the interpretation and variables and
> > particularly on how they are scaled), but they may be fine at least
> > after some transformation and standardisation of your variables.
>
> The variables are unordered factors, stored as integers 1:9, where
>
> 1 means "Full-time employment"
> 2 means "Part-time employment"
> 3 means "Student"
> 4 means "Full-time self-employed"
> ...
>
> Do Euclidean distances make sense on unordered factors coded as
> integers?

To be clear, here is an extract of the data:

> my.df.full[900:910, 16:19]
    PL210F.first.year PL210G.first.year PL210H.first.year PL210I.first.year
900                 2                 2                 1                 2
901                 1                 1                 1                 1
902                 1                 1                 1                 1
903                 2                 2                 2                 2
904                 1                 1                 1                 1
905                 2                 2                 2                 2
906                 7                 8                 2                 7
907                 5                 5                 5                 5
908                 1                 1                 1                 1
909                 1                 1                 1                 1
910                 1                 1                 1                 1

> class(my.df.full[,16])
[1] "integer"
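
To illustrate why I am unsure, here is a minimal sketch using four rows
copied from the extract above. The object names (x, x.f, d.euclid,
d.gower) and the shortened column names are only for illustration, and
I am assuming the codes always run from 1 to 9. It compares Euclidean
distances on the raw integer codes with Gower dissimilarities computed
by daisy() after declaring the columns as unordered factors. This is
not meant as a solution for the full data set, since daisy() builds the
complete dissimilarity matrix.

```r
library(cluster)  # provides daisy()

# Four rows copied from the extract above (rows 901, 903, 906, 907).
x <- data.frame(
  PL210F = c(1L, 2L, 7L, 5L),
  PL210G = c(1L, 2L, 8L, 5L),
  PL210H = c(1L, 2L, 2L, 5L),
  PL210I = c(1L, 2L, 7L, 5L),
  row.names = c("901", "903", "906", "907")
)

# Euclidean distance on the raw codes depends on the arbitrary numbering
# of the categories: 901 vs 907 comes out four times as far apart as
# 901 vs 903, although both pairs differ on all four variables.
d.euclid <- as.matrix(dist(x))

# Declare every column an unordered factor (assumed level set: 1 to 9).
x.f <- as.data.frame(lapply(x, factor, levels = 1:9))
rownames(x.f) <- rownames(x)

# With only nominal variables, the Gower dissimilarity is the share of
# variables on which two rows differ.
d.gower <- as.matrix(daisy(x.f, metric = "gower"))

print(round(d.euclid, 2))
print(d.gower)
```

On the factor version, rows that differ on every variable are all
equally far apart, whatever integers happen to encode the categories.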



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 31 Mar 2011 - 19:21:28 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
