Re: [R] Clustering large data matrix

From: Christian Hennig <chrish_at_stats.ucl.ac.uk>
Date: Thu, 06 Mar 2008 12:18:34 +0000 (GMT)

Hi there,

whether clara is a proper way of clustering depends strongly on what your data are and, in particular, on what interpretation or use you want for your clustering. You may do better with a hierarchical method after having defined a proper distance (however, this would go into statistical consultation rather than just R help).
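As an illustration only, a hierarchical clustering on a user-defined distance could be sketched in base R as follows. The simulated matrix, the correlation-based dissimilarity, the average linkage and the choice of two groups are all assumptions made for this example, not recommendations for the NMR data.

```r
# Simulated stand-in for the patients x spectrum-points matrix
# (68 rows as in the question, fewer columns to keep the example quick).
set.seed(1)
x <- matrix(rnorm(68 * 200), nrow = 68)
x[1:30, 1:20] <- x[1:30, 1:20] + 3   # plant one group so there is structure to find

# Assumption: dissimilarity between patients = 1 - correlation of their rows.
d <- as.dist(1 - cor(t(x)))

# Assumption: average linkage; other linkages may suit the data better.
hc <- hclust(d, method = "average")

# Cut the tree into a chosen number of groups (2 here, purely for illustration).
groups <- cutree(hc, k = 2)
table(groups)
```

Calling plot(hc) draws the dendrogram, which can help in judging how many groups are plausible.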

Assuming that you use some reasonable dimension reduction and clustering method, you may get a good visualization of your clustering using the methods available via the functions plotcluster/discrproj in package fpc.
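A minimal sketch of such a workflow, assuming the packages cluster and fpc are installed, might look like this. The simulated data, the 80% variance cutoff and k = 3 are assumptions for illustration; plotcluster() is called with its default projection method, and its help page lists the alternatives.

```r
library(cluster)  # clara()
library(fpc)      # plotcluster()

# Simulated stand-in for the data: 68 observations, three planted groups.
set.seed(1)
x <- matrix(rnorm(68 * 200), nrow = 68)
x[1:25, 1:20]   <- x[1:25, 1:20] + 3
x[26:45, 21:40] <- x[26:45, 21:40] + 3

# Dimension reduction: keep the leading principal components that
# together cover at least 80% of the variance (cutoff is an assumption).
pc <- prcomp(x)
cumvar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
npc <- which(cumvar >= 0.8)[1]
scores <- pc$x[, 1:npc, drop = FALSE]

# Cluster the reduced data; k = 3 is chosen only for this example.
cl <- clara(scores, k = 3)

# Project onto coordinates chosen to separate the clusters,
# rather than onto the first two principal components.
plotcluster(scores, cl$clustering)
```

Unlike clusplot(), the projection here is computed from the cluster labels, so the plot is not restricted to the first two principal components.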

Best,
Christian

On Thu, 6 Mar 2008, Dani Valverde wrote:

> Hello,
> I have a large data matrix (68x13112), each row corresponding to one
> observation (patients) and each column corresponding to the variables
> (points within an NMR spectrum). I would like to carry out some kind of
> clustering on these data to see how many clusters there are. I have
> tried the function clara() from the package cluster. If I use the matrix
> as is, I can perform the clara analysis but when I call clusplot() I get
> this error:
>
> Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
> 'princomp' can only be used with more units than variables
>
> Then, I reduce the dimensionality by using the function prcomp(). Then I
> take the first 13 principal components (more than 80% of the variability)
> and I carry out the clara() analysis again. Then, I call the clusplot()
> function again and voilà!, it works. The problem is that clusplot() only
> represents the first two components of my prcomp() analysis, which
> represent only 15% of the variability.
> So, my questions are 1) is clara() a proper way to analyze such a large
> data set? and 2) Is there an appropriate method for graphic plotting of
> my data, that takes into account the whole variability of my data, not
> just two principal components?
> Many thanks.
> Best,
>
> Dani
>
> --
> Daniel Valverde Saubí
>
> Grup de Biologia Molecular de Llevats
> Facultat de Veterinària de la Universitat Autònoma de Barcelona
> Edifici V, Campus UAB
> 08193 Cerdanyola del Vallès- SPAIN
>
> Centro de Investigación Biomédica en Red
> en Bioingeniería, Biomateriales y
> Nanomedicina (CIBER-BBN)
>
> Grup d'Aplicacions Biomèdiques de la RMN
> Facultat de Biociències
> Universitat Autònoma de Barcelona
> Edifici Cs, Campus UAB
> 08193 Cerdanyola del Vallès- SPAIN
> +34 93 5814126
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


Received on Thu 06 Mar 2008 - 12:29:32 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 06 Mar 2008 - 12:30:19 GMT.
