[R] Clustering large data matrix

From: Dani Valverde <daniel.valverde_at_uab.cat>
Date: Thu, 06 Mar 2008 11:52:46 +0100

I have a large data matrix (68x13112), with each row corresponding to one observation (a patient) and each column to one variable (a point within an NMR spectrum). I would like to carry out some kind of clustering on these data to see how many clusters there are. I have tried the function clara() from the package cluster. If I use the matrix as is, I can perform the clara analysis, but when I call clusplot() I get this error:

Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
  'princomp' can only be used with more units than variables
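
The error comes from princomp(), which clusplot() calls internally and which stops when the matrix has fewer rows than columns. A minimal sketch that reproduces the situation on simulated data follows; the matrix `x` is a made-up stand-in (68 rows, 500 random columns), not the NMR data, and the object names are only illustrative.

```r
set.seed(1)

# Simulated stand-in for the real data: 68 observations, far more variables.
# (500 columns instead of 13112, only to keep the example fast.)
x <- matrix(rnorm(68 * 500), nrow = 68, ncol = 500)

# princomp() stops when there are fewer rows (units) than columns (variables).
res <- try(princomp(x), silent = TRUE)
inherits(res, "try-error")

# prcomp() works through a singular value decomposition and accepts wide matrices.
pc <- prcomp(x)
dim(pc$x)
```

This is why reducing the data with prcomp() first, as described next, gets around the error.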

So I reduced the dimensionality with the function prcomp(), took the first 13 principal components (which together explain more than 80% of the variability), and carried out the clara() analysis again. When I then call clusplot() again, voilà, it works. The problem is that clusplot() only displays the first two components of my prcomp() analysis, which represent only 15% of the variability.
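
The workflow just described might look roughly like the sketch below. It runs on simulated data, so the number of components needed to pass 80% will differ from the 13 reported above; `k = 2` is an arbitrary choice for illustration, and the variable names are assumptions rather than the code actually used.

```r
library(cluster)

set.seed(1)
x <- matrix(rnorm(68 * 500), nrow = 68)  # placeholder for the 68 x 13112 matrix

pc <- prcomp(x)

# Cumulative proportion of variance explained by the components
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)

# Smallest number of leading components that explains more than 80%
n_pc <- which(cum_var > 0.80)[1]

# Keep only the scores on those components
scores <- pc$x[, seq_len(n_pc), drop = FALSE]

# Cluster the reduced data; k = 2 is only an illustrative choice
cl <- clara(scores, k = 2)

# clusplot() draws the observations on the first two principal
# components of whatever data the clustering was run on
clusplot(cl)
```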
So, my questions are: 1) is clara() a proper way to analyze such a large data set? and 2) is there an appropriate method for plotting my data that takes into account the whole variability of my data, not just two principal components?
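
For question 2, two displays that are not limited to a single pair of components can be sketched as follows. This is only one possible approach, not something taken from a reply on the list: a scatterplot matrix of several components coloured by cluster, and a silhouette plot computed from distances over all retained components. The data are simulated, and the choices of 13 components, 5 plotted components and `k = 2` are placeholders.

```r
library(cluster)

set.seed(1)
x <- matrix(rnorm(68 * 500), nrow = 68)  # placeholder data, not the NMR spectra

pc <- prcomp(x)
scores <- pc$x[, 1:13]                   # first 13 components, as in the post
cl <- clara(scores, k = 2)               # k = 2 is an arbitrary illustration

# Scatterplot matrix of the first five components, coloured by cluster
pairs(scores[, 1:5], col = cl$clustering, pch = 19)

# Silhouette widths based on Euclidean distances over all 13 components,
# so the display is not tied to any single two-dimensional projection
sil <- silhouette(cl$clustering, dist(scores))
plot(sil)
```

The silhouette plot shows how well each observation sits in its assigned cluster; it does not show the geometry of the data, so it complements rather than replaces the component plots.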
Many thanks.


Daniel Valverde Saubí

Grup de Biologia Molecular de Llevats
Facultat de Veterinària de la Universitat Autònoma de Barcelona
Edifici V, Campus UAB
08193 Cerdanyola del Vallès- SPAIN

Centro de Investigación Biomédica en Red
en Bioingeniería, Biomateriales y
Nanomedicina (CIBER-BBN)

Grup d'Aplicacions Biomèdiques de la RMN
Facultat de Biociències
Universitat Autònoma de Barcelona
Edifici Cs, Campus UAB
08193 Cerdanyola del Vallès- SPAIN
+34 93 5814126

R-help_at_r-project.org mailing list
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 06 Mar 2008 - 10:59:37 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 06 Mar 2008 - 13:30:19 GMT.
