Re: [R] Clustering large data matrix

From: Andris Jankevics <andza_at_osi.lv>
Date: Thu, 06 Mar 2008 13:47:13 +0200

Hi Dani,

If you are working with NMR data, which data pretreatment methods are you using? 13112 variables sounds like far too many for NMR data; you should apply a data binning or peak-picking method for data reduction. You must also consider the multicollinearity problems typical of spectroscopic data, so data reduction with PCA or a similar method is an essential step in your analysis.
But PCA is also very sensitive to noise, so a supervised classification method, for example PLS-DA, could be more appropriate.
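
As a rough sketch of the binning and PCA steps described above, something like the following could be a starting point. It uses only base R. The simulated matrix, the bin width of 40 points and the 80% variance cut-off are assumptions for illustration, not values tuned for real NMR spectra.

```r
set.seed(42)

# Simulated stand-in for the 68 x 13112 spectral matrix (not real NMR data).
n_obs <- 68
n_points <- 13112
spectra <- matrix(abs(rnorm(n_obs * n_points)), nrow = n_obs)

# Assign each spectral point to a bin of 40 adjacent points (assumed width).
# The last bin is shorter because 13112 is not a multiple of 40.
bin_width <- 40
bin_id <- ceiling(seq_len(n_points) / bin_width)

# rowsum() sums rows by group, so transpose, sum within bins, transpose back.
binned <- t(rowsum(t(spectra), group = bin_id))
dim(binned)

# PCA on the binned data; prcomp() works even with more variables than rows.
pca <- prcomp(binned, center = TRUE, scale. = FALSE)

# Cumulative proportion of variance explained by the leading components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the smallest number of components reaching the assumed 80% threshold.
n_keep <- which(cum_var >= 0.80)[1]
scores <- pca$x[, seq_len(n_keep), drop = FALSE]
```

The `scores` matrix can then be passed to a clustering or classification routine in place of the raw spectra.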

You should take a look at the pls package. The caret package has very well written routines for model reproducibility and stability tests, not only for PLS-DA but also for other methods. The mclust package could also be useful.
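
A minimal sketch of a PLS-DA stability check with caret might look like this. It assumes the caret and pls packages are installed and that class labels are known for each patient; the simulated data, the two class names and the resampling settings (5-fold cross-validation repeated 10 times) are illustrative assumptions only.

```r
# Assumes install.packages(c("caret", "pls")) has been run beforehand.
library(caret)

set.seed(1)

# Simulated stand-in for binned spectra: 68 patients x 300 bins (hypothetical).
n <- 68
p <- 300
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("bin", seq_len(p))))

# Hypothetical class labels; the first 10 bins are shifted in one class.
y <- factor(rep(c("control", "case"), each = n / 2))
x[y == "case", 1:10] <- x[y == "case", 1:10] + 1.5

# Repeated cross-validation gives a spread of accuracies, not a single value.
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 10)

# With a factor outcome, method = "pls" fits a PLS-DA model;
# tuneLength = 5 tries 1 to 5 latent components.
fit <- train(x, y, method = "pls", tuneLength = 5,
             preProcess = c("center", "scale"), trControl = ctrl)

print(fit)          # accuracy and kappa for each number of components
head(fit$resample)  # per-resample results for the selected model
```

The spread of accuracy values in `fit$resample` gives a first impression of how stable the model is across resamples.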

You can also take a look at this package: http://sourceforge.net/projects/kopls/

http://www.jstatsoft.org/v18/i06
http://cran.r-project.org/web/packages/caret/caret.pdf
http://www.jstatsoft.org/v18/i02

http://dx.doi.org/10.1002/cem.887
http://dx.doi.org/10.1186/1471-2105-9-106

Best regards

Dani Valverde wrote:
> Hello,
> I have a large data matrix (68x13112), each row corresponding to one
> observation (patients) and each column corresponding to the variables
> (points within an NMR spectrum). I would like to carry out some kind of
> clustering on these data to see how many clusters are there. I have
> tried the function clara() from the package cluster. If I use the matrix
> as is, I can perform the clara analysis but when I call clusplot() I get
> this error:
>
> Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
> 'princomp' can only be used with more units than variables
>
> Then, I reduce the dimensionality by using the function prcomp(). Then I
> take the first 13 principal components (>80% of the variability) and I carry
> out the clara() analysis again. Then, I call the clusplot() function
> again and voilà!, it works. The problem is that clusplot() only
> represents the first two components of my prcomp() analysis, which
> represent only 15% of the variability.
> So, my questions are 1) is clara() a proper way to analyze such a large
> data set? and 2) Is there an appropriate method for graphic plotting of
> my data, that takes into account the whole variability of my data, not
> just two principal components?
> Many thanks.
> Best,
>
> Dani
>

-- 
Andris Jankevics
Assistant
Department of Medicinal Chemistry
Latvian Institute of Organic Synthesis
Aizkraukles 21, LV-1006, Riga, Latvia

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 06 Mar 2008 - 11:55:38 GMT
