# [R] Top N correlations from 'cor' for very large datasets being run many times

From: Obi Griffith <obig_at_bcgsc.ca>
Date: Fri 10 Jun 2005 - 10:56:16 EST

I am doing an analysis that requires me to calculate correlations for a matrix of 15,000 rows x 50 columns. For each row I want to calculate the correlation to all other rows and then, for each row, find the n (say 10) most correlated rows. If I read in the 15,000 x 50 data from file and pass it to 'cor', this function quite appropriately (and very quickly) calculates all possible row by row comparisons and outputs a matrix of the results. The problem is that this matrix is exceedingly large (approx 1GB). I want to run this analysis thousands of times on a cluster and thus each job must be below 1GB (otherwise I'd just do it on a large memory machine - where it works fine). Since I am only interested in the top n correlations for each row, I would prefer to only store these correlations. However, if I use a loop strategy to calculate correlations and only keep the ones I want, it runs extremely slowly. The correlations for one row (versus all others) actually take as long as all rows versus all rows using the non-loop strategy! Two questions:

1. Does this performance difference make sense? I expected looping to be slower but not that much slower.
2. Is there a way that I can pass the data matrix to 'cor' but only get back the top n correlations for each row in the output matrix? Or, is there another way to get correlations quickly but only store the best results?

Any help would be greatly appreciated. Obi

    # The nice R way to get all possible correlations quickly - too much memory used
    n = 10 # Number of top correlations to keep per row
    file1 = read.table("test.txt", header=F, quote="", sep="\t", comment.char="", as.is=1)
    file1_cor = cor(t(file1), method = "pearson", use = "pairwise.complete.obs")
    diag(file1_cor) = NA # Set correlation to self as NA
    for (i in 1:15000){
      corrs = file1_cor[,i]
      corrs_ordered = order(corrs, decreasing=TRUE) # Order correlations from largest to smallest
      top_corrs = corrs[corrs_ordered[1:n]] # Get top n correlations - these would be added to some data structure and used for subsequent analysis
    }

    # The not so nice way to get all possible correlations for each row and then store only those that I want to keep - too slow
    n = 10 # Number of top correlations to keep per row
    file1 = read.table("test.txt", header=F, quote="", sep="\t", comment.char="", as.is=1)
    for (i in 1:15000){
      corrs = vector(length=15000)
      for (j in 1:15000){
        cor_ij = cor(as.numeric(file1[i,]), as.numeric(file1[j,]), method = "pearson", use = "pairwise.complete.obs")
        corrs[j] = cor_ij
      }
      corrs[i] = NA # Set correlation to self as NA
      corrs_ordered = order(corrs, decreasing=TRUE) # Order correlations from largest to smallest
      top_corrs = corrs[corrs_ordered[1:n]] # Get top n correlations - these would be added to some data structure and used for subsequent analysis
    }
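
One possible middle ground between the two approaches above is to call 'cor' on blocks of rows, so that each call is still vectorised but only a 15,000 x block_size slice of the correlation matrix exists at any time. The following is a minimal sketch under stated assumptions, not a tested solution for the real data: the function name `top_n_cor`, the `block_size` argument and the simulated matrix are all made up for illustration.

```r
# Sketch: top-n correlations per row, computed one block of rows at a time
# so the full row-by-row correlation matrix is never held in memory.
# top_n_cor, block_size and the simulated data below are illustrative assumptions.
top_n_cor <- function(x, n = 10, block_size = 500) {
  x <- as.matrix(x)                  # rows are the items to correlate
  tx <- t(x)                         # cor() correlates columns, so transpose once
  n_rows <- nrow(x)
  top_idx <- matrix(NA_integer_, n_rows, n)  # indices of the n best partners per row
  top_val <- matrix(NA_real_, n_rows, n)     # the corresponding correlations
  for (start in seq(1, n_rows, by = block_size)) {
    rows <- start:min(start + block_size - 1, n_rows)
    # n_rows x length(rows) slice: every row versus the rows in this block
    block <- cor(tx, tx[, rows, drop = FALSE],
                 method = "pearson", use = "pairwise.complete.obs")
    block[cbind(rows, seq_along(rows))] <- NA  # set correlation to self as NA
    for (k in seq_along(rows)) {
      ord <- order(block[, k], decreasing = TRUE)[1:n]  # NAs are sorted last
      top_idx[rows[k], ] <- ord
      top_val[rows[k], ] <- block[ord, k]
    }
  }
  list(index = top_idx, value = top_val)
}

# Small simulated stand-in for the real 15,000 x 50 data
set.seed(1)
x <- matrix(rnorm(200 * 50), nrow = 200)
res <- top_n_cor(x, n = 10, block_size = 64)
```

With 15,000 rows and a block size of 500, each slice holds 15,000 x 500 doubles (about 60 MB), and the two result matrices are 15,000 x 10. As for the speed difference, the double loop makes 15,000 x 15,000 = 225 million separate calls to 'cor', each with its own fixed overhead, and extracts rows from a data frame with `file1[i,]`, which is much slower than indexing a matrix; a large slowdown is therefore plausible, though the exact factor would need to be measured.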


R-help@stat.math.ethz.ch mailing list