Re: [R] fast way to compare two matrices of combinations

From: Mark W Kimpel <mwkimpel_at_gmail.com>
Date: Thu, 13 Mar 2008 20:49:28 -0400

Thanks to all for their suggestions. I apologize for not supplying a self-contained example; I should not post questions when I'm on my way out the door.

Martin's suggestion should work, but I need to put it on our high-performance system next week. On my local 64-bit Linux box with 4GB of RAM, it blew up when a vector reached 2.6GB.
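For a rough sense of scale, the number of triplets can be counted without generating any of them. This sketch uses only the figures quoted in the original question (750 list elements, up to about 40 ids each, median 15); it says nothing about how Martin's approach allocates memory.

```r
M <- 3  # size of each combination (3 = triplets)

# Triplets contributed by one element of the maximum and the median size
choose(40, M)   # 9880
choose(15, M)   # 455

# Upper bound on the total if all 750 elements held 40 ids
upper.bound <- 750 * choose(40, M)   # 7410000

# For the real list, the exact number of (non-unique) triplets would be
# sum(choose(lengths(sig.tf.pairs), M))
```

Even the upper bound is a few million short strings, so an approach that only generates and tabulates the triplets should need far less than a 2.6GB vector.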

I may also get something to work using Charles' suggestion to use R's intrinsic table functions. I initially could not see how to do this with a vector of 3 elements, but I believe I can if I sort each vector, to remove the effect of order, and paste its elements together to make one unique string.
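A minimal sketch of the sort-and-paste idea on made-up data. The list `toy.pairs`, the helper name `triplet.keys`, and the "|" separator are assumptions for illustration only; the separator just needs to be a character that never occurs in a gene id, so that two different triplets cannot paste to the same string.

```r
M <- 3  # size of each combination (3 = triplets)

# Hypothetical stand-in for sig.tf.pairs: a list of vectors of unique ids
toy.pairs <- list(
  c("g3", "g1", "g2"),
  c("g2", "g4", "g1", "g3"),
  c("g5", "g1")              # fewer than M ids, so it contributes nothing
)

# Return one key per M-combination of x. Sorting first makes the key
# independent of the order in which the ids were listed.
triplet.keys <- function(x, m = M, sep = "|") {
  if (length(x) < m) return(character(0))   # combn() would stop with an error
  combn(sort(x), m, FUN = paste, collapse = sep)
}

# Tabulate the keys over all list elements
triplet.counts <- table(unlist(lapply(toy.pairs, triplet.keys)))
```

Here `triplet.counts[["g1|g2|g3"]]` is 2, because that triplet occurs in the first two elements, and each of the other three keys has a count of 1. `strsplit()` on the names with `fixed = TRUE` would recover the individual ids.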

Once I get something that works and is as optimized as I can make it, I'll post it for future reference and for suggestions on further optimization.

Mark

Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry Indiana University School of Medicine

15032 Hunter Court, Westfield, IN 46074

(317) 490-5129 Work, & Mobile & VoiceMail (317) 204-4202 Home (no voice mail please)

mwkimpel<at>gmail<dot>com


Charles C. Berry wrote:
> On Thu, 13 Mar 2008, Mark W Kimpel wrote:
>

>> I have a list (length 750), each element containing a vector of unique
>> strings (unique gene ids), with length up to ~40 (median 15). I want to
>> compile a matrix of all possible triplets and their frequency within
>> gene elements. Using combn and a lot of looping, I am accomplishing this
>> but it is VERY slow.
>>
>> I've tried to figure out a way to vectorize this, using "match" and
>> "%in%", but can't get my mind around it.
>>
>> Below is my code. sig.tf.pairs is the list. Suggestions?

>
> First, be sure that your code does what you really intend for it to do.
>
> Does this really do what you wanted?
>
> if (length(intersect(triplets[,m], all.triplets[,k] == M))){
>
> If so, then why does the first line below never produce an error?
>
> count.vec <- count.vec[,-redundant.vec]
>
> is.null(dim(count.vec)) ## TRUE
>
> You are basically tabulating. Use the functions that are built for that.
>
> It looks like what you want is along these lines:
>
> tab.combns <- function(x) apply( combn( sort(x), M ),2,
> function(x) paste(x,collapse=''))
>
> tab.all <- table( unlist( lapply(sig.tf.pairs,tab.combns) ) )
>
> Chuck
>>
>> Mark
>>
>>
>> ############################################################
>> M <- 3 # 3 for triplets, etc.
>> ##########################################################
>> # count all triplets
>> all.triplets <- NULL
>> all.count.vec <- NULL
>> for (i in 1:length(sig.tf.pairs)){
>>   if (length(sig.tf.pairs[[i]] >= M)){
>>     triplets <- combn(sig.tf.pairs[[i]], M, simplify = TRUE)
>>     for (j in 1:ncol(triplets)){
>>       o <- order(triplets[,j])
>>       triplets[,j] <- triplets[o,j]
>>       count.vec <- rep(1, ncol(triplets))
>>     }
>>     if (is.null(all.count.vec)){
>>       all.count.vec <- count.vec
>>       all.triplets <- triplets
>>     } else {
>>       redundant.vec <- NULL
>>       for (k in 1:ncol(all.triplets)){
>>         for (m in 1:ncol(triplets)){
>>           if (length(intersect(triplets[,m], all.triplets[,k] == M))){
>>             all.count.vec[k] <- all.count.vec[k] + 1
>>             redundant.vec <- c(redundant.vec, m)
>>           }
>>         }
>>       }
>>       if(!is.null(redundant.vec)){
>>         triplets <- triplets[,-redundant.vec]
>>         count.vec <- count.vec[,-redundant.vec]
>>       }
>>       all.triplets <- cbind(all.triplets, triplets)
>>       all.count.vec <- c(all.count.vec, count.vec)
>>     }
>>   }
>> }
>> ###################################
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>

>
> Charles C. Berry                            (858) 534-2098
> Dept of Family/Preventive Medicine
> E mailto:cberry_at_tajo.ucsd.edu            UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>
>
>
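The two questions Chuck raises about the posted loop are linked, and a small sketch may make the link easier to see. The vectors `a`, `b`, and the 4-element `count.vec` below are made up for illustration, and the "intended" form of the test is a guess at what was meant, not something confirmed in the thread.

```r
M <- 3
a <- c("g1", "g2", "g3")   # made-up sorted triplet
b <- c("g1", "g2", "g3")   # identical triplet, so the test should succeed

# As posted: "== M" sits inside intersect(), so b is turned into a
# logical vector first and nothing in a can match it
as.posted <- length(intersect(a, b == M))      # 0, treated as FALSE by if()

# Presumably intended: compare the size of the overlap with M
as.intended <- length(intersect(a, b)) == M    # TRUE

# Because the posted test never succeeds, redundant.vec stays NULL and the
# line that subsets count.vec is never reached. On a plain vector, which
# has no dimensions, that subset would be an error.
count.vec <- rep(1, 4)
is.null(dim(count.vec))                        # TRUE
subset.try <- try(count.vec[, -1], silent = TRUE)
inherits(subset.try, "try-error")              # TRUE
```

The same misplaced parenthesis appears in `length(sig.tf.pairs[[i]] >= M)`, which returns the length of the vector rather than testing it against `M`.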

Received on Fri 14 Mar 2008 - 00:55:02 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 14 Mar 2008 - 01:30:21 GMT.
