[R] millions of comparisons, speed wanted

From: Adrian DUSA <dusa.adrian_at_gmail.com>
Date: Fri 16 Dec 2005 - 04:04:37 EST

Dear all,

I have a 10 columns matrix which has 2^10=1024 unique rows, with values 0 and 1.
What I would like to do is a little complicated; in a simple statement, for a subset (say 1000 rows) I want to perform pairwise comparisons between all rows, find out which rows differ by only one value, replace that value by "x", get rid of the comparisons that differ by more than one value and repeat the algorithm until no further minimization is possible. Any row that hasn't been minimized is kept for the next iteration.

For 1,000 rows, there are almost 500,000 pairs, but in the next iterations I could get some 5,000 rows which generates something like 12.5 millions pairs, and that takes a _very_ long time.

The code I have created (10 lines, below) is super fast (using vectorization) but only for a reasonable number of rows. I am searching for:
- ways to improve my code (if possible)

As a toy example, having an input matrix called "input", my algorithm looks like this:

## code start
ncolumns <- 6
input <- bincombinations(ncolumns) # from package e1071 # subset, let's say 97% of rows
input <- input[sample(2^ncolumns, round(2^ncolumns*0.97, 0), ] minimized <- 1

while (sum(minimized) > 0) {

   minimized <- logical(nrow(input))

   to.be.compared <- combn2(1:nrow(input)) # from package combinat    

   # the following line takes _a lot_ of time, for millions of comparisons    logical.result <- apply(to.be.compared, 1, function(idx) input[idx[1], ] == input[idx[2], ])

   compare.minimized <- which(colSums(!logical.result) == 1)

   logical.result <- logical.result[, compare.minimized]

   result <- sapply(compare.minimized, function(idx) input[to.be.compared[idx, 1], ])

   result[!logical.result] <- "x"

   minimized[unique(as.vector(to.be.compared[compare.minimized, ]))] <- TRUE

   if (sum(minimized) > 0) {

      input <- rbind(input[!minimized, ], unique(t(result)))    }
}
## code end

Any suggestion is welcomed, thank you very much in advance. Adrian

-- 
Adrian DUSA
Romanian Social Data Archive
1, Schitu Magureanu Bd
050025 Bucharest sector 5
Romania
Tel./Fax: +40 21 3126618 \
          +40 21 3120210 / int.101

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Fri Dec 16 04:13:08 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:41:39 EST