# Re: [R] millions of comparisons, speed wanted

From: Liaw, Andy <andy_liaw_at_merck.com>
Date: Fri 16 Dec 2005 - 05:57:37 EST

If the data are all 0/1, you could use dist(input, method="manhattan"), and then check which entry equals 1. This should be much faster than creating all pairs of rows and check position-by-position.

HTH,
Andy

>
> Dear all,
>
> I have a 10 columns matrix which has 2^10=1024 unique rows,
> with values 0 and
> 1.
> What I would like to do is a little complicated; in a simple
> statement, for a
> subset (say 1000 rows) I want to perform pairwise comparisons
> between all
> rows, find out which rows differ by only one value, replace
> that value by
> "x", get rid of the comparisons that differ by more than one
> value and repeat
> the algorithm until no further minimization is possible. Any
> row that hasn't
> been minimized is kept for the next iteration.
>
> For 1,000 rows, there are almost 500,000 pairs, but in the
> next iterations I
> could get some 5,000 rows which generates something like 12.5
> millions pairs,
> and that takes a _very_ long time.
>
> The code I have created (10 lines, below) is super fast
> (using vectorization)
> but only for a reasonable number of rows. I am searching for:
> - ways to improve my code (if possible)
> - ideas: create a C code for the slow parts of the code? use
> MySQL? other
> ways?
>
> As a toy example, having an input matrix called "input", my
> algorithm looks
> like this:
>
> ## code start
> ncolumns <- 6
> input <- bincombinations(ncolumns) # from package e1071
> # subset, let's say 97% of rows
> input <- input[sample(2^ncolumns, round(2^ncolumns*0.97, 0), ]
> minimized <- 1
>
> while (sum(minimized) > 0) {
>
> minimized <- logical(nrow(input))
>
> to.be.compared <- combn2(1:nrow(input)) # from package combinat
>
> # the following line takes _a lot_ of time, for millions
> of comparisons
> logical.result <- apply(to.be.compared, 1, function(idx)
> input[idx[1], ] ==
> input[idx[2], ])
>
> compare.minimized <- which(colSums(!logical.result) == 1)
>
> logical.result <- logical.result[, compare.minimized]
>
> result <- sapply(compare.minimized, function(idx)
> input[to.be.compared[idx,
> 1], ])
>
> result[!logical.result] <- "x"
>
>
> minimized[unique(as.vector(to.be.compared[compare.minimized,
> ]))] <- TRUE
>
> if (sum(minimized) > 0) {
> input <- rbind(input[!minimized, ], unique(t(result)))
> }
> }
> ## code end
>
> Any suggestion is welcomed, thank you very much in advance.
>
> --
> Romanian Social Data Archive
> 1, Schitu Magureanu Bd
> 050025 Bucharest sector 5
> Romania
> Tel./Fax: +40 21 3126618 \
> +40 21 3120210 / int.101
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help