From: Liaw, Andy <andy_liaw_at_merck.com>

Date: Fri 16 Dec 2005 - 05:57:37 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Fri Dec 16 06:05:39 2005

Date: Fri 16 Dec 2005 - 05:57:37 EST

If the data are all 0/1, you could use dist(input, method="manhattan"), and then check which entry equals 1. This should be much faster than creating all pairs of rows and check position-by-position.

**HTH,
**

Andy

From: Adrian DUSA

*>
*

> Dear all,

*>
**> I have a 10 columns matrix which has 2^10=1024 unique rows,
**> with values 0 and
**> 1.
**> What I would like to do is a little complicated; in a simple
**> statement, for a
**> subset (say 1000 rows) I want to perform pairwise comparisons
**> between all
**> rows, find out which rows differ by only one value, replace
**> that value by
**> "x", get rid of the comparisons that differ by more than one
**> value and repeat
**> the algorithm until no further minimization is possible. Any
**> row that hasn't
**> been minimized is kept for the next iteration.
**>
**> For 1,000 rows, there are almost 500,000 pairs, but in the
**> next iterations I
**> could get some 5,000 rows which generates something like 12.5
**> millions pairs,
**> and that takes a _very_ long time.
**>
**> The code I have created (10 lines, below) is super fast
**> (using vectorization)
**> but only for a reasonable number of rows. I am searching for:
**> - ways to improve my code (if possible)
**> - ideas: create a C code for the slow parts of the code? use
**> MySQL? other
**> ways?
**>
**> As a toy example, having an input matrix called "input", my
**> algorithm looks
**> like this:
**>
**> ## code start
**> ncolumns <- 6
**> input <- bincombinations(ncolumns) # from package e1071
**> # subset, let's say 97% of rows
**> input <- input[sample(2^ncolumns, round(2^ncolumns*0.97, 0), ]
**> minimized <- 1
**>
**> while (sum(minimized) > 0) {
**>
**> minimized <- logical(nrow(input))
**>
**> to.be.compared <- combn2(1:nrow(input)) # from package combinat
**>
**> # the following line takes _a lot_ of time, for millions
**> of comparisons
**> logical.result <- apply(to.be.compared, 1, function(idx)
**> input[idx[1], ] ==
**> input[idx[2], ])
**>
**> compare.minimized <- which(colSums(!logical.result) == 1)
**>
**> logical.result <- logical.result[, compare.minimized]
**>
**> result <- sapply(compare.minimized, function(idx)
**> input[to.be.compared[idx,
**> 1], ])
**>
**> result[!logical.result] <- "x"
**>
**>
**> minimized[unique(as.vector(to.be.compared[compare.minimized,
**> ]))] <- TRUE
**>
**> if (sum(minimized) > 0) {
**> input <- rbind(input[!minimized, ], unique(t(result)))
**> }
**> }
**> ## code end
**>
**> Any suggestion is welcomed, thank you very much in advance.
**> Adrian
**>
**> --
**> Adrian DUSA
**> Romanian Social Data Archive
**> 1, Schitu Magureanu Bd
**> 050025 Bucharest sector 5
**> Romania
**> Tel./Fax: +40 21 3126618 \
**> +40 21 3120210 / int.101
**>
**> ______________________________________________
**> R-help@stat.math.ethz.ch mailing list
**> https://stat.ethz.ch/mailman/listinfo/r-help
**> PLEASE do read the posting guide!
**> http://www.R-project.org/posting-guide.html
**>
*

>

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Fri Dec 16 06:05:39 2005

*
This archive was generated by hypermail 2.1.8
: Fri 03 Mar 2006 - 03:41:39 EST
*