Re: [R] millions of comparisons, speed wanted

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Sat 17 Dec 2005 - 02:27:42 EST

I have not taken the time to look into this example, but

        daisy()
from the (recommended, hence part of R) package 'cluster' is more flexible than dist(), particularly in the case of NAs and for (a mixture of continuous and) categorical variables.

It uses a version of Gower's formula in order to deal with NAs and asymmetric binary variables. The example below looks like a very good match for this problem.
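As a rough, untested-in-this-thread sketch of how daisy() might be applied here: the rows below are made up for illustration, and keeping "x" as a third factor level (instead of recoding it to NA) is an assumption, not something proposed above. With factor columns daisy() uses Gower's coefficient, which for nominal variables is the share of columns in which two rows disagree.

```r
library(cluster)  # recommended package that provides daisy()

# Hypothetical rows; "x" is kept as its own category (an assumption).
rows <- rbind(c("0", "1", "0", "1", "1"),
              c("0", "0", "0", "1", "1"),
              c("1", "0", "1", "1", "1"),
              c("0", "x", "0", "1", "1"))
df <- as.data.frame(rows, stringsAsFactors = FALSE)
df[] <- lapply(df, factor, levels = c("0", "1", "x"))

# Factor columns are treated as nominal variables under Gower's coefficient.
d <- as.matrix(daisy(df, metric = "gower"))

# Scale the proportions back to a count of differing positions per pair.
n_diff <- round(d * ncol(df))

# Candidate pairs: rows that differ in exactly one position (each pair once).
candidates <- which(n_diff == 1 & upper.tri(n_diff), arr.ind = TRUE)
```

Note that a pair such as rows 2 and 4 also differs in one position only, so a separate check that both rows carry "x" in the same places would still be needed.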

Regards,
Martin Maechler, ETH Zurich

>>>>> "Adrian" == Adrian DUSA <adi@roda.ro>
>>>>>     on Thu, 15 Dec 2005 22:04:01 +0200 writes:

    Adrian> Dear Andy,
    Adrian> On Thursday 15 December 2005 20:57, Liaw, Andy wrote:
>> Just an untested idea:
>> If the data are all 0/1, you could use dist(input, method="manhattan"), and
>> then check which entries equal 1. This should be much faster than creating
>> all pairs of rows and checking them position-by-position.
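
A minimal sketch of that suggestion, assuming a purely 0/1 matrix; the contents of `input` are invented for illustration. For 0/1 rows the Manhattan distance equals the number of positions in which the two rows differ, so a distance of 1 marks a pair that differs in exactly one position.

```r
# Hypothetical 0/1 input; each row is one configuration to compare.
input <- rbind(c(0, 1, 0, 1, 1),
               c(0, 0, 0, 1, 1),
               c(1, 1, 1, 0, 0))

# Manhattan distance between 0/1 rows counts the differing positions.
d <- as.matrix(dist(input, method = "manhattan"))

# Keep each pair once (upper triangle) where exactly one position differs.
pairs <- which(d == 1 & upper.tri(d), arr.ind = TRUE)
```

Here `pairs` holds the row indices of each qualifying pair in its "row" and "col" columns.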

    Adrian> Thanks for the idea, I played a little with it. At the beginning yes, the data 
    Adrian> are all 0/1, but during the minimizing iterations there are also "x" values; 
    Adrian> for example comparing:
    Adrian> 0 1 0 1 1
    Adrian> 0 0 0 1 1
    Adrian> should return
    Adrian> 0 "x" 0 1 1

    Adrian> whereas
    Adrian> 0 "x" 0 1 1
    Adrian> 0 0 0 1 1
    Adrian> shouldn't even be compared (they have a different number of figures).
    Adrian> Replacing "x" with NA before calling dist does not help either, since with
    Adrian> NA 0 0 1 1
    Adrian> 0 0 0 1 1
    Adrian> dist returns 0.
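
The behaviour reported above can be reproduced in a few lines, and the pairing rule itself can be written out in plain R. This is only a sketch: `combine_rows` is a hypothetical helper name, rows are assumed to be character vectors, and nothing here is tuned for millions of comparisons. dist() skips columns that contain an NA in either row, which is why the first call yields 0.

```r
# dist() ignores the column with the NA, so these rows look identical.
a <- c(NA, 0, 0, 1, 1)
b <- c(0, 0, 0, 1, 1)
d_na <- dist(rbind(a, b), method = "manhattan")

# Hypothetical helper: merge two rows only if they have "x" in the same
# places and differ in exactly one of the remaining positions.
combine_rows <- function(r1, r2) {
  if (length(r1) != length(r2)) stop("rows must have the same length")
  # Rows with "x" in different positions are not comparable.
  if (!identical(r1 == "x", r2 == "x")) return(NULL)
  differs <- which(r1 != r2)
  if (length(differs) != 1) return(NULL)
  r1[differs] <- "x"
  r1
}

merged  <- combine_rows(c("0", "1", "0", "1", "1"), c("0", "0", "0", "1", "1"))
skipped <- combine_rows(c("0", "x", "0", "1", "1"), c("0", "0", "0", "1", "1"))
```

With the rows from the message, `merged` is the vector 0 "x" 0 1 1 and `skipped` is NULL because the "x" positions do not match.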

    Adrian> I even wanted to see if I could tweak the dist code, but it calls a C program
    Adrian> and I gave up.

    Adrian> Nice idea anyhow, maybe I'll find a way to use it further.
    Adrian> Best,
    Adrian> Adrian

    Adrian> -- 
    Adrian> Adrian DUSA
    Adrian> Romanian Social Data Archive
    Adrian> 1, Schitu Magureanu Bd
    Adrian> 050025 Bucharest sector 5
    Adrian> Romania

    Adrian> Tel./Fax: +40 21 3126618 \
    Adrian> +40 21 3120210 / int.101
    Adrian> ______________________________________________
    Adrian> R-help@stat.math.ethz.ch mailing list
    Adrian> https://stat.ethz.ch/mailman/listinfo/r-help
    Adrian> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Sat Dec 17 02:35:16 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:41:39 EST