Re: [R] how to efficiently compute set unique?

From: David Winsemius <dwinsemius_at_comcast.net>
Date: Mon, 21 Jun 2010 21:38:36 -0400

On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:

> On 21/06/2010 9:06 PM, G FANG wrote:
>> Hi,
>>
>> I want to get the unique set from a large numeric k by 1 vector, k is
>> in tens of millions
>>
>> when I used the matlab function unique, it takes less than 10 secs
>>
>> but when I tried to use the unique in R with similar CPU and memory,
>> it is not done in minutes
>>
>> I am wondering, am I using the function in the right way?
>>
>> dim(cntxtn)
>> [1] 13584763 1
>> uniqueCntxt = unique(cntxtn); # this is taking really long
>
> What type is cntxtn? If I do that sort of thing on a numeric
> vector, it's quite fast:
>
> > x <- sample(100000, size=13584763, replace=T)
> > system.time(unique(x))
> user system elapsed
> 3.61 0.14 3.75

If it's a factor, it could be as simple as:

levels(cntxtn) # since the work of "unique-ification" has already been done.

 > x <- factor(sample(100000, size=13584763, replace=T))
 > system.time(levels(x))
    user  system elapsed
       0       0       0
 > system.time(y <- levels(x))
    user  system elapsed
       0       0       0
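
A rough sketch of how one might check both points on a small stand-in object. The data here are made up (the real cntxtn is 13584763 x 1), and the idea that the dim attribute is what slows unique() down is only an assumption: unique() dispatches on class, so a matrix or data frame goes through a row-wise method rather than the plain vector method. Note also that levels() reports every level the factor was created with, which can be more than the values actually present after subsetting.

```r
# Hypothetical stand-in for the poster's k-by-1 object.
set.seed(1)
cntxtn <- matrix(sample(1000, size = 1e5, replace = TRUE), ncol = 1)

# See what kind of object it is before calling unique() on it.
class(cntxtn)
is.factor(cntxtn)

# Drop the dim attribute so the plain vector method of unique() is used.
v <- as.vector(cntxtn)
u <- unique(v)

# For a factor, levels() is immediate, but it can list levels that no
# longer occur in the data.
f <- factor(v)
sub <- f[f %in% levels(f)[1:10]]      # keep only values from the first 10 levels
n_all  <- length(levels(sub))         # still every original level
n_used <- length(levels(droplevels(sub)))  # only the levels actually present
```

If the factor has been subsetted, droplevels() (or unique() on the factor itself) gives the values that really occur; on an untouched factor, levels() alone is enough.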

-- 

David Winsemius, MD
West Hartford, CT

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue 22 Jun 2010 - 01:40:19 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 22 Jun 2010 - 04:30:34 GMT.
