Re: [R] how to efficiently compute set unique?

From: Douglas Bates <bates_at_stat.wisc.edu>
Date: Mon, 21 Jun 2010 21:01:17 -0500

On Mon, Jun 21, 2010 at 8:38 PM, David Winsemius <dwinsemius_at_comcast.net> wrote:
>
> On Jun 21, 2010, at 9:18 PM, Duncan Murdoch wrote:
>
>> On 21/06/2010 9:06 PM, G FANG wrote:
>>>
>>> Hi,
>>>
>>> I want to get the unique set from a large numeric k by 1 vector, k is
>>> in tens of millions
>>>
>>> when I used the matlab function unique, it takes less than 10 secs
>>>
>>> but when I tried to use the unique in R with similar CPU and memory,
>>> it is not done in minutes
>>>
>>> I am wondering, am I using the function in the right way?
>>>
>>> dim(cntxtn)
>>> [1] 13584763        1
>>> uniqueCntxt = unique(cntxtn);    # this is taking really long
>>
>> What type is cntxtn?  If I do that sort of thing on a numeric vector, it's
>> quite fast:
>>
>> > x <- sample(100000, size=13584763, replace=T)
>> > system.time(unique(x))
>>  user  system elapsed
>>  3.61    0.14    3.75
>
> If it's a factor, it could be as simple as:
>
> levels(cntxtn)  # since the work of "unique-ification" has already been
> done.

Not quite. When you generate a factor, as you do in your example, the levels correspond to the unique values of the original vector. But when you take a subset of a factor, the levels are preserved intact, even if some of those levels do not occur in the subset. This is why functions like model.frame have unusual arguments with names like drop.unused.levels. It is also the source of a subtle difference in behavior between factor(x) and as.factor(x) when x is already a factor: factor(x) drops the unused levels, whereas as.factor(x) returns x unchanged.

> ff <- factor(sample.int(200, 1000, replace = TRUE))
> ff1 <- ff[1:40]
> length(levels(ff))
[1] 199
> length(levels(ff1))
[1] 199
> length(levels(as.factor(ff1)))
[1] 199
> length(levels(factor(ff1)))
[1] 34
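
For completeness, here is a minimal sketch of ways to recover only the levels that actually occur in a subset. It assumes a version of R that provides droplevels() (2.12.0 or later); the seed and the names lv_factor, lv_drop, lv_sub and present are illustrative only and not part of the original example.

```r
set.seed(1)  # arbitrary seed, only so that repeated runs agree

ff  <- factor(sample.int(200, 1000, replace = TRUE))
ff1 <- ff[1:40]

# Subsetting keeps every level of the parent factor.
stopifnot(identical(levels(ff1), levels(ff)))

# Three ways to keep only the levels that occur in the subset.
lv_factor <- levels(factor(ff1))            # re-create the factor
lv_drop   <- levels(droplevels(ff1))        # drop unused levels explicitly
lv_sub    <- levels(ff[1:40, drop = TRUE])  # drop them while subsetting

# All three give the same levels, in the same order.
stopifnot(identical(lv_factor, lv_drop), identical(lv_drop, lv_sub))

# unique() on a factor returns the values present (all levels still attached),
# so as a set they match the levels that survive the drop.
present <- as.character(unique(ff1))
stopifnot(setequal(present, lv_factor))
```

So levels(x) is a shortcut for unique(x) only when x is known to carry no unused levels; otherwise one of the forms above is probably the safer choice.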

>> x <- factor(sample(100000, size=13584763, replace=T))
>> system.time(levels(x))
>   user  system elapsed
>      0       0       0
>> system.time(y <- levels(x))
>   user  system elapsed
>      0       0       0



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Tue 22 Jun 2010 - 02:03:31 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 22 Jun 2010 - 06:20:33 GMT.
