Re: [R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?

From: Peter Cowan <cowan.pd_at_gmail.com>
Date: Tue, 12 Aug 2008 19:31:33 -0700

Emmanuel,

On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy_at_gmail.com> wrote:
> Dear All,
>
> I have a large data frame ( 2700000 lines and 14 columns), and I would like to
> extract the information in a particular way illustrated below:
>
>
> Given a data frame "df":
>
>> col1=sample(c(0,1),10, rep=T)
>> names = factor(c(rep("A",5),rep("B",5)))
>> df = data.frame(names,col1)
>> df
> names col1
> 1 A 1
> 2 A 0
> 3 A 1
> 4 A 0
> 5 A 1
> 6 B 0
> 7 B 0
> 8 B 1
> 9 B 0
> 10 B 0
>
> I would like to transform it into the form:
>
>> index = c("A","B")
>> col1[[1]]=df$col1[which(df$name=="A")]
>> col1[[2]]=df$col1[which(df$name=="B")]

I'm not sure I fully understand your problem; your example would not run for me.

You could get a small speedup by omitting which(): you can subset directly with the logical vector that the comparison returns.

> n <- 2700000
> foo <- data.frame(
+     one = sample(c(0, 1), n, replace = TRUE),
+     two = factor(c(rep("A", n/2), rep("B", n/2)))
+ )

> system.time(out <- which(foo$two=="A"))
   user system elapsed
  0.566 0.146 0.761
> system.time(out <- foo$two=="A")
   user system elapsed
  0.429 0.075 0.588
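
One caveat when dropping which(): the two forms only agree when the grouping column has no missing values. which() discards NA positions, whereas an NA in a logical index produces an NA element in the result. A minimal sketch on made-up toy data (the data frame contents and the variable names below are illustrative, not taken from the original post):

```r
# Toy data (hypothetical), with one missing group label.
df <- data.frame(
  names = factor(c("A", "A", NA, "B", "B")),
  col1  = c(1, 0, 1, 0, 1)
)

keep <- df$names == "A"             # logical vector containing one NA

by_which   <- df$col1[which(keep)]  # which() drops the NA position
by_logical <- df$col1[keep]         # the NA index yields an NA element

# Treat missing labels as "not in the group" while staying with
# logical indexing.
by_safe <- df$col1[!is.na(keep) & keep]
```

If the grouping column is known to contain no NA values, plain `df$col1[keep]` gives the same result as the which() version.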

You might also find use for unstack(), though I didn't see a speedup.

> system.time(out <- unstack(foo))
   user system elapsed
  1.068 0.697 2.004
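
For the list-of-vectors layout asked about in the original post, base R's split() builds every group in a single call instead of running one comparison per level. This is an alternative to the which()/unstack() approaches timed above, not something from the original reply. A minimal sketch on toy data shaped like the example in the question; the seed and the name `groups` are assumptions for illustration, and it has not been timed on the 2.7-million-row data frame:

```r
set.seed(1)  # arbitrary seed so the toy data is repeatable
df <- data.frame(
  names = factor(c(rep("A", 5), rep("B", 5))),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# One call: a named list with one vector of col1 values per factor level.
groups <- split(df$col1, df$names)

index <- names(groups)  # the level names, in level order
groups[["A"]]           # same values as df$col1[df$names == "A"]
```

Because the result is a named list, each group can be looked up by its level name rather than by position.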

HTH,

Peter

> My problem is that the command: *** which(df$name=="A") ***
> takes about 1 second because df is so big.
>
> I was thinking that a "level" could maybe be accessed instantly but I am not
> sure about how to do it.
>
> I would be very grateful for any advice that would allow me to speed this up.
>
> Best wishes,
>
> Emmanuel



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 13 Aug 2008 - 02:38:06 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 13 Aug 2008 - 05:33:50 GMT.
