From: Emmanuel Levy <emmanuel.levy_at_gmail.com>

Date: Wed, 13 Aug 2008 12:03:37 -0400

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Wed 13 Aug 2008 - 16:05:48 GMT

Wow great! Split was exactly what was needed. It takes about 1 second for the whole operation :D

Thanks again - I can't believe I never used this function in the past.
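
For anyone reading this thread in the archive, here is a minimal sketch of the split() approach, using a toy data frame in the shape of the original post quoted below. The set.seed() call and the checks at the end are additions for illustration and are not part of the original exchange.

```r
# Toy data in the shape of the original post. set.seed() is an added
# assumption so that the example is reproducible.
set.seed(1)
df <- data.frame(
  names = factor(c(rep("A", 5), rep("B", 5))),
  col1  = sample(c(0, 1), 10, replace = TRUE)
)

# Group the values by factor level in a single call. The result is a
# named list with one element per level.
L <- split(df$col1, df$names)

# Each element holds the same values as subsetting one level at a time,
# which is what the original question did with which().
stopifnot(identical(L[["A"]], df$col1[df$names == "A"]))
stopifnot(identical(L[["B"]], df$col1[df$names == "B"]))

names(L)    # level labels, in the order given by levels(df$names)
lengths(L)  # number of values stored for each level
```

The list elements can then be looked up by level name, e.g. L[["A"]], without rescanning the full column for every level.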

2008/8/13 Erik Iverson <iverson_at_biostat.wisc.edu>:

> I still don't understand what you are doing. Can you make a small example
> that shows what you have and what you want?
>
> Is ?split what you are after?
>
> Emmanuel Levy wrote:
>>
>> Dear Peter and Henrik,
>>
>> Thanks for your replies - this helps speed up a bit, but I thought
>> there would be something much faster.
>>
>> What I mean is that I thought that a particular value of a level
>> could be accessed instantly, similarly to a "hash" key.
>>
>> Since I've got about 6000 levels in that data frame, it means that
>> making a list L of the form
>> L[[1]] = values of name "1"
>> L[[2]] = values of name "2"
>> L[[3]] = values of name "3"
>> ...
>> would take ~1 hour.
>>
>> Best,
>>
>> Emmanuel
>>
>> 2008/8/12 Henrik Bengtsson <hb_at_stat.berkeley.edu>:
>>>
>>> To simplify:
>>>
>>> n <- 2.7e6;
>>> x <- factor(c(rep("A", n/2), rep("B", n/2)));
>>>
>>> # Identify 'A':s
>>> t1 <- system.time(res <- which(x == "A"));
>>>
>>> # To compare a factor to a string, the factor is in practice
>>> # coerced to a character vector.
>>> t2 <- system.time(res <- which(as.character(x) == "A"));
>>>
>>> # Interestingly enough, this seems to be faster (repeated many times)
>>> # Don't know why.
>>> print(t2/t1);
>>>     user   system  elapsed
>>> 0.632653 1.600000 0.754717
>>>
>>> # Avoid coercing the factor; instead convert the level being compared
>>> # against to its integer index
>>> t3 <- system.time(res <- which(x == match("A", levels(x))));
>>>
>>> # ...but gives no speed up
>>> print(t3/t1);
>>>     user   system  elapsed
>>> 1.041667 1.000000 1.018182
>>>
>>> # But coercing the factor to integers does
>>> t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
>>> print(t4/t1);
>>>      user    system   elapsed
>>> 0.4166667 0.0000000 0.3636364
>>>
>>> So, the latter seems to be the fastest way to identify those elements.
>>>
>>> My $.02
>>>
>>> /Henrik
>>>
>>> On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan <cowan.pd_at_gmail.com> wrote:
>>>>
>>>> Emmanuel,
>>>>
>>>> On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy_at_gmail.com> wrote:
>>>>>
>>>>> Dear All,
>>>>>
>>>>> I have a large data frame (2700000 lines and 14 columns), and I would
>>>>> like to extract the information in a particular way, illustrated below.
>>>>>
>>>>> Given a data frame "df":
>>>>>
>>>>> > col1=sample(c(0,1),10, rep=T)
>>>>> > names = factor(c(rep("A",5),rep("B",5)))
>>>>> > df = data.frame(names,col1)
>>>>> > df
>>>>>    names col1
>>>>> 1      A    1
>>>>> 2      A    0
>>>>> 3      A    1
>>>>> 4      A    0
>>>>> 5      A    1
>>>>> 6      B    0
>>>>> 7      B    0
>>>>> 8      B    1
>>>>> 9      B    0
>>>>> 10     B    0
>>>>>
>>>>> I would like to transform it into the form:
>>>>>
>>>>> > index = c("A","B")
>>>>> > col1[[1]]=df$col1[which(df$name=="A")]
>>>>> > col1[[2]]=df$col1[which(df$name=="B")]
>>>>
>>>> I'm not sure I fully understand your problem; your example would not run
>>>> for me.
>>>>
>>>> You could get a small speedup by omitting which(); you can also subset
>>>> directly with a logical vector.
>>>>
>>>> > n <- 2700000
>>>> > foo <- data.frame(
>>>> +   one = sample(c(0,1), n, rep = T),
>>>> +   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
>>>> + )
>>>> > system.time(out <- which(foo$two=="A"))
>>>>    user  system elapsed
>>>>   0.566   0.146   0.761
>>>> > system.time(out <- foo$two=="A")
>>>>    user  system elapsed
>>>>   0.429   0.075   0.588
>>>>
>>>> You might also find use for unstack(), though I didn't see a speedup.
>>>>
>>>> > system.time(out <- unstack(foo))
>>>>    user  system elapsed
>>>>   1.068   0.697   2.004
>>>>
>>>> HTH
>>>>
>>>> Peter
>>>>
>>>>> My problem is that the command: *** which(df$name=="A") ***
>>>>> takes about 1 second because df is so big.
>>>>>
>>>>> I was thinking that a "level" could maybe be accessed instantly, but I
>>>>> am not sure how to do it.
>>>>>
>>>>> I would be very grateful for any advice that would allow me to speed
>>>>> this up.
>>>>>
>>>>> Best wishes,
>>>>>
>>>>> Emmanuel
