Emmanuel,

On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy <emmanuel.levy_at_gmail.com> wrote:

> Dear All,
>
> I have a large data frame (2700000 lines and 14 columns), and I would like to
> extract the information in a particular way illustrated below:
>
> Given a data frame "df":
>
> > col1=sample(c(0,1),10, rep=T)
> > names = factor(c(rep("A",5),rep("B",5)))
> > df = data.frame(names,col1)
> > df
>    names col1
> 1      A    1
> 2      A    0
> 3      A    1
> 4      A    0
> 5      A    1
> 6      B    0
> 7      B    0
> 8      B    1
> 9      B    0
> 10     B    0
>
> I would like to transform it in the form:
>
> > index = c("A","B")
> > col1[[1]]=df$col1[which(df$name=="A")]
> > col1[[2]]=df$col1[which(df$name=="B")]

I'm not sure I fully understand your problem; your example would not run for me.

You could get a small speedup by omitting which(): you can subset by the logical vector directly.

> n <- 2700000
> foo <- data.frame(
+   one = sample(c(0,1), n, rep = T),
+   two = factor(c(rep("A", n/2 ),rep("B", n/2 )))
+ )
> system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761

> system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588
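To make the comparison concrete, here is a minimal sketch on a small stand-in for foo (10 rows rather than 2,700,000, with a seed added only for repeatability). It shows that subsetting by which() and subsetting by the logical vector select the same values. The last variant, comparing the factor's integer codes, is an assumption on my part about where time might be saved; it is not timed here.

```r
# Small stand-in for foo; the real data frame has 2,700,000 rows.
set.seed(1)
n <- 10
foo <- data.frame(
  one = sample(c(0, 1), n, replace = TRUE),
  two = factor(rep(c("A", "B"), each = n / 2))
)

# Integer positions via which(), as in the quoted question.
by_which <- foo$one[which(foo$two == "A")]

# The logical vector can be used as the subscript directly.
by_logical <- foo$one[foo$two == "A"]

# Compare the factor's integer codes instead of its labels
# (assumption: "A" is one of the levels of foo$two).
code_a <- match("A", levels(foo$two))
by_code <- foo$one[as.integer(foo$two) == code_a]
```

One difference to keep in mind: if the factor contains NA, the logical subscript returns NA entries for those rows, whereas which() drops them.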

You might also find use for unstack(), though I didn't see a speedup.

> system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004
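For reference, a small sketch of what unstack() produces on data shaped like the example in the question, next to split(), which may be closer to the list-per-level form asked for. The values below are copied from the quoted printout; split() is my suggestion and was not timed above.

```r
# Data frame matching the printout in the quoted question.
df <- data.frame(
  names = factor(rep(c("A", "B"), each = 5)),
  col1  = c(1, 0, 1, 0, 1, 0, 0, 1, 0, 0)
)

# unstack() with an explicit formula: values ~ grouping factor.
# With equal group sizes the result is a data frame with columns A and B.
by_unstack <- unstack(df, col1 ~ names)

# split() returns a named list with one vector per factor level.
by_split <- split(df$col1, df$names)

by_split$A
by_split[["B"]]
```

Both calls walk the whole column once and return all groups together, so they replace the per-level which() calls rather than speeding up a single one.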

HTH

Peter

> My problem is that the command: *** which(df$name=="A") ***
> takes about 1 second because df is so big.
>
> I was thinking that a "level" could maybe be accessed instantly but I am not
> sure about how to do it.
>
> I would be very grateful for any advice that would allow me to speed this up.
>
> Best wishes,
>
> Emmanuel

