From: Thomas Lumley <tlumley_at_u.washington.edu>

Date: Sat 15 Oct 2005 - 00:38:22 EST


On Fri, 14 Oct 2005, Ido M. Tamir wrote:

> Hello,

> i am trying to subset a dataframe multiple times:
> something like:
>
> stats <- by(df, list(items), ttestData)
>
> ttestData <- function(df){
>   t.test(df[, c(2,3,4)], df[, c(5,6,7)])
> }
>
> While this works for small data, it is too slow for my
> actual data: a 500000-row dataframe with
> about 135000 different indices, subsetting the
> dataframe into chunks of 5 rows on average.
>
> Do you have any suggestions how I could speed this up?

The first step is to find out what is too slow, using Rprof(). It may be the t.test or it may be the by().
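For example (a rough sketch; "ttest.out" is just an arbitrary file name):

Rprof("ttest.out")
stats <- by(df, list(items), ttestData)
Rprof(NULL)
summaryRprof("ttest.out")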

If it is the by() you could put the numeric data into two matrices

x1 <- as.matrix(df[, 2:4])

x2 <- as.matrix(df[, 5:7])

order them so that the same "item" entries are adjacent, compute the start and end indices for each group, and do something like

lapply(1:howevermany, function(i) t.test(x1[start[i]:end[i], ], x2[start[i]:end[i], ]))

Even just turning df into a matrix might help.
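A rough sketch of that (untested; it continues from x1 and x2 above and assumes 'items' is the grouping vector with one entry per row of df):

o <- order(items)
x1 <- x1[o, ]
x2 <- x2[o, ]
r <- rle(as.character(items[o]))   # runs of identical items after ordering
end <- cumsum(r$lengths)           # last row of each group
start <- end - r$lengths + 1       # first row of each group
stats <- lapply(seq_along(start),
                function(i) t.test(x1[start[i]:end[i], ], x2[start[i]:end[i], ]))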

If it is the repeated t.test() calls that are too slow you need to speed them up. You can probably use rowsum() to compute means and variances for all the t-tests at once.
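Something along these lines (untested; Welch-type statistics built from per-group sums and sums of squares, where both samples in a group have the same size n because each uses three columns of the same rows):

x1 <- as.matrix(df[, 2:4])
x2 <- as.matrix(df[, 5:7])
n  <- 3 * as.vector(table(items))    # observations per group in each sample
s1 <- rowSums(rowsum(x1, items)); q1 <- rowSums(rowsum(x1^2, items))
s2 <- rowSums(rowsum(x2, items)); q2 <- rowSums(rowsum(x2^2, items))
m1 <- s1/n; m2 <- s2/n               # group means
v1 <- (q1 - n*m1^2)/(n - 1)          # group variances
v2 <- (q2 - n*m2^2)/(n - 1)
se2   <- v1/n + v2/n
tstat <- (m1 - m2)/sqrt(se2)
dfree <- se2^2/((v1/n)^2/(n - 1) + (v2/n)^2/(n - 1))   # Welch degrees of freedom
pval  <- 2*pt(-abs(tstat), dfree)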

-thomas

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Sat Oct 15 00:44:23 2005

This archive was generated by hypermail 2.1.8 : Sun 23 Oct 2005 - 18:54:04 EST