From: Kevin B. Hendricks <kevin.hendricks_at_sympatico.ca>

Date: Thu 27 Jul 2006 - 13:31:03 GMT


My first R attempt was a simple one:

# sort the data.frame gd and the sort key
sorder <- order(MDPC)
gd <- gd[sorder, ]
MDPC <- MDPC[sorder]
attach(gd)

# find the length and sum for each unique sort key
XN <- by(MVE, MDPC, length)
XSUM <- by(MVE, MDPC, sum)
GRPS <- levels(as.factor(MDPC))

Well, the ordering and sorting were reasonably fast, but the first "by" statement was still running 4 hours later on my machine (a dual 2.6 GHz Opteron with 4 GB of main memory). The same snippet of code in SAS, running on a slower machine, takes about 5 minutes of system time.
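For what it's worth, base R already has C-level helpers for exactly this pair of grouped statistics: rowsum() computes per-group sums and tabulate() per-group counts, both in compiled code. A small self-contained sketch (the short vectors here are stand-ins for the real MVE/MDPC columns):

```r
# Stand-in data for the real gd$MVE values and MDPC sort key
MDPC <- c("a", "a", "b", "b", "b", "c")
MVE  <- c(1, 2, 3, 4, 5, 6)

g    <- factor(MDPC)
XSUM <- rowsum(MVE, g)   # per-group sums, one row per level, computed in C
XN   <- tabulate(g)      # per-group counts, also in C
GRPS <- levels(g)
```

Neither call needs the data to be pre-sorted, since both group by the factor's level codes rather than by runs.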

I tried various simple R implementations of a "by_sorted" that I thought might help:

# walk the sorted array once, keeping track of the beginning
# and ending points for each unique sort key value in p;
# run function fcn on each such sub-sequence of vector v
# and store the results in a vector
by_sorted <- function(v, p, fcn) {
    key <- p[[1]]
    bp <- 1
    r <- NULL
    for (i in 2:length(p)) {
        if (key != p[[i]]) {
            r <- c(r, fcn(v[bp:(i - 1)]))
            bp <- i
            key <- p[[i]]
        }
    }
    c(r, fcn(v[bp:i]))
}

but they took "forever" to run as well (read: I killed those attempts after 15 minutes of CPU time).

I had the same issue when trying "tapply".
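One way to keep the single-pass idea while avoiding the element-by-element interpreted loop is to let rle() find the run boundaries in the sorted key and then apply fcn to each slice. A sketch (by_sorted2 is a name made up here, and the per-group apply is still interpreted R, so this is not the C-speed implementation being asked for):

```r
# rle() on the sorted key gives the run lengths of each unique key
# value; cumulative sums of those lengths give each group's start
# and end indices, so the boundary-finding is fully vectorized.
by_sorted2 <- function(v, p, fcn) {
    runs   <- rle(as.character(p))
    ends   <- cumsum(runs$lengths)
    starts <- ends - runs$lengths + 1
    mapply(function(b, e) fcn(v[b:e]), starts, ends)
}

by_sorted2(c(1, 2, 3, 4, 5, 6), c("a", "a", "b", "b", "b", "c"), sum)
# 3 12 6
```

It also avoids the quadratic cost of growing r with c() inside the loop, which is part of what makes the loop version above so slow.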

So, unless it already exists someplace, I need a really fast implementation of "by" for very large sorted data sets (probably written in Fortran or C) that will do the equivalent of what SAS does with its "proc univariate by" approach, with close to the same performance. The code should only have to walk the array once (i.e. be linear in time in the number of rows of the vector). I have similar issues with "merge": merging data frames already sorted by the same sort key should also be fast, and it does not appear to be.
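On the merge point, a hedged sketch: match() is implemented in C with hashing and is close to linear, so for frames keyed the same way it can stand in for a two-pointer sorted merge (left, right, and joined are illustrative names, not an existing API):

```r
# Join two frames on a shared key via a hash lookup rather than
# the pairwise machinery of merge(); unmatched keys would yield NA.
left   <- data.frame(key = c("a", "b", "c"), x = 1:3)
right  <- data.frame(key = c("a", "b", "c"), y = 4:6)
idx    <- match(left$key, right$key)   # row of right for each row of left
joined <- cbind(left, y = right$y[idx])
```

This is a left join only, and it does not exploit the sortedness at all, so a dedicated C merge over pre-sorted keys should still beat it.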

Before I jump into this and create my own "sorted large data set" versions of "by" and "merge", I wanted to be sure it would be of interest to others. If they work well and are well implemented (a big if, since I am really still just learning this - the whole point of the project!), would something like this be of any interest for internal use in R? Or is this something too specialized? Is there an R function implemented in C or Fortran that would make a good "model" to follow for implementing something like this? Would/should they be extensions of the current implementations of "merge" and "by", with the addition of an extra sorted=TRUE parameter (defaulting to FALSE)?

Or am I simply barking up the wrong tree here?

Thanks,

Kevin

R-devel@r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Thu Jul 27 23:36:07 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.1.8, at Fri 28 Jul 2006 - 10:29:28 GMT.
