[Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

From: Kevin B. Hendricks <kevin.hendricks_at_sympatico.ca>
Date: Thu 27 Jul 2006 - 13:31:03 GMT


Hi Developers,

I am looking for another new project to help me get more up to speed on R and to learn something outside of R internals. One recent R issue I have run into is finding a fast implementations of the equivalent to the following SAS code:

/* MDPC is an integer sort key made from two integer columns */ MDPC = (MD * 100000) + PCO; /* sort the dataset by the key */
PROC SORT;
   BY MDPC; /* print out count and sum for each unique sort key (subgroup) */ /* use of BY MDPC requires sorting that data set by MDPC first in SAS */ PROC UNIVARIATE NOPRINT;
    VAR MVE;
    BY MDPC;
    OUTPUT OUT=TMP0 N=XN SUM=XSUM; Easy to do in R but the problem is the data set this is being run on has 1,742,201 lines in it and takes up 196,868,713 bytes to store as character data. The sort key has easily has over 200,000 unique keys (if not twice that).

My first R attempt was a simple

# sort the data.frame gd and the sort key
sorder <- order(MDPC)
gd <- gd[sorder,]
MDPC <- MDPC[sorder]
attach(gd)

# find the length and sum for each unique sort key
XN <- by(MVE, MDPC, length)
XSUM <- by(MVE, MDPC, sum)
GRPS <- levels(as.factor(MDPC))

Well the ordering and sorting was reasonably fast but the first "by" statement was still running 4 hours later on my machine (a dual 2.6 gig Opteron with 4 gig of main memory). This same snippet of code in SAS running on a slower machine takes about 5 minutes of system time.

I tried various simple R implementations of a "by_sorted" that I thought might help

# walk sorted array once and keep track of beginning
# and ending points for each unique sort key value in p
# and run function fcn on that sub sequence in vector v
# store the results in a vector

by_sorted <- function(v, p, fcn) {

    key <- p[[1]]
    bp <- 1
    r <- NULL
    for (i in 2:length(p)) {

       if (key != p[[i]]) {
           r <- c(r,fcn(v[bp:i-1]))
           bp <- i
           key <- p[[i]]
       }

    }
    r <- c(r,fcn(v[bp:i]))
}

but they took "forever" to run also (read that I killed those attempts at 15 minutes of cpu time).

I literally had the same issue when trying with "tapply".

So unless it already exists someplace, I need a really fast implementation of "by" for very large sorted data sets (probably written in fortran or c) that will do the equivalent of what SAS does with its "proc univariate by" approach with close to the same performance. The code should only have to walk the array once (ie. be linear in time with the number of rows in the vector). I have similar issues with "merge" as well since merging data frames already sorted by the same sort key should be fast as well and does not appear to be.

Before I jump into this and create my own "sorted large data set" versions of "by" and "merge", I wanted to be sure it would be of interest to others. If they work well and are well implemented (a big if since I am really still just learning this - the whole point of the project!) would something like this be of any interest for internal use of R? Or is this something too specialized? Is there an R function implemented in c or fortran that would make a good "model" to follow for implementing something like this? Would/should they be extensions of current implementations of "merge" and "by" with an additions of a sorted=TRUE (defaulting to FALSE) extra parameter.

Or am I simply barking up the wrong tree here?

Thanks,

Kevin



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Thu Jul 27 23:36:07 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 28 Jul 2006 - 10:29:28 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.