From: Brian D Ripley <ripley_at_stats.ox.ac.uk>

Date: Fri 28 Jul 2006 - 07:06:54 GMT

Which version of R are you looking at? R-devel has

o merge() works more efficiently when there are relatively few matches between the data frames (for example, for 1-1 matching). The order of the result is changed for 'sort = FALSE'.
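For context, a minimal illustration of the kind of merge() call that NEWS item refers to — a 1-1 match on a shared key, with made-up data. With the default sort = TRUE the result is ordered by the key; with sort = FALSE the row order is unspecified, and R-devel changes it.

```r
# Two small data frames with a 1-1 match on "key" (hypothetical data)
a <- data.frame(key = c(3, 1, 2), x = c("c", "a", "b"))
b <- data.frame(key = c(2, 3, 1), y = c(20, 30, 10))

# Default sort = TRUE: result rows come back ordered by the key
m <- merge(a, b, by = "key")
```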

On Thu, 27 Jul 2006, Kevin B. Hendricks wrote:

> Hi Developers,

>
> I am looking for another new project to help me get more up to speed
> on R and to learn something outside of R internals. One recent R
> issue I have run into is finding a fast implementation of the
> equivalent of the following SAS code:
>
> /* MDPC is an integer sort key made from two integer columns */
> MDPC = (MD * 100000) + PCO;
>
> /* sort the data set by the key */
> PROC SORT;
>   BY MDPC;
>
> /* print out count and sum for each unique sort key (subgroup) */
> /* use of BY MDPC requires sorting the data set by MDPC first in SAS */
> PROC UNIVARIATE NOPRINT;
>   VAR MVE;
>   BY MDPC;
>   OUTPUT OUT=TMP0 N=XN SUM=XSUM;
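A minimal R sketch of what this SAS step computes — the per-key count (N=) and sum (SUM=) — using tapply on a small made-up data frame. The names gd, MD, PCO, MDPC and MVE follow the post; the data here are invented for illustration.

```r
# Hypothetical small data set standing in for the poster's gd;
# MD and PCO are integer columns, MVE is the value to summarise.
gd <- data.frame(MD  = c(1L, 1L, 2L, 2L, 2L),
                 PCO = c(7L, 7L, 3L, 3L, 9L),
                 MVE = c(10, 20, 5, 15, 40))

# Composite integer sort key, as in the SAS code
MDPC <- gd$MD * 100000 + gd$PCO

# Per-key count and sum (the N= and SUM= outputs of PROC UNIVARIATE)
XN   <- tapply(gd$MVE, MDPC, length)
XSUM <- tapply(gd$MVE, MDPC, sum)
```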
>
> Easy to do in R, but the problem is that the data set this is being
> run on has 1,742,201 lines in it and takes up 196,868,713 bytes to
> store as character data. The sort key easily has over 200,000 unique
> keys (if not twice that).
>
> My first R attempt was a simple
>
> # sort the data.frame gd and the sort key
> sorder <- order(MDPC)
> gd <- gd[sorder, ]
> MDPC <- MDPC[sorder]
> attach(gd)
>
> # find the length and sum for each unique sort key
> XN <- by(MVE, MDPC, length)
> XSUM <- by(MVE, MDPC, sum)
> GRPS <- levels(as.factor(MDPC))
>
> Well, the ordering and sorting were reasonably fast, but the first
> "by" statement was still running 4 hours later on my machine (a dual
> 2.6 GHz Opteron with 4 GB of main memory). The same snippet of code
> in SAS, running on a slower machine, takes about 5 minutes of system
> time.
>
> I tried various simple R implementations of a "by_sorted" that I
> thought might help:
>
> # walk the sorted array once, keeping track of the beginning
> # and ending points for each unique sort key value in p;
> # run function fcn on that subsequence of vector v
> # and store the results in a vector
**>
**> by_sorted <- function(v, p, fcn) {
**> key <- p[[1]]
**> bp <- 1
**> r <- NULL
**> for (i in 2:length(p)) {
**> if (key != p[[i]]) {
**> r <- c(r,fcn(v[bp:i-1]))
**> bp <- i
**> key <- p[[i]]
**> }
**> }
**> r <- c(r,fcn(v[bp:i]))
**> }
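One reason loops like this crawl is that r <- c(r, ...) copies the whole result vector on every group, making the pass quadratic in the number of groups. A sketch of a base-R alternative that avoids both that and the slow by() dispatch: build the factor once, then let rowsum() and tabulate() do the grouped sum and count in C code. The vector names below follow the post; the data are made up.

```r
# Hypothetical data standing in for the poster's key and value vectors
MDPC <- c(100007, 100007, 200003, 200003, 200009)
MVE  <- c(10, 20, 5, 15, 40)

# Build the grouping factor once; the grouping itself then runs in C
f <- factor(MDPC)

XSUM <- rowsum(MVE, f)   # per-key sums, one row per factor level
XN   <- tabulate(f)      # per-key counts, in level order
GRPS <- levels(f)
```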
>
> but they also took "forever" to run (read: I killed those attempts
> at 15 minutes of CPU time).
>
> I had literally the same issue when trying "tapply".
>
> So unless it already exists someplace, I need a really fast
> implementation of "by" for very large sorted data sets (probably
> written in Fortran or C) that will do the equivalent of what SAS does
> with its "proc univariate by" approach, with close to the same
> performance. The code should only have to walk the array once (i.e.
> be linear in time in the number of rows of the vector). I have
> similar issues with "merge" as well, since merging data frames
> already sorted by the same sort key should be fast too, and does not
> appear to be.
>
> Before I jump into this and create my own "sorted large data set"
> versions of "by" and "merge", I wanted to be sure it would be of
> interest to others. If they work well and are well implemented (a
> big if, since I am really still just learning this - the whole point
> of the project!), would something like this be of any interest for
> internal use in R? Or is this something too specialized?
> Is there an R function implemented in C or Fortran that would make a
> good "model" to follow for implementing something like this?
> Would/should they be extensions of the current implementations of
> "merge" and "by", with the addition of a sorted=TRUE (defaulting to
> FALSE) parameter?
>
> Or am I simply barking up the wrong tree here?
>
> Thanks,
>
> Kevin
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Fri Jul 28 18:19:45 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.1.8, at Fri 28 Jul 2006 - 18:27:34 GMT.
