Re: [Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

From: Bill Dunlap <bill_at_insightful.com>
Date: Fri 28 Jul 2006 - 20:51:32 GMT

On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:

>
> In my particular case, the problem was my data frame had over 1
> million lines had probably over 500,000 unique sort keys (ie. think
> of it as an R factor with over 500,000 levels). The implementation
> of "by" uses "tapply" which in turn uses "split". So "split" simply
> ate up all the time trying to create 500,000 vectors each of short
> length 1, 2, or 3; and the associated garbage collection.
>
> I simple loop that walked the short sequence of values (since the
> data frame was already sorted) calculating what it needed, would work
> much faster than splitting the original vector into so very many
> smaller vectors (and the associated copying of data).
>
> That problem is very similar problem to the calculation of basic
> stats on a short moving window over a very long vector.
>
> I will look into that package and maybe use it for a model for what I
> want to do.

Splus8.0 has something like what you are talking about that provides a fast way to compute

    sapply(split(xVector, integerGroupCode), summaryFunction) for some common summary functions. The 'integerGroupCode' is typically the codes from a factor, but you could compute it in other ways. It needs to be a "small" integer in the range 1:ngroups (like the 'bin' argument to tabulate). Like tabulate(), which is called from table(), these are meant to be called from other functions that can set up appropriate group codes. E.g., groupSums or rowSums or fancier things could be based on this.

They do not insist you sort the input in any way. That would really only be useful for group medians and we haven't written that one yet.

The typical prototype is

> igroupSums

function(x, group = NULL, na.rm = F, weights = NULL, ngroups = if(is.null(

        group)) 1 else max(as.integer(group), na.rm = T))

and the currently supported summary functions are

   mean : igroupMeans
   sum : igroupSums
   prod : igroupProds
   min : igroupMins
   max : igroupMaxs
   range : igroupRanges
   any : igroupAnys
   all : igroupAlls
(The plurals are not all properly spelled and they are plural so match related functions like rowSums.) It might be useful to also have one to count the number of non-missing values in each group of x's.

They are fast:

   > x<-runif(2e6)
   > i<-rep(1:1e6, 2)
   > sys.time(sx <- igroupSums(x,i))

   [1] 0.66 0.67
   > length(sx)
   [1] 1000000

On that machine R takes 44 seconds to go the lapply/split route:

   > unix.time(unlist(lapply(split(x,i), sum)))    [1] 43.24 0.78 44.11 0.00 0.00

To save setup time in the S code, out-of-range values in the group argument (negatives, values greater than ngroup, and NA's), mean that the correponding element in x is ignored.



Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do  not necessarily reflect Insightful Corporation policy or position."



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sat Jul 29 06:53:53 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sat 29 Jul 2006 - 00:27:25 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.