From: Bill Dunlap <bill_at_insightful.com>

Date: Fri 28 Jul 2006 - 20:51:32 GMT

[1] 0.66 0.67

> length(sx)

[1] 1000000

Bill Dunlap

Insightful Corporation

bill at insightful dot com

360-428-8146

R-devel@r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sat Jul 29 06:53:53 2006

Date: Fri 28 Jul 2006 - 20:51:32 GMT

On Fri, 28 Jul 2006, Kevin B. Hendricks wrote:

*>
*

> In my particular case, the problem was my data frame had over 1

*> million lines had probably over 500,000 unique sort keys (ie. think
**> of it as an R factor with over 500,000 levels). The implementation
**> of "by" uses "tapply" which in turn uses "split". So "split" simply
**> ate up all the time trying to create 500,000 vectors each of short
**> length 1, 2, or 3; and the associated garbage collection.
**>
**> I simple loop that walked the short sequence of values (since the
**> data frame was already sorted) calculating what it needed, would work
**> much faster than splitting the original vector into so very many
**> smaller vectors (and the associated copying of data).
**>
**> That problem is very similar problem to the calculation of basic
**> stats on a short moving window over a very long vector.
**>
**> I will look into that package and maybe use it for a model for what I
**> want to do.
*

Splus8.0 has something like what you are talking about that provides a fast way to compute

*> igroupSums
*

function(x, group = NULL, na.rm = F, weights = NULL, ngroups = if(is.null(

group)) 1 else max(as.integer(group), na.rm = T))

and the currently supported summary functions are

mean : igroupMeans

sum : igroupSums

prod : igroupProds

min : igroupMins

max : igroupMaxs

range : igroupRanges

any : igroupAnys

all : igroupAlls

(The plurals are not all properly spelled and they
are plural so match related functions like rowSums.)
It might be useful to also have one to count the number
of non-missing values in each group of x's.

> x<-runif(2e6) > i<-rep(1:1e6, 2) > sys.time(sx <- igroupSums(x,i))

[1] 0.66 0.67

> length(sx)

[1] 1000000

On that machine R takes 44 seconds to go the lapply/split route:

> unix.time(unlist(lapply(split(x,i), sum))) [1] 43.24 0.78 44.11 0.00 0.00

To save setup time in the S code, out-of-range values in the group argument (negatives, values greater than ngroup, and NA's), mean that the correponding element in x is ignored.

Bill Dunlap

Insightful Corporation

bill at insightful dot com

360-428-8146

"All statements in this message represent the opinions of the author and do not necessarily reflect Insightful Corporation policy or position."

R-devel@r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sat Jul 29 06:53:53 2006

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.1.8, at Sat 29 Jul 2006 - 00:27:25 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*