Re: [Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

From: Kevin B. Hendricks <kevin.hendricks_at_sympatico.ca>
Date: Fri 28 Jul 2006 - 18:53:57 GMT

Hi,

> There was a performance comparison of several moving average
> approaches here:
> http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html
>

Thanks for that link. It is not quite the same thing but is very similar.

The idea is to somehow make functions that work well over small sub-sequences of a much longer vector without resorting to splitting the vector into many smaller vectors.

In my particular case, the problem was that my data frame had over 1 million rows and probably over 500,000 unique sort keys (i.e., think of it as an R factor with over 500,000 levels). The implementation of "by" uses "tapply", which in turn uses "split". So "split" simply ate up all the time trying to create 500,000 vectors, each of short length 1, 2, or 3, plus the associated garbage collection.

A simple loop that walked each short sequence of values (since the data frame was already sorted), calculating what it needed, would work much faster than splitting the original vector into so very many smaller vectors (and the associated copying of data).
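
To make the idea concrete, here is a rough sketch of such a loop. It is written in Python rather than R purely for illustration, and it is not code from this thread: the function name grouped_sums and the sample data are hypothetical. It assumes the keys are already sorted, so equal keys are adjacent, and it uses a sum as the per-group calculation.

```python
def grouped_sums(keys, values):
    """Sum values per key in a single pass.

    Assumes `keys` is already sorted, so all equal keys are adjacent.
    No per-group vectors are created; only the running totals are kept.
    """
    if len(keys) != len(values):
        raise ValueError("keys and values must have the same length")

    out_keys = []  # one entry per run of equal keys
    out_sums = []  # running total for the matching entry in out_keys

    for key, value in zip(keys, values):
        if out_keys and out_keys[-1] == key:
            # Still inside the current run: accumulate in place.
            out_sums[-1] += value
        else:
            # Key changed: start a new run.
            out_keys.append(key)
            out_sums.append(value)

    return out_keys, out_sums


# Hypothetical sample data, sorted by key.
keys = ["a", "a", "b", "c", "c", "c"]
values = [1, 2, 3, 4, 5, 6]
print(grouped_sums(keys, values))
```

The work is proportional to the length of the input, and the only allocations are the two result lists, which is the property that matters when most groups have only 1, 2, or 3 elements.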

That problem is very similar to the calculation of basic stats on a short moving window over a very long vector.
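
The moving-window case can be handled in the same spirit. The following is only a sketch under stated assumptions: it is Python rather than R, the name moving_mean is hypothetical, it is not taken from caTools or from the linked comparison, and it computes just the mean over complete windows. It keeps one running sum instead of copying each window out of the long vector.

```python
def moving_mean(x, width):
    """Mean over each complete window of `width` consecutive values.

    Keeps a single running sum and never copies a window out of `x`.
    Returns len(x) - width + 1 values (none if x is shorter than width).
    """
    if width < 1:
        raise ValueError("width must be at least 1")

    result = []
    running = 0.0
    for i, value in enumerate(x):
        running += value
        if i >= width:
            # Drop the value that just slid out of the window.
            running -= x[i - width]
        if i >= width - 1:
            # The window ending at position i is now complete.
            result.append(running / width)
    return result


print(moving_mean([1, 2, 3, 4, 5], 3))
```

For floating-point data a running sum can accumulate rounding error over a very long vector, so a production version might need a more careful summation scheme.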

> The author of that message ultimately wrote the caTools R package
> which contains some optimized versions.

I will look into that package and maybe use it as a model for what I want to do.

Thanks,

Kevin



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sat Jul 29 04:56:43 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 28 Jul 2006 - 22:27:45 GMT.