# Re: [Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Fri 28 Jul 2006 - 19:55:37 GMT

>>>>> "Kevin" == Kevin B Hendricks <kevin.hendricks@sympatico.ca>
>>>>>     on Fri, 28 Jul 2006 14:53:57 -0400 writes:

[.........]

```
    Kevin> The idea is to somehow make functions that work well
    Kevin> over small sub-sequences of a much longer vector
    Kevin> without resorting to splitting the vector into many
    Kevin> smaller vectors.

    Kevin> In my particular case, the problem was my data frame
    Kevin> unique sort keys (ie. think of it as an R factor with
    Kevin> over 500,000 levels).  The implementation of "by"
    Kevin> uses "tapply" which in turn uses "split".  So "split"
    Kevin> simply ate up all the time trying to create 500,000
    Kevin> vectors each of short length 1, 2, or 3; and the
    Kevin> associated garbage collection.
```

Not that I have spent enough time thinking about this thread's topic, but I have seen more than one case where using tapply() unnecessarily slowed down computations.
I don't remember the details, but I know that in one case, replacing tapply() by a few lines of code {one of which used lapply(), IIRC} sped up that computation by a factor of perhaps 2 or more.

I also vaguely remember that I thought about making tapply() faster, but came to the conclusion that there was no quick way to speed it up, because it works in a considerably more general context than the one it was used in for that application (and maybe yours?).
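To make the sorted-data idea concrete, here is a minimal sketch (not code from this thread) of a per-group sum that never calls split(). It assumes the keys are already sorted, finds where each run of equal keys ends, and takes differences of one cumulative sum. The function name `sorted_group_sum` and the sample data are hypothetical, and the trick as written only covers sums; other statistics would need their own single-pass formulation.

```r
# Hypothetical example data: 'key' is already sorted, 'x' holds the values.
key <- c(1, 1, 2, 3, 3, 3, 4)
x   <- c(10, 20, 5, 1, 2, 3, 7)

# Per-group sums over runs of equal keys, without split() or tapply().
# Assumes 'key' is sorted so that equal keys are adjacent.
sorted_group_sum <- function(x, key) {
  n <- length(key)
  if (n == 0L) return(numeric(0))
  # A run ends wherever the next key differs, and at the last element.
  ends <- c(which(key[-1L] != key[-n]), n)
  cs <- cumsum(x)
  sums <- diff(c(0, cs[ends]))   # total of each run
  names(sums) <- key[ends]
  sums
}

res <- sorted_group_sum(x, key)
ref <- tapply(x, key, sum)       # reference result via the general route
```

On this small input `res` agrees with `ref`. Whether it is actually faster on 500,000 groups would have to be measured, and for very long floating-point vectors the cumulative sum can lose some precision compared with summing each group separately.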

```
    Kevin> A simple loop that walked the short sequence of
    Kevin> values (since the data frame was already sorted)
    Kevin> calculating what it needed, would work much faster
    Kevin> than splitting the original vector into so very many
    Kevin> smaller vectors (and the associated copying of data).

    Kevin> That problem is very similar to the
    Kevin> calculation of basic stats on a short moving window
    Kevin> over a very long vector.
```

>> The author of that message ultimately wrote the caTools R
>> package which contains some optimized versions.
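For the moving-window case, a minimal base-R sketch (an illustration only, not taken from caTools or from this thread) shows the same principle: one cumulative sum replaces the extraction of many short sub-vectors. The function name `running_mean` and the window width `k` are hypothetical.

```r
# Running mean over every full window of width k, using a single
# cumulative sum instead of length(x) - k + 1 short sub-vectors.
running_mean <- function(x, k) {
  n <- length(x)
  stopifnot(k >= 1, k <= n)
  cs <- c(0, cumsum(x))
  # Sum of x[i:(i + k - 1)] is cs[i + k] - cs[i].
  (cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)]) / k
}

x <- c(1, 2, 3, 4, 5, 6)
rm3 <- running_mean(x, 3)   # means of (1,2,3), (2,3,4), (3,4,5), (4,5,6)
```

This returns only the full windows, with no padding at the ends. As with any cumulative-sum approach, rounding error can accumulate on very long vectors.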

Kevin> I will look into that package and maybe use it for a
Kevin> model for what I want to do.

Kevin> Thanks,

Kevin> Kevin


R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Sat Jul 29 05:57:13 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 28 Jul 2006 - 20:28:16 GMT.
