Re: [Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Fri 28 Jul 2006 - 19:55:37 GMT

>>>>> "Kevin" == Kevin B Hendricks <kevin.hendricks@sympatico.ca>
>>>>>     on Fri, 28 Jul 2006 14:53:57 -0400 writes:

    [.........]

    Kevin> The idea is to somehow make functions that work well
    Kevin> over small sub-sequences of a much longer vector
    Kevin> without resorting to splitting the vector into many
    Kevin> smaller vectors.

    Kevin> In my particular case, the problem was that my data
    Kevin> frame had over 1 million lines and probably over
    Kevin> 500,000 unique sort keys (i.e. think of it as an R
    Kevin> factor with over 500,000 levels).  The implementation
    Kevin> of "by" uses "tapply", which in turn uses "split".  So
    Kevin> "split" simply ate up all the time trying to create
    Kevin> 500,000 vectors, each of short length 1, 2, or 3,
    Kevin> along with the associated garbage collection.

Not that I have spent enough time thinking about this thread's topic, but I have seen more than one case where using tapply() unnecessarily slowed down computations.
I don't remember the details, but I know that in one case, replacing tapply() with a few lines of code {one of which used lapply(), IIRC} sped up that computation by a factor of perhaps 2 or more.

I also vaguely remember that I thought about making tapply() faster, but came to the conclusion that it could not be sped up quickly, because it works in a considerably more general context than the one it was used in for that application (and maybe yours?).

    Kevin> A simple loop that walked the short sequence of
    Kevin> values (since the data frame was already sorted)
    Kevin> calculating what it needed, would work much faster
    Kevin> than splitting the original vector into so very many
    Kevin> smaller vectors (and the associated copying of data).
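
The idea of walking runs of equal keys in already-sorted data can be sketched in R roughly as follows. This is only an illustration under stated assumptions, not code from this thread: `by_sorted` is a hypothetical helper name, the keys are assumed to be sorted (or at least grouped into contiguous runs), and `FUN` is assumed to return a single number. It avoids building a factor and calling split(), but it still subsets `x` once per run, and no timings are claimed here.

```r
# Hypothetical helper (not part of base R): apply FUN to each run of
# equal keys in a vector whose keys are already sorted.
by_sorted <- function(x, key, FUN) {
  n <- length(x)
  stopifnot(length(key) == n, n > 0L)
  # A run starts at position 1 and wherever the key changes.
  starts <- c(1L, which(key[-1L] != key[-n]) + 1L)
  # Each run ends just before the next one starts; the last ends at n.
  ends <- c(starts[-1L] - 1L, n)
  # Assumes FUN returns one numeric value per run.
  out <- vapply(seq_along(starts),
                function(i) FUN(x[starts[i]:ends[i]]),
                numeric(1))
  names(out) <- key[starts]
  out
}

key <- c(1, 1, 2, 3, 3, 3)        # already sorted
x   <- c(10, 20, 5, 1, 2, 3)
res <- by_sorted(x, key, sum)     # per-key sums
```

For this small input the result should agree with tapply(x, key, sum); whether it is actually faster on a million rows would have to be measured.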

    Kevin> That problem is very similar to the calculation of
    Kevin> basic stats on a short moving window over a very long
    Kevin> vector.
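
To make the moving-window case concrete, here is a minimal sketch of a running mean computed from cumulative sums, so that no per-window vectors are created. `running_mean` is a hypothetical name chosen for this illustration; it assumes a numeric vector without NAs and a window width `k` no longer than the vector. On very long vectors the cumulative sum can accumulate floating-point error, so treat it as a sketch rather than a robust implementation.

```r
# Hypothetical helper: mean over every window of k consecutive values,
# computed from differences of a cumulative sum.
running_mean <- function(x, k) {
  n <- length(x)
  stopifnot(k >= 1L, k <= n)
  cs <- c(0, cumsum(x))           # cs[i + 1] is the sum of x[1:i]
  # Sum of x[j:(j + k - 1)] equals cs[j + k] - cs[j].
  (cs[(k + 1L):(n + 1L)] - cs[1L:(n - k + 1L)]) / k
}

rm3 <- running_mean(c(1, 2, 3, 4, 5), 3)
```

The result has length n - k + 1, one value per complete window.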

>> The author of that message ultimately wrote the caTools R
>> package which contains some optimized versions.

    Kevin> I will look into that package and maybe use it as a
    Kevin> model for what I want to do.

    Kevin> Thanks,

    Kevin> Kevin

    Kevin> ______________________________________________
    Kevin> R-devel@r-project.org mailing list
    Kevin> https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Sat Jul 29 05:57:13 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 28 Jul 2006 - 20:28:16 GMT.
