Re: [Rd] Any interest in "merge" and "by" implementations specifically for sorted data?

From: Kevin B. Hendricks <kevin.hendricks_at_sympatico.ca>
Date: Sat 29 Jul 2006 - 20:05:09 GMT

Hi Bill,

So you wrote one routine that can calculate any single of a variety of stats and allows weights, is that right? Can it return a data frame of any subset of requested stats as well (that is what I was thinking of doing anyway).

I think someone can easily calculate all of those things in one pass through the array and then allow the user to select which of the new columns of stats should be added to a data.frame that is returned to the user.

To test all of this, I simply wrote my own igroupSums and integrated it into r-devel based on the code in split.c. I can easily modify it to handle the case of calculating a variety of stats (even all at the same time if desired). I do not deal with "weights" at all and ignored that for now.

Here is what your test case now shows on my machine with the latest R build
with my added igroupSums routine (added internally to R).

> x <- rnorm(2e6)
> i <- rep(1:1e6,2)
> unix.time(Asums <- unlist(lapply(split(x,i),sum)))
[1] 8.940 0.112 9.053 0.000 0.000

> names(Asums) <- NULL

My version of igroupSums does not keep the names so I remove them to make the results comparable.

Here is my my own internal function igroupSums

> unix.time(Bsums <- igroupSums(x,i))

[1] 0.932 0.024 0.958 0.000 0.000
>
> all.equal(Asums, Bsums)

[1] TRUE So the speed up is quite significant (9.053 seconds vs 0.858 seconds).

I will next modify my code to handle any single one of maxs, mins, sums, counts, anys, alls, means, prods, and ranges by user choice. Although I will leave the use of weights as unimplemented for now (I always get mixed up thinking about weights and basic stats and I never use them so ...)

In case others want to play around with this too, here is the R wrapper in igroupSums.R to put in src/library/base/R/

igroupSums <- function(x, f, drop = FALSE, ...) UseMethod("igroupSums")

igroupSums.default <- function(x, f, drop=FALSE, ...) {

     if(length(list(...))) .NotYetUsed(deparse(...), error = FALSE)

     if (is.list(f)) f <- interaction(f, drop = drop)
     else if (drop || !is.factor(f)) # drop extraneous levels
         f <- factor(f)
     storage.mode(f) <- "integer"  # some factors have double
     if (is.null(attr(x, "class")))
         return(.Internal(igroupSums(x, f)))
     ## else
     r <- by(x,f,sum)
     r

}

igroupSums.data.frame <- function(x, f, drop = FALSE, ...)

     lapply(split(seq(length=nrow(x)), f, drop = drop, ...),
            function(ind) x[ind, , drop = FALSE])


And here is a very simple igroupSums.c to put in src/main/

It still needs a lot of work since it does not handle NAs in the vector x yet and still needs to be modified into a general routine to handle any single function of counts, sums, maxs, mins, means, prods, anys, alls, and ranges

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include "Defn.h"

SEXP attribute_hidden do_igroupSums(SEXP call, SEXP op, SEXP args, SEXP env)
{

     SEXP x, f, sums;
     int i, j, nobs, nlevs, nfac;

     checkArity(op, args);

     x = CAR(args);
     f = CADR(args);
     if (!isVector(x))
         errorcall(call, _("first argument must be a vector"));
     if (!isFactor(f))
         errorcall(call, _("second argument must be a factor"));
     nlevs = nlevels(f);
     nfac = LENGTH(CADR(args));
     nobs = LENGTH(CAR(args));
     if (nobs <= 0)
         return R_NilValue;
     if (nfac <= 0)
         errorcall(call, _("Group length is 0 but data length > 0"));
     if (nobs % nfac != 0)
         warningcall(call, _("data length is not a multiple of split  
variable"));
     PROTECT(sums = allocVector(TYPEOF(x), nlevs));
     switch (TYPEOF(x)) {
     case INTSXP:
         for (i=0; i < nlevs; i++) INTEGER(sums)[i] = 0;
         break;
     case REALSXP:
         for (i=0; i < nlevs; i++) REAL(sums)[i] = 0.0;
         break;
     default:
         UNIMPLEMENTED_TYPE("igroupSums", x);
     }
     for (i = 0;  i < nobs; i++) {
         j = INTEGER(f)[i % nfac];
         if (j != NA_INTEGER) {
             j--;
             switch (TYPEOF(x)) {
             case INTSXP:
                 INTEGER(sums)[j] = INTEGER(sums)[j] + INTEGER(x)[i];
                 break;
             case REALSXP:
                 REAL(sums)[j] = REAL(sums)[j] + REAL(x)[i];
                 break;
             default:
                 UNIMPLEMENTED_TYPE("igroupSums", x);
             }
         }
     }
     UNPROTECT(1);
     return sums;

}

If anyone is playing with this themselves, don't forget to update Internal.h and names.c to reflect the added routine before you make clean and then rebuild.

Once I finish, I will post me patches here and then if someone would like to modify them to implement "weights", please let me know.

Even if these never get added to R I can keep them in my own tree and use them for my own work.

Thanks again for all of your hints and guidance. This alone will speed up my R code greatly!

Kevin

> That is roughly what I did in C code for the Splus version.
> E.g., here is the integer x case for sums and means.  It accumlates
> the weighted group sum and the group weight so that if we
> are doing the mean it has the right divisor.  It would
> go a bit faster if I didn't bother with the group weight
> in the sum case.
>
> (I was mostly interested in seeing if this function's interface
> was sufficiently general for your uses.  Computing the integer
> group codes can sometimes be a pain and there are cases where you
> might want to combine that with computing the grouped summaries.
> I am guessing that in most cases you want those two functions
> to be separate.)
>
> case S_SUM: /*FALLTHROUGH*/
> case S_MEAN:
>     for(i=0;i<length;i++) {
>         bin = binp ? *binp++ - 1 : 0 ; /* bin is now 0-based */
>         weight = weighted ? *weightp++ : 1.0 ;
>         x = *xp++ ;
>         if (is_na_INT(&bin) || bin<0 || bin>=maxbin || weight==0.0)
>             continue ;
>         if (is_na_DOUBLE(&groupWeightp[bin]))
>             continue ;
>         if (is_na_DOUBLE(&x) || is_na_DOUBLE(&weight)) {
>             if (!na_rm) {
>                 na_set3(&valuep[bin], value->mode, To_NA);
>                 na_set3(&groupWeightp[bin], groupWeight->mode, To_NA);
>             }
>             continue ;
>         }
>         if (!is_na_DOUBLE(&valuep[bin])) {
>             valuep[bin] += x * weight ;
>             groupWeightp[bin] += weight ; /* groupWeightp and  
> valuep should both have same NA-ness */
>         }
>     }
>     if (which==S_MEAN) {
>         for(ibin=0;ibin<maxbin;ibin++) {
>             if (is_na_DOUBLE(&valuep[ibin])) {
>                 /* leave valuep[ibin] as NA */
>             } else if (groupWeightp[ibin]==0.0) {
>                 /* valuep[ibin] must be 0.0 if groupWeightp[ibin]  
> is */
>                 na_set3(&valuep[ibin], value->mode, To_NaN) ;
>             } else {
>                 valuep[ibin] = valuep[ibin] / groupWeightp[ibin] ;
>             }
>         }
>     }
>     break;

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sun Jul 30 06:10:23 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sun 30 Jul 2006 - 14:27:00 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.