Re: [Rd] Suggestions to speed up median() and has.na()

From: Bill Dunlap <bill_at_insightful.com>
Date: Mon 10 Apr 2006 - 23:42:55 GMT

On Mon, 10 Apr 2006, Thomas Lumley wrote:

> On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
>
> > Hi,
> >
> > I've got two suggestions how to speed up median() about 50%. For all
> > iterative methods calling median() in the loops this has a major
> > impact. The second suggestion will apply to other methods too.

>
> > Suggestion 2:
> > Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
> > as soon as a NA value is detected. In the best case it returns after
> > the first index with TRUE, in the worst case it returns after the last
> > index N with FALSE. The cost for is.na(x) is always O(N), and any()
> > in the best case O(1) and in the worst case O(N) (if any() is
> > implemented as I hope). An has.na() function would be very useful
> > elsewhere too.
>
> This sounds useful (though it has missed the deadline for 2.3.0).
>
> It won't help if the typical case is no missing values, as you suggest,
> but it will be faster when there are missing values.

Splus has such a function, but it is called anyMissing(). In the interests of interoperability it would be nice if R used that name. (I did not choose the name, but that is what it is.)

The following experiment using Splus seems to indicate the speedup has less to do with stopping at the first NA than it does with not making/filling/copying/whatever the big vector of logicals that is.na returns.

   > # NA near start of list of 10 million integers    > { z<-replace(1:1e7,2,NA); unix.time(anyMissing(z)) }    [1] 0 0 0 0 0
   > { z<-replace(1:1e7,2,NA); unix.time(any(is.na(z)))}    [1] 0.62 0.13 0.75 0.00 0.00

   > # NA at end of list
   > { z<-replace(1:1e7,1e7,NA); unix.time(anyMissing(z)) }    [1] 0.07 0.00 0.07 0.00 0.00
   > { z<-replace(1:1e7,1e7,NA); unix.time(any(is.na(z)))}    [1] 0.64 0.11 0.75 0.00 0.00

The Splus anyMissing is an s3 generic (i.e., it calls UseMethod()). The Splus is.na is an s4 generic and its default method may invoke an s3 generic.

> > BTW, without having checked the source code, it looks like is.na() is
> > unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
> > a vector without NAs. On the other hand, is.na(sum(x)) becomes
> > awfully slow if 'x' contains NAs.
> >
>
> I don't think it is unnecessarily slow. It has to dispatch methods and
> it has to make sure that matrix structure is preserved. After that the
> code is just
>
> case REALSXP:
> for (i = 0; i < n; i++)
> LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
> break;
>
> and it's hard to see how that can be improved. It does suggest that a
> faster anyNA() function would have to not be generic.



Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do  not necessarily reflect Insightful Corporation policy or position."



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Tue Apr 11 10:11:15 2006

This archive was generated by hypermail 2.1.8 : Tue 11 Apr 2006 - 02:17:05 GMT