Re: [Rd] Suggestions to speed up median() and has.na()

From: Henrik Bengtsson <hb_at_maths.lth.se>
Date: Tue 11 Apr 2006 - 11:56:16 GMT

On 4/11/06, Thomas Lumley <tlumley@u.washington.edu> wrote:
> On Mon, 10 Apr 2006, Duncan Murdoch wrote:
>
> > On 4/10/2006 7:22 PM, Thomas Lumley wrote:
> >> On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
> >>
> >>> Suggestion 2:
> >>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
> >>> as soon as a NA value is detected. In the best case it returns after
> >>> the first index with TRUE, in the worst case it returns after the last
> >>> index N with FALSE. The cost for is.na(x) is always O(N), and any()
> >>> in the best case O(1) and in the worst case O(N) (if any() is
> >>> implemented as I hope). An has.na() function would be very useful
> >>> elsewhere too.
> >>
> >> This sounds useful (though it has missed the deadline for 2.3.0).
> >>
> >> It won't help if the typical case is no missing values, as you suggest, but
> >> it will be faster when there are missing values.

I would still argue that is.na(x) does N comparisons and any() does at least one and in the worst case N, in total [N+1,2N], but optimally you can get away with [1,N]. In my case (see below) N is 2,000, then with 100,000 such vectors and an iteration algorithm with maxIter=20, I will save up to 4,000,000,000 comparisons!

> > I think it would help even in that case if the vector is large, because it
> > avoids allocating and disposing of the logical vector of the same length as
> > x.
>
> That makes sense. I have just tried, and for vectors of length ten
> million it does make a measurable difference.

Yes. It makes a huge difference for long vectors or many short vectors. Take for instance rlm() is MASS. With the default arguments, it uses median(abs(x)) to estimate the scale in each iteration. In my case (Affymetrix SNP data), I know for sure that 'x' never contains NAs, or even if it did, I could move that check outside the iteration loop. Even if the length of each 'x' is only 20*100, with a dataset of >100,000 such vectors, in one step, I managed to cut down the analysis from 14 hours to 7 hours! This is probably, as you say, due to allocation/copying/deallocation.

>
> >>> BTW, without having checked the source code, it looks like is.na() is
> >>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
> >>> a vector without NAs. On the other hand, is.na(sum(x)) becomes
> >>> awfully slow if 'x' contains NAs.
> >>>
> >>
> >> I don't think it is unnecessarily slow. It has to dispatch methods and it
> >> has to make sure that matrix structure is preserved. After that the code
> >> is just
> >>
> >> case REALSXP:
> >> for (i = 0; i < n; i++)
> >> LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
> >> break;
> >>
> >> and it's hard to see how that can be improved. It does suggest that a
> >> faster anyNA() function would have to not be generic.
> >
> > If it's necessary to make it not generic to achieve the speedup, I don't
> > think it's worth doing. If anyNA is written not to be generic I'd guess a
> > very common error will be to apply it to a dataframe and get a misleading
> > "FALSE" answer. If we do that, I predict that the total amount of r-help
> > time wasted on it will exceed the CPU time saved by orders of magnitude.
> >
>
> I wasn't proposing that it should be stupid, just not generic. It could
> support data frames (sum(), does, for example). If it didn't support data
> frames it should certainly give an error rather than the wrong answer, but
> if we are seriously trying to avoid delays around 0.1 seconds then going
> through the generic function mechanism may be a problem.

Can we do both and let certain functions such as median() call the default/internal method explicitly? Or a low-level and a high-level function? Possibly introducing an 'hasNA' argument to many of our functions, or let 'na.rm=NA' indicate that we no that there are no NAs; when we know for sure that there is no NAs, even with an optimal anyNA() function we will waste CPU time in iterative methods.

BTW, I did a quick scan in the R source code and I found 106 occurances of "any(is.na(" not looking at any CRAN or Bioinductor packages.

Best,

Henrik

>
> -thomas
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Tue Apr 11 22:24:47 2006

This archive was generated by hypermail 2.1.8 : Tue 11 Apr 2006 - 14:17:00 GMT