Re: [Rd] Suggestions to speed up median() and has.na()

From: Thomas Lumley <tlumley_at_u.washington.edu>
Date: Tue 11 Apr 2006 - 00:08:02 GMT

On Mon, 10 Apr 2006, Duncan Murdoch wrote:

> On 4/10/2006 7:22 PM, Thomas Lumley wrote:
>> On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
>>
>>> Suggestion 2:
>>> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
>>> as soon as a NA value is detected. In the best case it returns after
>>> the first index with TRUE, in the worst case it returns after the last
>>> index N with FALSE. The cost for is.na(x) is always O(N), and any()
>>> in the best case O(1) and in the worst case O(N) (if any() is
>>> implemented as I hope). An has.na() function would be very useful
>>> elsewhere too.
>>
>> This sounds useful (though it has missed the deadline for 2.3.0).
>>
>> It won't help if the typical case is no missing values, as you suggest, but
>> it will be faster when there are missing values.
>
> I think it would help even in that case if the vector is large, because it
> avoids allocating and disposing of the logical vector of the same length as
> x.

That makes sense. I have just tried, and for vectors of length ten million it does make a measurable difference.

>>> BTW, without having checked the source code, it looks like is.na() is
>>> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
>>> a vector without NAs. On the other hand, is.na(sum(x)) becomes
>>> awfully slow if 'x' contains NAs.
>>>
>>
>> I don't think it is unnecessarily slow. It has to dispatch methods and it
>> has to make sure that matrix structure is preserved. After that the code
>> is just
>>
>> case REALSXP:
>> for (i = 0; i < n; i++)
>> LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
>> break;
>>
>> and it's hard to see how that can be improved. It does suggest that a
>> faster anyNA() function would have to not be generic.
>
> If it's necessary to make it not generic to achieve the speedup, I don't
> think it's worth doing. If anyNA is written not to be generic I'd guess a
> very common error will be to apply it to a dataframe and get a misleading
> "FALSE" answer. If we do that, I predict that the total amount of r-help
> time wasted on it will exceed the CPU time saved by orders of magnitude.
>

I wasn't proposing that it should be stupid, just not generic. It could support data frames (sum(), does, for example). If it didn't support data frames it should certainly give an error rather than the wrong answer, but if we are seriously trying to avoid delays around 0.1 seconds then going through the generic function mechanism may be a problem.

         -thomas



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Tue Apr 11 10:53:50 2006

This archive was generated by hypermail 2.1.8 : Tue 11 Apr 2006 - 14:17:00 GMT