Re: [Rd] Suggestions to speed up median() and

From: Duncan Murdoch <>
Date: Tue 11 Apr 2006 - 00:54:39 GMT

On 4/10/2006 8:08 PM, Thomas Lumley wrote:
> On Mon, 10 Apr 2006, Duncan Murdoch wrote:

>> On 4/10/2006 7:22 PM, Thomas Lumley wrote:
>>> On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
>>>> Suggestion 2:
>>>> Create a function to replace any( that returns TRUE
>>>> as soon as a NA value is detected.  In the best case it returns after
>>>> the first index with TRUE, in the worst case it returns after the last
>>>> index N with FALSE.  The cost for is always O(N), and any()
>>>> in the best case O(1) and in the worst case O(N) (if any() is
>>>> implemented as I hope).  An function would be very useful
>>>> elsewhere too.
>>> This sounds useful (though it has missed the deadline for 2.3.0).
>>> It won't help if the typical case is no missing values, as you suggest, but 
>>> it will be faster when there are missing values.
>> I think it would help even in that case if the vector is large, because it 
>> avoids allocating and disposing of the logical vector of the same length as 
>> x.

> That makes sense. I have just tried, and for vectors of length ten
> million it does make a measurable difference.
>>>> BTW, without having checked the source code, it looks like is
>>>> unnecessarily slow; is much faster than any( on
>>>> a vector without NAs.  On the other hand, becomes
>>>> awfully slow if 'x' contains NAs.
>>> I don't think  it is unnecessarily slow.  It has to dispatch methods and it 
>>> has to make sure that matrix structure is preserved.  After that the code 
>>> is just
>>>      case REALSXP:
>>>          for (i = 0; i < n; i++)
>>>              LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
>>>          break;
>>> and it's hard to see how that can be improved. It does suggest that a 
>>> faster anyNA() function would have to not be generic.
>> If it's necessary to make it not generic to achieve the speedup, I don't 
>> think it's worth doing.  If anyNA is written not to be generic I'd guess a 
>> very common error will be to apply it to a dataframe and get a misleading 
>> "FALSE" answer.  If we do that, I predict that the total amount of r-help 
>> time wasted on it will exceed the CPU time saved by orders of magnitude.

> I wasn't proposing that it should be stupid, just not generic. It could
> support data frames (sum(), does, for example). If it didn't support data
> frames it should certainly give an error rather than the wrong answer, but
> if we are seriously trying to avoid delays around 0.1 seconds then going
> through the generic function mechanism may be a problem.

If it's not dataframes, it will be something else. I think it's highly desirable that any( == anyNA(x) within base packages, and we should make it straightforward to maintain this identity in contributed packages.

By the way, I think Bill's suggestion of calling it anyMissing makes a lot of sense.

Duncan mailing list Received on Tue Apr 11 11:29:31 2006

This archive was generated by hypermail 2.1.8 : Tue 11 Apr 2006 - 16:17:00 GMT