Re: [Rd] bug in sum() on integer vector

From: peter dalgaard <pdalgd_at_gmail.com>
Date: Thu, 15 Dec 2011 11:40:23 +0100

On Dec 15, 2011, at 02:51 , Hervé Pagès wrote:

> Hi Peter,
>
> On 11-12-14 08:19 AM, peter dalgaard wrote:

>> 
>> On Dec 14, 2011, at 16:19 , John C Nash wrote:
>> 
>>> 
>>> Following this thread, I wondered why nobody tried cumsum to see where the integer
>>> overflow occurs. On the shorter xx vector in the little script below I get a message:
>>> 
>>> Warning message:
>>> Integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'
>>>> 
>>> 
>>> But sum() does not give such a warning, which I believe is the point of contention. Since
>>> cumsum() does manage to give such a warning, and show where the overflow occurs, should
>>> sum() not be able to do so? For the record, I don't class the non-zero answer as an error
>>> in itself. I regard the failure to warn as the issue.
>> 
>> It (sum) does warn if you take the two "halves" separately. The issue is that the overflow is detected at the end of the summation, when the result is to be saved to an integer (which of course happens for all intermediate sums in cumsum).
>> 
>>> x<- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
>>> sum(x[1:10000000])
>> [1] NA
>> Warning message:
>> In sum(x[1:1e+07]) : Integer overflow - use sum(as.numeric(.))
>>> sum(x[10000001:25000000])
>> [1] NA
>> Warning message:
>> In sum(x[10000001:2.5e+07]) : Integer overflow - use sum(as.numeric(.))
>>> sum(x)
>> [1] 4996000
>> 
>> There's a pretty easy fix, essentially to move
>> 
>>     if(s > INT_MAX || s < R_INT_MIN){
>>         warningcall(call, _("Integer overflow - use sum(as.numeric(.))"));
>>         *value = NA_INTEGER;
>>     }
>> 
>> inside the summation loop. Obviously, there's a speed penalty from two FP comparisons per element, but I wouldn't know whether it matters in practice for anyone.
>> 

>
> Since you want to generate this warning once only, your test (now
> inside the loop) needs to be something like:
>
> if (warn && (s > INT_MAX || s < R_INT_MIN)) {
>     /* generate the warning */
>     warn = 0;
> }
>
> with 'warn' initialized to 1. This makes the isum() function almost
> twice as slow on my machine (64-bit Ubuntu) when compiling with
> gcc -O2 and when no overflow occurs (the most common use case I guess).
>
> Why not just do the sum in a long double instead of a double?
> It slows down isum() by only 8% on my machine when compiling
> with gcc -O2.
> But most importantly this solution also has the advantage of making
> sum(x) consistent with sum(as.double(x)). The latter uses rsum() which
> does the sum in a long double. So by using a long double in both isum()
> and rsum(), consistency between sum(x) and sum(as.double(x)) is
> guaranteed.

Hum, yes. Also the test would be overly cautious: The real thing to test is whether we overrun the range in which integers are exactly representable in FP, i.e. roughly +/-2^53 for doubles (53-bit significand), not the +/-2^31 that fits 32 bit integers. Or +/-2^64 if we have x87-style long doubles (64-bit significand).
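
To make the representability point concrete, here is a minimal C sketch. The function names are made up for illustration and are not part of R's sources. It assumes an IEEE 754 double, where DBL_MANT_DIG is 53, so every integer of magnitude below 2^53 is stored exactly and adding 1 at 2^53 is lost to rounding.

```c
#include <float.h>

/* Hypothetical helper: returns 2^DBL_MANT_DIG. Every integer whose
   magnitude is below this bound is exactly representable in a double
   (2^53 on IEEE 754 platforms). */
double exact_int_bound(void)
{
    double bound = 1.0;
    for (int i = 0; i < DBL_MANT_DIG; i++)
        bound *= 2.0;
    return bound;
}

/* Hypothetical helper: returns 1 if x + 1.0, once stored as a double,
   differs from x, and 0 if the increment was lost to rounding.
   The volatile forces the result to be rounded to double precision
   even where intermediate results are kept in wider registers. */
int increment_is_visible(double x)
{
    volatile double y = x + 1.0;
    return y != x;
}
```

A running sum kept in a double stays exact as long as its magnitude remains strictly below exact_int_bound() after each addition, which is a much wider window than the +/-2^31 range of the result type.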

However, we still need to decide whether the issue is that sum(as.double(x)) can be inconsistent with sum(x), or whether it is that integer arithmetic can be inexact. Also, the timings should really be viewed in context: Does _any_ actual code use isum to an extent where halving its speed would have any noticeable impact?
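
For reference, the three variants under discussion can be sketched roughly as below. This is a simplified illustration, not the code in R's summary.c: the function names and the SKETCH_* macros are invented here, NA handling of the inputs is omitted, and the warning is replaced by a return code. The in-loop variant returns at the first out-of-range partial sum instead of carrying a 'warn' flag, which is one possible design choice among several.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins: R represents NA_INTEGER as INT_MIN, so the
   smallest valid integer is INT_MIN + 1. */
#define SKETCH_NA_INTEGER INT_MIN
#define SKETCH_INT_MIN    (INT_MIN + 1)

/* Range check only after the loop. Returns 1 and stores the sum on
   success; returns 0 and stores NA if the final sum is out of range.
   Partial sums that leave the int range and come back go unnoticed. */
int isum_check_at_end(const int *x, size_t n, int *value)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    if (s > INT_MAX || s < SKETCH_INT_MIN) {
        *value = SKETCH_NA_INTEGER;
        return 0;
    }
    *value = (int) s;
    return 1;
}

/* Range check moved inside the loop: two FP comparisons per element,
   but any partial sum outside the int range is reported. */
int isum_check_in_loop(const int *x, size_t n, int *value)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
        if (s > INT_MAX || s < SKETCH_INT_MIN) {
            *value = SKETCH_NA_INTEGER;
            return 0;
        }
    }
    *value = (int) s;
    return 1;
}

/* long double accumulator with the check at the end. How much extra
   exact range this buys depends on the platform's long double. */
int isum_long_double(const int *x, size_t n, int *value)
{
    long double s = 0.0L;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    if (s > INT_MAX || s < SKETCH_INT_MIN) {
        *value = SKETCH_NA_INTEGER;
        return 0;
    }
    *value = (int) s;
    return 1;
}
```

With the four-element input {INT_MAX, INT_MAX, -INT_MAX, -INT_MAX}, the end-of-loop variants return 0 without complaint, while the in-loop variant reports overflow at the second element. That is the small-scale analogue of sum() warning on each half of x but not on the whole vector.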

We probably shouldn't touch this for 2.14.1, then.

> Maybe that still doesn't give you the guarantee that sum(x) will always
> return the correct value (when it does not return NA) because that
> depends now on the ability of long double to represent exactly the sum
> of at most INT_MAX arbitrary ints. The number of bits used for long double
> seems to vary a lot across platforms/compilers so it's hard to tell.
> Not an ideal solution, but at least it makes isum() more accurate than
> the current isum() and it makes sum(x) consistent with sum(as.double(x))
> on all platforms, without degrading performance too much.
>
> Cheers,
> H.
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages_at_fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes_at_cbs.dk  Priv: PDalgd_at_gmail.com

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu 15 Dec 2011 - 10:43:23 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 15 Dec 2011 - 13:10:17 GMT.