Re: [Rd] bug in sum() on integer vector

From: Duncan Murdoch <murdoch.duncan_at_gmail.com>
Date: Sat, 10 Dec 2011 08:27:42 -0500

On 11-12-09 4:41 PM, Hervé Pagès wrote:
> Hi Duncan,
>
> On 11-12-09 11:39 AM, Duncan Murdoch wrote:
>> On 09/12/2011 1:40 PM, Hervé Pagès wrote:
>>> Hi,
>>>
>>> x<- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
>>>
>>> This is correct:
>>>
>>>> sum(as.double(x))
>>> [1] 0
>>>
>>> This is not:
>>>
>>>> sum(x)
>>> [1] 4996000
>>>
>>> Returning NA (with a warning) would also be acceptable for the latter.
>>> That would make it consistent with cumsum(x):
>>>
>>>> cumsum(x)[length(x)]
>>> [1] NA
>>> Warning message:
>>> Integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'
>>
>> This is a 64 bit problem; in 32 bits things work out properly.
>> I'd guess
>> in 64 bit arithmetic we or the run-time are doing something to simulate
>> 32 bit arithmetic (since integers are 32 bits), but it looks as though
>> we're not quite getting it right.
>
> It doesn't work properly for me on Leopard (32-bit mode):
>
> > x<- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
> > sum(as.double(x))
> [1] 0
> > sum(x)
> [1] 4996000
> > sessionInfo()
> R version 2.14.0 RC (2011-10-27 r57452)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> It looks like the problem is that isum() (in src/main/summary.c)
> uses a 'double' internally to do the sum, whereas rsum() and csum()
> use a 'long double'.

A double has 53 bits to store the mantissa, so any 32 bit integer can be stored exactly.

>
> Note that isum() seems to be assuming that NA_INTEGER and NA_LOGICAL
> will always be the same (probably fine) and that TRUE values in the
> input vector are always represented as a 1 (not so sure about this one).
>
> A more fundamental question: is switching back and forth between
> 'int' and 'double' (or 'long double') the right thing to do for doing
> "safe" arithmetic on integers?

If you have enough terms in the sum that an intermediate value exceeds 53 bits in length, then you'll get the wrong answer, because the intermediate sum can't be stored exactly. That happens in your example. On the 32 bit platform I tested (Windows 32 bit), intermediate values are stored in registers with 64 bit precision, which is probably why Windows 32 bit gets it right, but various other platforms don't.

On your fundamental question: I think the answer is that R is doing the right thing. R doesn't think of an integer as a particular representation, it thinks of it as a number. So if you ask for the sum of those numbers, R should return its best approximation to that sum, and it does.

A different approach would be to do the sum in 32 bit registers and detect 32 bit overflow in intermediate results. But that's a very hardware-oriented approach, rather than a mathematical approach.

Duncan Murdoch

> Thanks!
> H.
>
>
>>
>> Duncan Murdoch
>>
>>> Thanks!
>>> H.
>>>
>>>> sessionInfo()
>>> R version 2.14.0 (2011-10-31)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>>> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
>>> [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
>>> [7] LC_PAPER=C LC_NAME=C
>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>
>
>



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sat 10 Dec 2011 - 13:37:25 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 14 Dec 2011 - 00:20:18 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive