Re: [Rd] bug in sum() on integer vector

From: Ted Harding <ted.harding_at_wlandres.net>
Date: Wed, 14 Dec 2011 00:52:53 +0000 (GMT)


[See at end]

On 13-Dec-11 23:41:12, Hervé Pagès wrote:
> Hi Duncan,
>
> On 11-12-10 05:27 AM, Duncan Murdoch wrote:

>> On 11-12-09 4:41 PM, Hervé Pagès wrote:
>>> Hi Duncan,
>>>
>>> On 11-12-09 11:39 AM, Duncan Murdoch wrote:
>>>> On 09/12/2011 1:40 PM, Hervé Pagès wrote:
>>>>> Hi,
>>>>>
>>>>> x<- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
>>>>>
>>>>> This is correct:
>>>>>
>>>>>> sum(as.double(x))
>>>>> [1] 0
>>>>>
>>>>> This is not:
>>>>>
>>>>>> sum(x)
>>>>> [1] 4996000
>>>>>
>>>>> Returning NA (with a warning) would also be acceptable for the
>>>>> latter.
>>>>> That would make it consistent with cumsum(x):
>>>>>
>>>>>> cumsum(x)[length(x)]
>>>>> [1] NA
>>>>> Warning message:
>>>>> Integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'
>>>>
>>>> This is a 64 bit problem; in 32 bits things work out properly.
>>>> I'd guess
>>>> in 64 bit arithmetic we or the run-time are doing something to
>>>> simulate
>>>> 32 bit arithmetic (since integers are 32 bits), but it looks as
>>>> though
>>>> we're not quite getting it right.
>>>
>>> It doesn't work properly for me on Leopard (32-bit mode):
>>>
>>> > x<- c(rep(1800000003L, 10000000), -rep(1200000002L, 15000000))
>>> > sum(as.double(x))
>>> [1] 0
>>> > sum(x)
>>> [1] 4996000
>>> > sessionInfo()
>>> R version 2.14.0 RC (2011-10-27 r57452)
>>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>>
>>> locale:
>>> [1] C
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> It looks like the problem is that isum() (in src/main/summary.c)
>>> uses a 'double' internally to do the sum, whereas rsum() and csum()
>>> use a 'long double'.
>>
>> A double has 53 bits to store the mantissa, so any 32 bit integer can
>> be
>> stored exactly.
>>
>>>
>>> Note that isum() seems to be assuming that NA_INTEGER and NA_LOGICAL
>>> will always be the same (probably fine) and that TRUE values in the
>>> input vector are always represented as a 1 (not so sure about this
>>> one).
>>>
>>> A more fundamental question: is switching back and forth between
>>> 'int' and 'double' (or 'long double') the right thing to do for doing
>>> "safe" arithmetic on integers?
>>
>> If you have enough terms in the sum that an intermediate value exceeds
>> 53 bits in length, then you'll get the wrong answer, because the
>> intermediate sum can't be stored exactly. That happens in your
>> example.
>> On the 32 bit platform I tested (Windows 32 bit), intermediate values
>> are stored in registers with 64 bit precision, which is probably why
>> Windows 32 bit gets it right, but various other platforms don't.
>>
>> On your fundamental question: I think the answer is that R is doing
>> the
>> right thing. R doesn't think of an integer as a particular
>> representation, it thinks of it as a number. So if you ask for the sum
>> of those numbers, R should return its best approximation to that sum,
>> and it does.

>
> It does, really? Seems like returning 0 would be a better approximation
> ;-) And with the argument that "R doesn't think of an integer as a
> particular representation" then there is no reason why sum(x)
> would get it wrong and sum(as.double(x)) would get it right. Also why
> bother having an integer type in R?
>
> Seriously, I completely disagree with your view (hopefully it's only
> yours, and not an R "feature") that it's ok for integer arithmetic to
> return an approximation. It should always return the correct value or
> fail. This is one of the reasons why programmers use integers and not
> floating point numbers (memory usage being another one). Integers are
> used for indexing elements in an array or for shifting pointers at the
> C-level. The idea that integer arithmetic can be approximate is scary.
>
> Cheers,
> H.
> Hervé Pagès
> [...]

The approximation is inevitable, even for integers, if the integers are large enough.

The number of particles in the Universe is believed to be around 10^80 (estimates vary from 10^72 to 10^87). If we could use each particle as a binary element in storage (e.g. using positive and negative spin as 1 or 0) then the largest integer that could be stored in the entire Universe would be about 2^(10^80). A huge number, of course, but it is the limit of what is possible.

Now, to do arithmetic with integers, you need to store two ort more integers, thus at least halving the power of 2. Then you need a computer to do the computation, and you won't have room for that in the Universe unless you cut down on the largest integer.

So, in real life (whatever Mathematics may say), you can't expect arbitrary integer arithmetic.

Now, computer programs for numerical computation can broadly be divided into two types.

In one, "arbitrary precision" is available: you can tell the program how many decimal digits you want it to work to. An example of this is 'bc':

  http://en.wikipedia.org/wiki/Bc_programming_language

You can set as many decimal ditgits as you like, *provided* they fall within the storage capacity of your computer, for which an upper bound is the storage capacity of the Universe (see above). For integers and results which surpass the decimal places you have set, the result will be an approximation. Inevitably.

In the other type, the program is written so as to embody integers to a fixed maximum number of decimal (or binary) digits. An example of this is R (and most other numerical programs). This may be 32 bits or 64 bits. Any result ot computation which involve smore than this numer of bits is inevitably an approximation.

Provided the user is aware of this, there is no need for your "It should always return the correct value or fail." It will return the correct value if the integers are not too large; otherwise it will retuirn the best approximation that it can cope with in the fixed finite storage space for which it has been programmed.

There is an implcit element of the arbitrary in this. You can install 32-bit R on a 64-bit-capable machine, or a 64-bit version. You could re-program R so that it can work to, say, 128 bits or 256 bits even on a 32-bit machine (using techniques like those that underlie 'bc'), but that would be an arbitrary choice. However, the essential point is that some choice is unavoidable, since if you push it too far the Universe will run out of particles -- and the computer industry will run out of transistors long before you hit the Universe limit!

So you just have to accept the limits. Provided you are aware of the approximations which may set in at some point, you can cope with the consequences, so long as you take account of some concept of "adequacy" in the inevitable approximations. Simply to "fail" is far too unsophisticated a result!

Hoping this is useful,
Ted.



E-Mail: (Ted Harding) <ted.harding_at_wlandres.net> Fax-to-email: +44 (0)870 094 0861
Date: 14-Dec-11                                       Time: 00:52:49
------------------------------ XFMail ------------------------------

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Wed 14 Dec 2011 - 00:55:50 GMT

This quarter's messages: by month, or sorted: [ by date ] [ by thread ] [ by subject ] [ by author ]

All messages

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 14 Dec 2011 - 02:00:17 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive