Re: [Rd] Reading 64-bit integers

From: Simon Urbanek <simon.urbanek_at_r-project.org>
Date: Wed, 30 Mar 2011 14:22:53 -0400

Bill,

thanks. I like that idea of the output parameter better, especially if we ever add different scalar vector types. Admittedly, what=integer() is the most useful case. What I was worried about is things like what=double(), output=integer() which could be legal, but are more conveniently dealt with via as.integer(readBin()) instead. I won't have more time today, but I'll have a look tomorrow.

Thanks,
Simon
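
For anyone following the thread, here is a minimal sketch of the "read the pieces, convert to double and add" reconstruction discussed below, extended to the little-endian and signed cases. It is not the proposed output= argument to readBin(); read_int64_as_double() is a hypothetical helper name, and the sketch assumes x is a raw vector whose length is a multiple of 8. It reads unsigned 2-byte words rather than 4-byte ones, because a 4-byte word equal to 0x80000000 would come back from readBin() as NA_integer_.

```r
# Hypothetical helper (not part of base R): read 8-byte integers from a raw
# vector and return them as doubles.
read_int64_as_double <- function(x, endian = c("big", "little"), signed = FALSE) {
  endian <- match.arg(endian)
  stopifnot(is.raw(x), length(x) %% 8L == 0L)
  # Four unsigned 16-bit words per value; readBin() honours signed = FALSE
  # only for sizes 1 and 2.
  w <- readBin(x, "integer", n = length(x) %/% 2L, size = 2L,
               signed = FALSE, endian = endian)
  # One column per 8-byte value, one row per 16-bit word.
  w <- matrix(as.double(w), nrow = 4L)
  # Put the most significant word in row 1.
  if (endian == "little") w <- w[4:1, , drop = FALSE]
  if (signed) {
    # Two's complement: a set top bit means the top word is negative.
    neg <- w[1L, ] >= 2^15
    w[1L, neg] <- w[1L, neg] - 2^16
  }
  # Weighted sum of the four words, computed in double precision.
  as.vector(c(2^48, 2^32, 2^16, 1) %*% w)
}

x <- as.raw(c(0, 0, 0, 1, 0, 0, 0, 0))
read_int64_as_double(x, "big")                            # 4294967296, i.e. 2^32
read_int64_as_double(rep(as.raw(255), 8), signed = TRUE)  # -1
```

The result is exact only for magnitudes below 2^53; larger values are rounded to the nearest representable double.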

On Mar 30, 2011, at 1:38 PM, William Dunlap wrote:

> 

>> -----Original Message-----
>> From: r-devel-bounces@r-project.org
>> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of Simon Urbanek
>> Sent: Tuesday, March 29, 2011 6:49 PM
>> To: Duncan Murdoch
>> Cc: r-devel_at_r-project.org
>> Subject: Re: [Rd] Reading 64-bit integers
>>
>>
>> On Mar 29, 2011, at 8:47 PM, Duncan Murdoch wrote:
>>
>>> On 29/03/2011 7:01 PM, Jon Clayden wrote:
>>>> Dear Simon,
>>>> 
>>>> On 29 March 2011 22:40, Simon Urbanek<simon.urbanek_at_r-project.org> wrote:
>>>>> Jon,
>>>>> 
>>>>> On Mar 29, 2011, at 1:33 PM, Jon Clayden wrote:
>>>>> 
>>>>>> Dear Simon,
>>>>>> 
>>>>>> Thank you for the response.
>>>>>> 
>>>>>> On 29 March 2011 15:06, Simon Urbanek<simon.urbanek_at_r-project.org> wrote:
>>>>>>> 
>>>>>>> On Mar 29, 2011, at 8:46 AM, Jon Clayden wrote:
>>>>>>> 
>>>>>>>> Dear all,
>>>>>>>> 
>>>>>>>> I see from some previous threads that support for 64-bit integers in R
>>>>>>>> may be an aim for future versions, but in the meantime I'm wondering
>>>>>>>> whether it is possible to read in integers of greater than 32 bits at
>>>>>>>> all. Judging from ?readBin, it should be possible to read 8-byte
>>>>>>>> integers to some degree, but it is clearly limited in practice by R's
>>>>>>>> internally 32-bit integer type:
>>>>>>>> 
>>>>>>>>> x<- as.raw(c(0,0,0,0,1,0,0,0))
>>>>>>>>> (readBin(x,"integer",n=1,size=8,signed=F,endian="big"))
>>>>>>>> [1] 16777216
>>>>>>>>> x<- as.raw(c(0,0,0,1,0,0,0,0))
>>>>>>>>> (readBin(x,"integer",n=1,size=8,signed=F,endian="big"))
>>>>>>>> [1] 0
>>>>>>>> 
>>>>>>>> For values that fit into 32 bits it works fine, but for larger values
>>>>>>>> it fails. (I'm a bit surprised by the zero - should the value not be
>>>>>>>> NA if it is out of range?
>>>>>>> 
>>>>>>> No, it's not out of range - int is only 4 bytes so only 4 first
>>>>>>> bytes (respecting endianness order, hence LSB) are used.
>>>>>> 
>>>>>> The fact remains that I ask for the value of an 8-byte integer and
>>>>>> don't get it.
>>>>> 
>>>>> I think you're misinterpreting the documentation:
>>>>> 
>>>>>    If 'size' is specified and not the natural size of the object,
>>>>>    each element of the vector is coerced to an appropriate type
>>>>>    before being written or as it is read.
>>>>> 
>>>>> The "integer" object type is defined as signed 32-bit in R, so if
>>>>> you ask for "8 bytes into object type integer", you get a coercion
>>>>> into that object type -- 32-bit signed integer -- as documented. I
>>>>> think the issue may come from the confusion of the object type
>>>>> "integer" with general "integer number" in mathematical sense that
>>>>> has no representation restrictions. (FWIW in C the "integer" type is
>>>>> "int" and it is 32-bit on all modern OSes regardless of platform -
>>>>> that's where the limitation comes from, it's not something R has
>>>>> made up).
>>>> 
>>>> OK, but it still seems like there is a case for raising a warning. As
>>>> it is there is no way to tell when reading an 8-byte integer from a
>>>> file whether its value is really 0, or if it merely has 0 in its
>>>> least-significant 4 bytes. If 99% of such stored numbers are below
>>>> 2^31, one is going to need some extra logic to catch the other 1%
>>>> where you (silently) get the wrong value. In essence, unless you're
>>>> certain that you will never come across a number that actually uses
>>>> the upper 4 bytes, you will always have to read it as two 4-byte
>>>> numbers and check that the high-order one (which is endianness
>>>> dependent, of course) is zero. A C-level sanity check seems more
>>>> efficient and more helpful to me.
>>> 
>>> Seems to me that the S-PLUS solution (output="double") would be a
>>> lot more useful. I'd commit that if you write it; I don't think I'd
>>> commit the warning.
>>> 

>>
>> I was going to write something similar (idea = good, patch
>> welcome ;)). My only worry is that the "output" argument is a
>> bit misleading in that one could expect to use any
>> combination of "input"/"output" which may be a maintenance
>> nightmare. If I understand it correctly it's only a special
>> case for integer input. I don't have S+ so can't say how they
>> deal with that.
> 
> In S+'s readBin the output argument can be
> only double() or single() when what is double()
> or single() (S+ still has a real single
> precision storage mode) and can be any
> numeric type or logical when what is integer().
> 
> The output=double() seemed like the only useful case.
> 
> It does not warn when precision is lost in the 8-byte
> integer to double conversion.  Perhaps it should.
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com  
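
On the missing precision warning mentioned above, a minimal sketch of what such a check might look like at R level. warn_if_inexact() is a hypothetical name, not an existing readBin() feature, and the sketch assumes the values have already been converted to double. Because the original 8-byte integers are no longer available at that point, it can only flag values at or beyond 2^53, where consecutive integers stop being representable; it cannot say whether a particular value was actually rounded.

```r
# Hypothetical check (not part of readBin()): warn when values converted from
# 8-byte integers may no longer be exact as doubles.
warn_if_inexact <- function(v) {
  # A double has a 53-bit significand, so not every integer with magnitude
  # of 2^53 or more is representable.
  suspect <- abs(v) >= 2^53
  n <- sum(suspect, na.rm = TRUE)
  if (n > 0L)
    warning(n, " value(s) have magnitude >= 2^53 and may have lost precision",
            call. = FALSE)
  invisible(v)
}

2^53 + 1 == 2^53              # TRUE: 2^53 + 1 is rounded to 2^53
warn_if_inexact(c(42, 2^60))  # warns about one value
```

Whether a warning is the right default is the open question in this thread; the sketch only shows that the check itself is cheap.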
> 

>>
>> Cheers,
>> Simon
>>
>>
>>> 
>>>> 
>>>>>> Pretending that it's really only four bytes because of
>>>>>> the limits of R's integer type isn't all that helpful. Perhaps a
>>>>>> warning should be put out if the cast will affect the value of the
>>>>>> result? It looks like the relevant lines in src/main/connections.c are
>>>>>> 3689-3697 in the current alpha:
>>>>>> 
>>>>>> #if SIZEOF_LONG == 8
>>>>>>                  case sizeof(long):
>>>>>>                      INTEGER(ans)[i] = (int)*((long *)buf);
>>>>>>                      break;
>>>>>> #elif SIZEOF_LONG_LONG == 8
>>>>>>                  case sizeof(_lli_t):
>>>>>>                      INTEGER(ans)[i] = (int)*((_lli_t *)buf);
>>>>>>                      break;
>>>>>> #endif
>>>>>> 
>>>>>>>> ) The value can be represented as a double,
>>>>>>>> though:
>>>>>>>> 
>>>>>>>>> 4294967296
>>>>>>>> [1] 4294967296
>>>>>>>> 
>>>>>>>> I wouldn't expect readBin() to return a double if an integer was
>>>>>>>> requested, but is there any way to get the correct value out of it?
>>>>>>> 
>>>>>>> Trivially (for your unsigned big-endian case):
>>>>>>> 
>>>>>>> y<- readBin(x, "integer", n=length(x)/4L, endian="big")
>>>>>>> y<- ifelse(y<  0, 2^32 + y, y)
>>>>>>> i<- seq(1,length(y),2)
>>>>>>> y<- y[i] * 2^32 + y[i + 1L]
>>>>>> 
>>>>>> Thanks for the code, but I'm not sure I would call that trivial,
>>>>>> especially if one needs to cater for little endian and signed cases as
>>>>>> well!
>>>>> 
>>>>> I was saying for your case and it's trivial as in read as
>>>>> integers, convert to double precision and add.
>>>>> 
>>>>> 
>>>>>> This is what I meant by reconstructing the number manually...
>>>>>> 
>>>>> 
>>>>> You didn't say so - you were talking about reconstructing it from
>>>>> a raw vector which seems a lot more painful since you can't compute
>>>>> with enough precision on raw vectors.
>>>> 
>>>> True - I should have been more specific. Sorry.
>>>> 
>>>> Jon
>>>> 
>>>> ______________________________________________
>>>> R-devel_at_r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>> 
>>> 


>
>

Received on Wed 30 Mar 2011 - 18:47:23 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 13 Apr 2011 - 18:00:48 GMT.
