Re: [Rd] Reading 64-bit integers

From: Henrik Bengtsson <hb_at_biostat.ucsf.edu>
Date: Wed, 30 Mar 2011 20:50:32 -0700

On Wed, Mar 30, 2011 at 7:51 PM, Henrik Bengtsson <hb_at_biostat.ucsf.edu> wrote:
> On Wed, Mar 30, 2011 at 11:22 AM, Simon Urbanek
> <simon.urbanek_at_r-project.org> wrote:
>> Bill,
>>
>> thanks. I like that idea of the output parameter better, especially if we ever add different scalar vector types. Admittedly, what=integer() is the most useful case. What I was worried about is things like what=double(), output=integer() which could be legal, but are more conveniently dealt with via as.integer(readBin()) instead.
>
> What about this:
>
> Let the default be output=what.  Then, on entry to the function, just
> throw an error for unsupported combinations of 'what' and 'output'.
> Something like (assuming 'what' and 'output' have already been
> converted to "type" strings):
>
> # Validate argument 'output':
> if (output != what) {
>  # In most cases, we never get here.
>  also <- list(integer="double")[[what]];
>  if (is.null(also) || !is.element(output, also)) {

if (!is.element(output, also)) {

should be enough.

/H

>    # Throw an informative error message
>    stop("Unsupported value of argument 'output' (\"", output, "\"). ",
>         "Supported output types when reading \"", what, "\" values: ",
>         paste(c(what, also), collapse=", "));
>  }
> }
>
> That should prevent any unintended usage (before wasting time with
> I/O).  It also allows for future extension.
>
> Thxs
>
> /Henrik
>
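For concreteness, the check quoted above, with the simplified condition, could be wrapped up roughly as follows. This is only a sketch: the function name validateOutput() and the table of allowed conversions are made up for illustration and are not part of readBin().

```r
# Hypothetical helper; the name and the table of allowed conversions are
# assumptions for illustration, not part of readBin().
validateOutput <- function(what, output = what) {
  if (output != what) {
    # Extra output types allowed for each 'what' (NULL when there are none)
    also <- list(integer = "double")[[what]]
    # is.element(output, NULL) is FALSE, so no separate is.null() test is needed
    if (!is.element(output, also)) {
      stop("Unsupported value of argument 'output' (\"", output, "\"). ",
           "Supported output types when reading \"", what, "\" values: ",
           paste(c(what, also), collapse = ", "))
    }
  }
  invisible(output)
}

validateOutput("integer", "double")    # accepted
# validateOutput("double", "integer")  # would stop() with the message above
```

Because the check runs before any I/O, an unsupported combination fails immediately, and supporting a new conversion only means extending the list.
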
>> I won't have more time today, but I'll have a look tomorrow.
>>
>> Thanks,
>> Simon
>>
>>
>> On Mar 30, 2011, at 1:38 PM, William Dunlap wrote:
>>
>>>
>>>> -----Original Message-----
>>>> From: r-devel-bounces_at_r-project.org
>>>> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of Simon Urbanek
>>>> Sent: Tuesday, March 29, 2011 6:49 PM
>>>> To: Duncan Murdoch
>>>> Cc: r-devel_at_r-project.org
>>>> Subject: Re: [Rd] Reading 64-bit integers
>>>>
>>>>
>>>> On Mar 29, 2011, at 8:47 PM, Duncan Murdoch wrote:
>>>>
>>>>> On 29/03/2011 7:01 PM, Jon Clayden wrote:
>>>>>> Dear Simon,
>>>>>>
>>>>>> On 29 March 2011 22:40, Simon Urbanek <simon.urbanek_at_r-project.org> wrote:
>>>>>>> Jon,
>>>>>>>
>>>>>>> On Mar 29, 2011, at 1:33 PM, Jon Clayden wrote:
>>>>>>>
>>>>>>>> Dear Simon,
>>>>>>>>
>>>>>>>> Thank you for the response.
>>>>>>>>
>>>>>>>> On 29 March 2011 15:06, Simon Urbanek <simon.urbanek_at_r-project.org> wrote:
>>>>>>>>>
>>>>>>>>> On Mar 29, 2011, at 8:46 AM, Jon Clayden wrote:
>>>>>>>>>
>>>>>>>>>> Dear all,
>>>>>>>>>>
>>>>>>>>>> I see from some previous threads that support for 64-bit integers in R
>>>>>>>>>> may be an aim for future versions, but in the meantime I'm wondering
>>>>>>>>>> whether it is possible to read in integers of greater than 32 bits at
>>>>>>>>>> all. Judging from ?readBin, it should be possible to read 8-byte
>>>>>>>>>> integers to some degree, but it is clearly limited in practice by R's
>>>>>>>>>> internally 32-bit integer type:
>>>>>>>>>>
>>>>>>>>>> > x <- as.raw(c(0,0,0,0,1,0,0,0))
>>>>>>>>>> > (readBin(x,"integer",n=1,size=8,signed=F,endian="big"))
>>>>>>>>>> [1] 16777216
>>>>>>>>>> > x <- as.raw(c(0,0,0,1,0,0,0,0))
>>>>>>>>>> > (readBin(x,"integer",n=1,size=8,signed=F,endian="big"))
>>>>>>>>>> [1] 0
>>>>>>>>>>
>>>>>>>>>> For values that fit into 32 bits it works fine, but for larger values
>>>>>>>>>> it fails. (I'm a bit surprised by the zero - should the value not be
>>>>>>>>>> NA if it is out of range?
>>>>>>>>>
>>>>>>>>> No, it's not out of range - int is only 4 bytes so only the 4 least-significant bytes (taking endianness into account) are used.
>>>>>>>>
>>>>>>>> The fact remains that I ask for the value of an 8-byte integer and
>>>>>>>> don't get it.
>>>>>>>
>>>>>>> I think you're misinterpreting the documentation:
>>>>>>>
>>>>>>>    If 'size' is specified and not the natural size of the object,
>>>>>>>    each element of the vector is coerced to an appropriate type
>>>>>>>    before being written or as it is read.
>>>>>>>
>>>>>>> The "integer" object type is defined as signed 32-bit in R, so if you ask for "8 bytes into object type integer", you get a coercion into that object type -- 32-bit signed integer -- as documented. I think the issue may come from the confusion of the object type "integer" with general "integer number" in mathematical sense that has no representation restrictions. (FWIW in C the "integer" type is "int" and it is 32-bit on all modern OSes regardless of platform - that's where the limitation comes from, it's not something R has made up).
>>>>>>
>>>>>> OK, but it still seems like there is a case for raising a warning. As
>>>>>> it is there is no way to tell when reading an 8-byte integer from a
>>>>>> file whether its value is really 0, or if it merely has 0 in its
>>>>>> least-significant 4 bytes. If 99% of such stored numbers are below
>>>>>> 2^31, one is going to need some extra logic to catch the other 1%
>>>>>> where you (silently) get the wrong value. In essence, unless you're
>>>>>> certain that you will never come across a number that actually uses
>>>>>> the upper 4 bytes, you will always have to read it as two 4-byte
>>>>>> numbers and check that the high-order one (which is endianness
>>>>>> dependent, of course) is zero. A C-level sanity check seems more
>>>>>> efficient and more helpful to me.
>>>>>
>>>>> Seems to me that the S-PLUS solution (output="double") would be a lot more useful.  I'd commit that if you write it; I don't think I'd commit the warning.
>>>>>
>>>>
>>>> I was going to write something similar (idea = good, patch
>>>> welcome ;)). My only worry is that the "output" argument is a
>>>> bit misleading in that one could expect to use any
>>>> combination of "input"/"output" which may be a maintenance
>>>> nightmare. If I understand it correctly it's only a special
>>>> case for integer input. I don't have S+ so can't say how they
>>>> deal with that.
>>>
>>> In S+'s readBin the output argument can be
>>> only double() or single() when what is double()
>>> or single() (S+ still  has a real single
>>> precision storage mode) and can be any
>>> numeric type or logical when what is integer().
>>>
>>> The output=double() seemed like the only useful case.
>>>
>>> It does not warn when precision is lost in the 8-byte
>>> integer to double conversion.  Perhaps it should.
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>>
>>>> Cheers,
>>>> Simon
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>>> Pretending that it's really only four bytes because of
>>>>>>>> the limits of R's integer type isn't all that helpful. Perhaps a
>>>>>>>> warning should be put out if the cast will affect the value of the
>>>>>>>> result? It looks like the relevant lines in src/main/connections.c are
>>>>>>>> 3689-3697 in the current alpha:
>>>>>>>>
>>>>>>>> #if SIZEOF_LONG == 8
>>>>>>>>                  case sizeof(long):
>>>>>>>>                      INTEGER(ans)[i] = (int)*((long *)buf);
>>>>>>>>                      break;
>>>>>>>> #elif SIZEOF_LONG_LONG == 8
>>>>>>>>                  case sizeof(_lli_t):
>>>>>>>>                      INTEGER(ans)[i] = (int)*((_lli_t *)buf);
>>>>>>>>                      break;
>>>>>>>> #endif
>>>>>>>>
>>>>>>>>>> ) The value can be represented as a double,
>>>>>>>>>> though:
>>>>>>>>>>
>>>>>>>>>> > 4294967296
>>>>>>>>>> [1] 4294967296
>>>>>>>>>>
>>>>>>>>>> I wouldn't expect readBin() to return a double if an integer was
>>>>>>>>>> requested, but is there any way to get the correct value out of it?
>>>>>>>>>
>>>>>>>>> Trivially (for your unsigned big-endian case):
>>>>>>>>>
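>>>>>>>>> # read the data as signed 4-byte integers
>>>>>>>>> y <- readBin(x, "integer", n=length(x)/4L, endian="big")
>>>>>>>>> # map negative values to their unsigned equivalents (as doubles)
>>>>>>>>> y <- ifelse(y < 0, 2^32 + y, y)
>>>>>>>>> # combine each (high, low) pair into one double
>>>>>>>>> i <- seq(1, length(y), 2)
>>>>>>>>> y <- y[i] * 2^32 + y[i + 1L]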
>>>>>>>>
>>>>>>>> Thanks for the code, but I'm not sure I would call that trivial,
>>>>>>>> especially if one needs to cater for little endian and signed cases as
>>>>>>>> well!
>>>>>>>
>>>>>>> I was saying for your case and it's trivial as in read as integers, convert to double precision and add.
>>>>>>>
>>>>>>>
>>>>>>>> This is what I meant by reconstructing the number manually...
>>>>>>>>
>>>>>>>
>>>>>>> You didn't say so - you were talking about reconstructing it from a raw vector which seems a lot more painful since you can't compute with enough precision on raw vectors.
>>>>>>
>>>>>> True - I should have been more specific. Sorry.
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-devel_at_r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
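As a stopgap until something like output="double" exists, Simon's recipe quoted above can be wrapped in a small helper. What follows is a rough sketch under a few assumptions: the values are unsigned 64-bit integers, the name readUInt64AsDouble() is hypothetical (nothing like it exists in base R), and values above 2^53 cannot be represented exactly as doubles. One extra detail compared with the quoted code: a 4-byte half with bit pattern 0x80000000 is read as NA_integer_, so it is handled explicitly.

```r
# Hypothetical helper (not part of base R): read unsigned 64-bit integers
# from a raw vector or binary connection and return them as doubles.
# Values above 2^53 are not guaranteed to be exact.
readUInt64AsDouble <- function(con, n = 1L, endian = .Platform$endian) {
  # Read each 8-byte value as two signed 4-byte integers
  y <- readBin(con, what = "integer", n = 2L * n, size = 4L, endian = endian)
  y <- as.double(y)
  # The bit pattern 0x80000000 is NA_integer_ in R; it stands for -2^31
  y[is.na(y)] <- -2^31
  # Reinterpret the signed halves as unsigned
  neg <- y < 0
  y[neg] <- y[neg] + 2^32
  # Index of the first half of each complete pair
  i <- 2L * seq_len(length(y) %/% 2L) - 1L
  if (endian == "big") {
    hi <- y[i]; lo <- y[i + 1L]   # most significant half comes first
  } else {
    lo <- y[i]; hi <- y[i + 1L]   # least significant half comes first
  }
  hi * 2^32 + lo
}

x <- as.raw(c(0,0,0,1,0,0,0,0))
readUInt64AsDouble(x, endian = "big")
# [1] 4294967296
```

Signed values would need the high half left as signed; that case is not covered here.
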



Received on Thu 31 Mar 2011 - 03:59:12 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 31 Mar 2011 - 15:40:38 GMT.
