Re: [Rd] seek() and gzfile() on 32-bit R2.12.0 in linux

From: Matt Shotwell <shotwelm_at_musc.edu>
Date: Tue, 22 Jun 2010 14:44:51 -0400

You opened "ex.gz" with file(), which ought to work, but it relies on do_url to detect automatically that the file is gzip-compressed. It's a long shot, but you could verify that the file is a valid gzip file (R checks that the first two bytes are "\x1f\x8b"), then try gzfile() on the 32-bit machine and see what happens. It would also help to see the output of sessionInfo(), so that others can try to reproduce your finding.
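For what it's worth, those checks could look something like the sketch below. It uses only base R. The temporary path and the variable names are my own choices, not part of the original example, and readBin() is given the file name so that it reads the raw bytes without decompressing them.

```r
# Recreate the small compressed file from the connections examples.
# 'path' is a name chosen for this sketch; any writable location would do.
path <- file.path(tempdir(), "ex.gz")
zz <- gzfile(path, "w")
cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
close(zz)

# Read the first two bytes as-is (binary mode, no decompression).
magic <- readBin(path, what = "raw", n = 2L)
is_gzip <- identical(magic, as.raw(c(0x1f, 0x8b)))
print(is_gzip)

# Open explicitly with gzfile() instead of relying on file() to detect gzip.
con <- gzfile(path, "r")
pos <- seek(con)   # 0 would match the 64-bit session quoted below
close(con)
print(pos)

# Build details that help others reproduce the problem.
print(.Machine$sizeof.pointer)  # 4 on a 32-bit build, 8 on a 64-bit build
print(sessionInfo())
```

If is_gzip is TRUE and gzfile() still reports a huge position, detection in file() is probably not the culprit.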

This might be a bug in the R source:
(1 - unlikely) The C function do_url (src/main/connections.c) fails to detect the gzip file on the 32-bit machine. Unfortunately, even when do_url does detect a gzip file, the class of the returned connection object is still c("file", "connection") rather than c("gzfile", "connection"), so there's no easy way to check this. Even so, this wouldn't explain why you get 7.80707e+17.

(2 - more likely) The zlib function gztell (declared in src/extra/zlib/zlib.h, defined in src/extra/zlib/gzlib.c) returns z_off_t. The bug may relate to the size of z_off_t on the two machines and/or to the cast from z_off_t to double, which happens just before gzfile_seek (defined in src/main/connections.c) returns the value. What a headache. I would need to reproduce the bug to investigate this further.
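As a rough sanity check on the reported number, here is some plain arithmetic in R. The value is copied from the 32-bit session quoted below and is rounded to six significant digits as printed, so only its magnitude means anything. The variable names are mine. This does not show where the value comes from; it only shows that it cannot be a sensible offset into a file of a few dozen bytes and that it does not fit in 32 bits.

```r
reported <- 7.80707e+17          # as printed by seek() on the 32-bit build

# Larger than anything an unsigned 32-bit quantity can hold ...
print(reported > 2^32)           # TRUE
# ... but still within the range of a signed 64-bit integer.
print(reported < 2^63)           # TRUE

# If the value were a 64-bit integer, its upper 32 bits would be non-zero.
high_word <- floor(reported / 2^32)
print(high_word > 0)             # TRUE

# It is also beyond 2^53, where doubles stop representing every integer
# exactly, so a cast to double could itself drop low-order bits.
print(reported > 2^53)           # TRUE
```

That pattern is at least consistent with a 64-bit quantity whose upper half is not what the caller expects, but that is a guess until someone reproduces the problem.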

I have been wondering why double was used in the prototype for the seek member of (struct Rconn), rather than an integer type. Presumably to solve problems such as this. I'll be very interested to see what the core team has to say here.
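One possible motivation, sketched in R terms: R's integer type is 32-bit, so a position in a file larger than 2 GB does not fit in it, while a double represents every whole number up to 2^53 exactly. This is only my reading, not something the core team has stated, and the offset below is a made-up value.

```r
int_max <- .Machine$integer.max   # 2147483647, i.e. 2^31 - 1
offset <- 3 * 2^30                # hypothetical 3 GiB file position

print(offset > int_max)           # TRUE: too large for an R integer

# Coercion fails rather than wrapping: the result is NA, with a warning.
as_int <- suppressWarnings(as.integer(offset))
print(is.na(as_int))              # TRUE

# A double still holds the position exactly ...
print(offset == 3221225472)       # TRUE
# ... and stays exact for whole numbers up to 2^53.
print((2^53 - 1) + 1 == 2^53)     # TRUE
print(2^53 + 1 == 2^53)           # TRUE: precision is lost past 2^53
```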

-Matt

On Tue, 2010-06-22 at 13:04 -0400, Brandon Whitcher wrote:
> I have installed both 32-bit and 64-bit versions of R2.12.0 (2010-06-15
> r52300) on my Ubuntu 10.04 64-bit system. I observe the following behavior
> when running the examples from base::connections. There appears to be a
> problem with seek() on a .gz file when using a 32-bit installation of
> R2.12.0, but the problem doesn't appear in the 64-bit installation. I
> realize that seek() has been difficult in the past, and I don't want to open
> old wounds, but is this a known problem? Is this easily fixable? I have a
> package that relies on seek() when accessing gzipped files.
>
> Using the 32-bit installation...
>
> > zz <- file("ex.data", "w") # open an output file connection
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > cat("One more line\n", file = zz)
> > close(zz)
> > blah = file("ex.data", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- gzfile("ex.gz", "w") # compressed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.gz", "r")
> > seek(blah)
> [1] 7.80707e+17
> >
> > zz <- bzfile("ex.bz2", "w") # bzip2-ed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.bz2", "r")
> > seek(blah)
> Error in seek.connection(blah) : 'seek' not enabled for this connection
> >
>
> Using the 64-bit installation...
>
> > zz <- file("ex.data", "w") # open an output file connection
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > cat("One more line\n", file = zz)
> > close(zz)
> > blah = file("ex.data", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- gzfile("ex.gz", "w") # compressed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.gz", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- bzfile("ex.bz2", "w") # bzip2-ed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.bz2", "r")
> > seek(blah)
> Error in seek.connection(blah) : 'seek' not enabled for this connection
> >
>
> thanks,
>
> Brandon
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com

Received on Tue 22 Jun 2010 - 18:57:38 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 23 Jun 2010 - 03:51:11 GMT.
