Re: [Rd] seek() and gzfile() on 32-bit R2.12.0 in linux

From: Matt Shotwell <shotwelm_at_musc.edu>
Date: Tue, 22 Jun 2010 23:41:24 -0400

I was able to reproduce this bug. After some investigation, it appears to be localized to gztell (a zlib function) and the z_off_t type. However, there may be a broader cross-compiling problem. I don't know what procedure Brandon used to compile the 32-bit version (I used the gcc -m32 flag), but we should be sure that we're doing this correctly (and document it!) before going on a goose chase. The real issue may not be in zlib itself; it may only manifest there. A discussion of my findings is below.

-Matt

I checked that R's file function recognizes the gzip file as such, so that's not the problem. Next I modified some code in gzfile_seek, just above and below the call to gztell (line 1230 of connections.c), and defined a small function, z_off_t_print, to print the bits of the z_off_t offset, least significant bit first:

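static void z_off_t_print(z_off_t u)
{
    /* Walk a one-bit mask upward from bit 0. With a signed z_off_t the
       loop ends once the mask reaches the sign bit, so 63 bits are
       printed for an 8-byte type. */
    z_off_t mask = 1;
    while( mask > 0 ) {
        printf("%d", (mask & u) > 0 );
        mask <<= 1;
    }
    printf("\n");
}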

static double gzfile_seek(Rconnection con, double where, int origin, int rw) {

    gzFile fp = ((Rgzfileconn)(con->private))->fp;

    /** begin modified code **/
    z_off_t pos = 0;  /* initialized so the "before" printout is well defined */

    /* sizeof yields a size_t, so cast to match the %u conversion */
    printf("sizeof(z_off_t): %u\n", (unsigned) sizeof(z_off_t));
    printf("sizeof(double): %u\n", (unsigned) sizeof(double));
    printf("before gztell():\n");
    z_off_t_print(pos);
    pos = gztell(fp);
    printf("after gztell():\n");
    z_off_t_print(pos);
    printf("(double) pos: %f\n", (double) pos);
    /** end modified code **/
    ...
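
As an aside, the helper above stops at the sign bit (which is why the dumps below have 63 digits), and shifting a signed mask into the sign bit is, strictly speaking, undefined behavior in C. A minimal sketch of a variant that sidesteps both points by working on an unsigned 64-bit copy of the value follows. The name bits_lsb_first and the buffer-based interface are hypothetical; they are not part of R or zlib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the lowest nbits bits of v into buf as '0'/'1'
   characters, least significant bit first, followed by a terminating NUL.
   buf must have room for nbits + 1 chars; at most 64 bits are written. */
static char *bits_lsb_first(uint64_t v, unsigned nbits, char *buf)
{
    unsigned i;

    assert(buf != NULL);
    for (i = 0; i < nbits && i < 64; i++)
        buf[i] = ((v >> i) & 1u) ? '1' : '0';
    buf[i] = '\0';
    return buf;
}
```

Inside gzfile_seek one could call it with a 65-char buffer and (uint64_t) pos, then print the result with puts; the output has the same layout as z_off_t_print plus the 64th bit.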

Here's what happens when I run code similar to yours in the 32-bit build:

> zz <- gzfile("ex.gz", "w") # compressed file
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.gz", "r")
> seek(blah, 5)

sizeof(z_off_t): 8
sizeof(double): 8
before gztell():

000000000000000000000000000000000000000000000000000000000000000
after gztell():
000000000000000000000000000000000000110000111011110111001001000
(double) pos: 665367468683821056.000000
[1] 6.653675e+17
> seek(blah)

before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
101000000000000000000000000000000000110000111011110111001001000
(double) pos: 665367468683821056.000000
[1] 6.653675e+17

Hence, gztell is doing what we expect in the least significant 32 bits (0 on the first call, then binary 101, i.e. decimal 5, on the second), but returns junk in the most significant 32 bits. Here are the results for the 64-bit build:

> zz <- gzfile("ex.gz", "w") # compressed file
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.gz", "r")
> seek(blah, 5)

sizeof(z_off_t): 8
sizeof(double): 8
before gztell():

000000000000000000000000000000000000000000000000000000000000000
after gztell():
000000000000000000000000000000000000000000000000000000000000000
(double) pos: 0.000000
[1] 0
> seek(blah)

before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
101000000000000000000000000000000000000000000000000000000000000
(double) pos: 5.000000
[1] 5

No problems with the 64-bit build.
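
To check the reading of the 32-bit dumps above, here is a small sketch that parses one of the printed bit strings back into an integer and splits it into its lower and upper 32-bit halves. The function names are hypothetical, and it assumes the string is in the least-significant-bit-first layout that z_off_t_print produces.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: parse a run of '0'/'1' characters that was printed
   least significant bit first. Characters beyond the 64th are ignored. */
static uint64_t parse_lsb_first(const char *bits)
{
    uint64_t value = 0;
    size_t i, n;

    assert(bits != NULL);
    n = strlen(bits);
    for (i = 0; i < n && i < 64; i++) {
        if (bits[i] == '1')
            value |= (uint64_t)1 << i;
    }
    return value;
}

/* Lower and upper 32-bit halves of a 64-bit value. */
static uint32_t low_half(uint64_t v)  { return (uint32_t)(v & 0xFFFFFFFFu); }
static uint32_t high_half(uint64_t v) { return (uint32_t)(v >> 32); }
```

Fed the two "after gztell()" lines from the 32-bit transcript, this gives a low half of 0 and then 5, with the same nonzero high half both times. That high half puts the value near 6.65e17, where adjacent doubles are 128 apart, so the extra 5 is lost in the conversion; this is why (double) pos prints the same number after both calls.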

On Tue, 2010-06-22 at 13:04 -0400, Brandon Whitcher wrote:
> I have installed both 32-bit and 64-bit versions of R2.12.0 (2010-06-15
> r52300) on my Ubuntu 10.04 64-bit system. I observe the following behavior
> when running the examples from base::connections. There appears to be a
> problem with seek() on a .gz file when using a 32-bit installation of
> R2.12.0, but the problem doesn't appear in the 64-bit installation. I
> realize that seek() has been difficult in the past, and I don't want to open
> old wounds, but is this a known problem? Is this easily fixable? I have a
> package that relies on seek() when accessing gzipped files.
>
> Using the 32-bit installation...
>
> > zz <- file("ex.data", "w") # open an output file connection
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep =
> "\n")
> > cat("One more line\n", file = zz)
> > close(zz)
> > blah = file("ex.data", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- gzfile("ex.gz", "w") # compressed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep =
> "\n")
> > close(zz)
> > blah = file("ex.gz", "r")
> > seek(blah)
> [1] 7.80707e+17
> >
> > zz <- bzfile("ex.bz2", "w") # bzip2-ed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep =
> "\n")
> > close(zz)
> > blah = file("ex.bz2", "r")
> > seek(blah)
> Error in seek.connection(blah) : 'seek' not enabled for this connection
> >
>
> Using the 64-bit installation...
>
> > zz <- file("ex.data", "w") # open an output file connection
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > cat("One more line\n", file = zz)
> > close(zz)
> > blah = file("ex.data", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- gzfile("ex.gz", "w") # compressed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.gz", "r")
> > seek(blah)
> [1] 0
> >
> > zz <- bzfile("ex.bz2", "w") # bzip2-ed file
> > cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> > close(zz)
> > blah = file("ex.bz2", "r")
> > seek(blah)
> Error in seek.connection(blah) : 'seek' not enabled for this connection
> >
>
> thanks,
>
> Brandon
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com

Received on Wed 23 Jun 2010 - 03:43:57 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 23 Jun 2010 - 07:21:16 GMT.
