Re: [Rd] warning for inefficiently compressed datasets

From: Uwe Ligges <ligges_at_statistik.tu-dortmund.de>
Date: Wed, 07 Dec 2011 09:34:08 +0100

On 06.12.2011 23:28, Hervé Pagès wrote:
> Hi,
>
> Recently added to doc/NEWS.Rd:
>
> 'R CMD check' now gives a warning rather than a note if it finds
> inefficiently compressed datasets. With 'bzip2' and 'xz' compression
> having been available since R 2.10.0, there is no excuse for not
> using them.
>
> Why isn't a note enough for this?
>
> Generally speaking, warnings are for things that are dangerous,
> or unsafe, or unportable, or for anything that could potentially
> cause trouble. I don't see how using gzip instead of bzip2 or xz
> could fall into that category (and BTW gzip is the default for
> save() and for 'R CMD build' resave-data feature).
>
> The problem is that bzip2 and xz compressions are slower and also
> require more memory than gzip. Bioconductor has big data packages
> and sometimes it makes sense to use gzip and not bzip2 or xz. For
> example, when loading Human chromosome 1 from disk, bzip2 and xz
> are 7 and 3.4 times slower than gzip, respectively:
>
> > system.time(load("chr1-gzip.rda"))
>    user  system elapsed
>   1.210   0.180   1.384
>
> > system.time(load("chr1-bzip2.rda"))
>    user  system elapsed
>   9.500   0.160   9.674
>
> > system.time(load("chr1-xz.rda"))
>    user  system elapsed
>    4.46    0.20    4.69
>
> hpages@latitude:~/testing$ ls -lhtr chr1-*.rda
> -rw-r--r-- 1 hpages hpages 61M 2011-12-06 12:13 chr1-gzip.rda
> -rw-r--r-- 1 hpages hpages 55M 2011-12-06 12:15 chr1-bzip2.rda
> -rw-r--r-- 1 hpages hpages 49M 2011-12-06 12:25 chr1-xz.rda
>
> This is with R-2.14.0 on a 64-bit Ubuntu laptop with 8GB of RAM.
>
> The size on disk doesn't really matter and it doesn't matter either
> that the source tarball for the full Human genome ends up being 20%
> bigger when using gzip instead of xz: the 20% extra time it takes to
> download it (which needs to be done only once) will largely be
> compensated by the fact that most analyses will run faster e.g. in
> 40-45 sec. instead of more than 2 minutes (for many short analyses,
> loading the chromosomes into memory is the bottleneck).

Oh, from the European side this 20% extra time may amount to an hour when downloading from the BioC master rather than a mirror. And space and traffic are issues for CRAN.
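
Anyone wanting to weigh file size against load time for their own data could start from a rough sketch like the one below. It uses synthetic data and temporary file names of my own choosing (not the chromosome files above), so the numbers it prints will not match the ones quoted, and the ranking may differ by machine and by dataset.

```r
# Synthetic stand-in for a large dataset (an assumption; not the chr1 data).
set.seed(1)
x <- sample(c("A", "C", "G", "T"), 1e6, replace = TRUE)

# Save the same object once per compression method supported by save().
methods <- c("gzip", "bzip2", "xz")
files <- file.path(tempdir(), paste0("x-", methods, ".rda"))
names(files) <- methods
for (m in methods) save(x, file = files[[m]], compress = m)

# Time load() for each file, loading into a scratch environment.
timings <- sapply(files, function(f) {
  e <- new.env()
  system.time(load(f, envir = e))[["elapsed"]]
})

print(data.frame(method = methods,
                 bytes = file.info(files)$size,
                 load_sec = timings))

# tools::checkRdaFiles() reports the compression actually used in each file.
print(tools::checkRdaFiles(files))
```

For existing files, tools::resaveRdaFiles() can re-save them with a chosen compression.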

> Is there a way to turn this warning off? If not, could an option be
> added to 'R CMD check' to turn this warning off? Something along the
> lines of the --no-resave-data option for 'R CMD build'.

The manual tells us:

"The following environment variables can be used to customize the operation of check: a convenient place to set these is the file ‘~/.R/check.Renviron’.

[...]

_R_CHECK_COMPACT_DATA2_ If true, check data for ascii and uncompressed saves, and also check if using bzip2 or xz compression would be significantly better. Implies _R_CHECK_COMPACT_DATA_ is true. Default: true."
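
Going by the text quoted above, a minimal sketch would be a ‘~/.R/check.Renviron’ containing the line below. The value "false" is my reading of "If true ... Default: true"; whether _R_CHECK_COMPACT_DATA_ also needs to be set for your purposes is something I have not verified here.

```
_R_CHECK_COMPACT_DATA2_=false
```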

Uwe

>
> Thanks,
> H.
>



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 07 Dec 2011 - 08:37:27 GMT