[Rd] warning for inefficiently compressed datasets

From: Hervé Pagès <hpages_at_fhcrc.org>
Date: Tue, 06 Dec 2011 14:28:52 -0800


Hi,

Recently added to doc/NEWS.Rd:

   'R CMD check' now gives a warning rather than a note if it finds
   inefficiently compressed datasets. With 'bzip2' and 'xz' compression
   having been available since R 2.10.0, there is no excuse for not
   using them.

Why isn't a note enough for this?

Generally speaking, warnings are for things that are dangerous, unsafe, or unportable, or for anything that could potentially cause trouble. I don't see how using gzip instead of bzip2 or xz falls into that category (and BTW gzip is the default for save() and for the resave-data feature of 'R CMD build').
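For reference, the point about the save() default can be checked with a short sketch like the one below. It assumes a current R installation with the 'tools' package available; the object `x` and the temporary file are made up for illustration and are not from the original message.

```r
# Minimal sketch: inspect which compression save() applies by default.
# 'x' is hypothetical example data, not from the original post.
x <- rnorm(1e4)
f <- tempfile(fileext = ".rda")

# No 'compress' argument is given, so save() uses its default.
save(x, file = f)

# tools::checkRdaFiles() reports size, ASCII flag, compression type
# and serialization version for each file.
info <- tools::checkRdaFiles(f)
print(info)
```

The `compress` column of the result names the compression detected in the file, which is also what the dataset check looks at.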

The problem is that bzip2 and xz compressions are slower and also require more memory than gzip. Bioconductor has big data packages and sometimes it makes sense to use gzip and not bzip2 or xz. For example, when loading Human chromosome 1 from disk, bzip2 and xz are 7 and 3.4 times slower than gzip, respectively:

> system.time(load("chr1-gzip.rda"))
   user  system elapsed
  1.210   0.180   1.384
> system.time(load("chr1-bzip2.rda"))
   user  system elapsed
  9.500   0.160   9.674
> system.time(load("chr1-xz.rda"))
   user  system elapsed
   4.46    0.20    4.69

hpages_at_latitude:~/testing$ ls -lhtr chr1-*.rda
-rw-r--r-- 1 hpages hpages 61M 2011-12-06 12:13 chr1-gzip.rda
-rw-r--r-- 1 hpages hpages 55M 2011-12-06 12:15 chr1-bzip2.rda
-rw-r--r-- 1 hpages hpages 49M 2011-12-06 12:25 chr1-xz.rda

This is with R-2.14.0 on a 64-bit Ubuntu laptop with 8GB of RAM.
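
A comparison of this kind can be reproduced roughly as sketched below. This is only an illustration under stated assumptions: the object `chr` is a made-up random DNA string standing in for a real chromosome (the actual chr1 object is not shown in this message), the file names are temporary, and timings will differ by machine and R version.

```r
# Sketch: save one object with each compression type, then compare
# file size and load time. 'chr' is a hypothetical stand-in for a
# chromosome sequence; it is NOT the data used in the original post.
set.seed(1)
chr <- paste(sample(c("A", "C", "G", "T"), 1e6, replace = TRUE),
             collapse = "")

methods <- c("gzip", "bzip2", "xz")

# Write one .rda file per compression method.
files <- vapply(methods, function(m) {
  f <- tempfile(pattern = paste0("chr-", m, "-"), fileext = ".rda")
  save(chr, file = f, compress = m)
  f
}, character(1))

# Time load() for each file, loading into a scratch environment.
timings <- vapply(files, function(f) {
  e <- new.env()
  system.time(load(f, envir = e))[["elapsed"]]
}, numeric(1))

result <- data.frame(method     = methods,
                     size_bytes = file.size(files),
                     load_sec   = timings,
                     row.names  = NULL)
print(result)
```

Random sequence compresses differently from real genomic data, so the ratios from this sketch should not be expected to match the numbers reported above.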

The size on disk doesn't really matter. Nor does it matter that the source tarball for the full Human genome ends up being 20% bigger with gzip than with xz: the 20% extra time it takes to download it (which needs to be done only once) is more than offset by the fact that most analyses will run faster, e.g. in 40-45 sec. instead of more than 2 minutes (for many short analyses, loading the chromosomes into memory is the bottleneck).

Is there a way to turn this warning off? If not, could an option be added to 'R CMD check' to turn this warning off? Something along the lines of the --no-resave-data option for 'R CMD build'.

Thanks,
H.

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages_at_fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue 06 Dec 2011 - 22:31:29 GMT
