Re: [Rd] RData File Specification?

From: Paul Murrell <paul_at_stat.auckland.ac.nz>
Date: Fri, 24 Aug 2007 10:45:55 +0200

Hi

Cook, Ian wrote:
> Hi,
>
> I am developing a tool for converting a large data frame stored in an
> uncompressed binary (XDR) RData file to a delimited text file. The
> data frame is too large to load() and extract rows from on a typical
> PC. I'm looking to parse through the file and extract individual
> entries without loading the whole thing into memory.
>
> In terms of some C source functions, instead of doing
> RestoreToEnv(R_Unserialize(connection)) which is essentially what
> load() does, I'm looking to get the documentation I would need to
> build a function "SaveToCSV()" so that I could do
> SaveToCSV(R_Unserialize(connection)).
>
> Where can I get documentation on the RData file format? Does a spec
> document exist?
>
> See details below.
>
> Thanks, Ian
>
> Ian Cook | Advanced Micro Devices, Inc. | ian.cook_at_amd.com
>
> -------------------------
>
> Additional details:
>
> I've browsed through the relevant source code (saveload.c,
> serialize.c) for ideas.
>
> Here's a demo of the problem I'm looking to solve:
>
> # create a sample data frame
> ds <- data.frame(row1=c(1,2,3),row2=c('a','b','c'))
> # save into an uncompressed binary R dataset
> save(ds,file="ds.rdata",compress=FALSE)
> rm(ds)
>
> # Then load() can be simulated like this:
>
> # create and open a file connection
> con <- file("ds.rdata",open="rb")
> # read the first 5 characters
> readChar(con,5)
> # unserialize the remainder and restore to the environment
> ds <- unserialize(con,NULL)[["ds"]]
> close(con)
>
> But this takes up too much memory if the data set is too big. I can
> read in the file character-by-character, i.e. using readChar(), but
> it's obvious that the file format is not trivial.
> readChar(con,10000) for this demo yields:
>
> RDX2\nX\n\0\0\0\002\0\002\004\001\0\002\003\0\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\002ds\0\0\003\023\0\0\0\002\0\0\0\016\0\0\0\003?ð\0\0\0\0\0\0@\0\0\0\0\0\0\0@\b\0\0\0\0\0\0\0\0\003\r\0\0\0\003\0\0\0\001\0\0\0\002\0\0\0\003\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\006levels\0\0\0\020\0\0\0\003\0\0\0\t\0\0\0\001a\0\0\0\t\0\0\0\001b\0\0\0\t\0\0\0\001c\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\005class\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\006factor\0\0\0þ\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\005names\0\0\0\020\0\0\0\002\0\0\0\t\0\0\0\004row1\0\0\0\t\0\0\0\004row2\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\trow.names\0\0\0\r\0\0\0\002€\0\0\0\0\0\0\003\0\0\004\002\0\0\003ÿ\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\ndata.frame\0\0\0þ\0\0\0þ
>
>
> This would be parse-able if I had a file spec. Thanks.

See the "R Internals" manual
http://cran.r-project.org/doc/manuals/R-ints.html
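
As a rough illustration of the start of the stream (a sketch only, assuming an uncompressed XDR file like the ds.rdata above; read_rdata_header() is a hypothetical helper name, not an R function): as I understand the format, the 5-byte "RDX2\n" magic is followed by the 2-byte format marker "X\n" and then three big-endian 4-byte integers giving the serialization version, the R version that wrote the file, and the minimum R version able to read it.

```r
# Sketch: decode the header of an uncompressed XDR .RData file.
# read_rdata_header() is a hypothetical helper, not part of R.
read_rdata_header <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))

  magic  <- readChar(con, 5, useBytes = TRUE)  # "RDX2\n" in the demo file
  format <- readChar(con, 2, useBytes = TRUE)  # "X\n" marks the XDR binary format

  # Three 4-byte big-endian integers follow the format marker.
  vers <- readBin(con, "integer", n = 3, size = 4, endian = "big")

  # R versions are packed as major * 65536 + minor * 256 + patch.
  unpack <- function(v) {
    paste(v %/% 65536L, (v %/% 256L) %% 256L, v %% 256L, sep = ".")
  }

  list(magic = magic,
       format = format,
       serialize_version = vers[1],
       written_by = unpack(vers[2]),
       min_reader = unpack(vers[3]))
}
```

In the dump quoted above, these three integers are the bytes \0\0\0\002, \0\002\004\001 and \0\002\003\0, i.e. serialization version 2, written by R 2.4.1, readable from R 2.3.0. Calling read_rdata_header("ds.rdata") on the demo file should report the same values.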

You might also find page 5 of R News 7/1 useful for exploring the format:
http://cran.r-project.org/doc/Rnews/Rnews_2007-1.pdf
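
When exploring the bytes after the header, it may help to know that each item starts with a 4-byte big-endian integer of packed flags. Below is a minimal sketch of unpacking one such integer. decode_flags() is a hypothetical name, and the bit positions are as I read them in serialize.c: the low 8 bits hold the type code, bit 8 is the is-object flag, bit 9 has-attributes, and bit 10 has-tag.

```r
# Sketch: split one packed flags integer from the stream into its fields.
# decode_flags() is a hypothetical helper; bit layout assumed from serialize.c.
decode_flags <- function(flags) {
  list(type      = bitwAnd(flags, 255L),         # type code, e.g. 14 = numeric vector
       is_object = bitwAnd(flags, 256L)  != 0L,  # bit 8
       has_attr  = bitwAnd(flags, 512L)  != 0L,  # bit 9
       has_tag   = bitwAnd(flags, 1024L) != 0L)  # bit 10
}

# The flags word \0\0\003\023 from the dump is 0x0313.
str(decode_flags(0x0313L))
```

For the word \0\0\003\023 (0x0313) in the dump above this gives type 19 (a generic vector, here the list underlying the data frame) with the object and attribute bits set. After that list's length, the \0\0\0\016 is type 14, a numeric vector, followed by its length 3 and three 8-byte doubles.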

Paul

-- 
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
64 9 3737599 x85392
paul_at_stat.auckland.ac.nz
http://www.stat.auckland.ac.nz/~paul/

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Fri 24 Aug 2007 - 14:12:58 GMT
