Re: [Rd] RData File Specification?

From: Simon Urbanek <simon.urbanek_at_r-project.org>
Date: Fri, 24 Aug 2007 22:07:18 -0400

On Aug 24, 2007, at 2:06 PM, Hin-Tak Leung wrote:

> I was going to write 'Use the source, Luke', but it seems that you
> have already found the relevant source files. I wrote a Python-based
> Rdata writer and a reader some time ago just using that info and I am
> not aware of any file spec, so I know those two files are sufficient.
> For what you want to do, I think you'll have to write some fairly
> substantial code to process the Rdata as just an XDR stream (as my
> Python scripts do, using the Python built-in xdrlib),

Unfortunately the format is not true XDR (it is not padded properly - CHARs (incl. symbols etc.) and raw vectors violate the padding rules), so you have to fall back to low-level access for some parts. In effect, the only part of XDR used is the storage of int and double (which is quite trivial), so IMHO any language (even without an XDR library) will do ...
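
A minimal sketch of the point above, in Python and using only the standard struct module rather than an XDR library (xdrlib was removed from the standard library in Python 3.13). It reads ints and doubles as big-endian values, and reads CHARs by taking exactly the stated number of bytes with no padding to a 4-byte boundary. The function names are illustrative, not part of any R or Python API; the type codes 9, 14 and 16 are assumed to be R's CHARSXP, REALSXP and STRSXP numbers, and the LEVELS bytes are copied from the "levels" vector in the dump quoted further down in this message.

```python
import struct
from io import BytesIO

# Bytes of the "levels" character vector, copied from the quoted dump:
# STRSXP flags, length 3, then three CHARSXPs of length 1 each.
LEVELS = (
    b"\x00\x00\x00\x10\x00\x00\x00\x03"
    b"\x00\x00\x00\x09\x00\x00\x00\x01a"
    b"\x00\x00\x00\x09\x00\x00\x00\x01b"
    b"\x00\x00\x00\x09\x00\x00\x00\x01c"
)


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError on a short read."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("wanted %d bytes, got %d" % (n, len(data)))
    return data


def read_int(stream):
    """4-byte big-endian signed integer (same layout as an XDR int)."""
    return struct.unpack(">i", read_exact(stream, 4))[0]


def read_double(stream):
    """8-byte big-endian IEEE double (same layout as an XDR double)."""
    return struct.unpack(">d", read_exact(stream, 8))[0]


def read_real_vector(stream):
    """Read a numeric vector: flags, length, then that many doubles."""
    flags = read_int(stream)
    if flags & 0xFF != 14:  # assumed REALSXP type code
        raise ValueError("not a numeric vector")
    return [read_double(stream) for _ in range(read_int(stream))]


def read_string_vector(stream):
    """Read a character vector whose elements are unpadded CHARs."""
    flags = read_int(stream)
    if flags & 0xFF != 16:  # assumed STRSXP type code
        raise ValueError("not a character vector")
    result = []
    for _ in range(read_int(stream)):
        if read_int(stream) & 0xFF != 9:  # assumed CHARSXP type code
            raise ValueError("expected a CHAR element")
        length = read_int(stream)
        if length == -1:  # NA_character_ is stored with length -1
            result.append(None)
        else:
            # No padding: the next item starts right after these bytes.
            result.append(read_exact(stream, length).decode("latin-1"))
    return result


if __name__ == "__main__":
    print(read_string_vector(BytesIO(LEVELS)))
```

A true XDR string decoder would skip three padding bytes after each one-byte string here and lose its place in the stream, which is why the string case has to be handled by hand.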

Cheers,
Simon

> because as far as I know the API you are after is not exposed - you'll
> have to - and you can - cut and paste a substantial part of saveload.c
> and serialize.c for that matter, of course.
>
> I think my Python-based Rdata reader would do most of what you want
> (it was written mostly for diagnostic purposes, as I was
> 'hand-crafting' R objects in C, saving them as Rdata, and then
> reading them back to tell me what's wrong with them, if anything)
> except it dumps a sort of general human-readable ASCII text format
> rather than CSV...
>
> My suggestion would be to use a language you are comfortable with
> which comes with an XDR library, and just do it by hand...
>
> Cook, Ian wrote:
>> Hi,
>>
>> I am developing a tool for converting a large data frame stored in
>> an uncompressed binary (XDR) RData file to a delimited text file.
>> The data frame is too large to load() and extract rows from on a
>> typical PC. I'm looking to parse through the file and extract
>> individual entries without loading the whole thing into memory.
>>
>> In terms of some C source functions, instead of doing
>> RestoreToEnv(R_Unserialize(connection)), which is essentially what
>> load() does, I'm looking to get the documentation I would need to
>> build a function "SaveToCSV()" so that I could do
>> SaveToCSV(R_Unserialize(connection)).
>>
>> Where can I get documentation on the RData file format? Does a
>> spec document exist?
>>
>> See details below.
>>
>> Thanks,
>> Ian
>>
>> Ian Cook | Advanced Micro Devices, Inc. | ian.cook_at_amd.com
>>
>> -------------------------
>>
>> Additional details:
>>
>> I've browsed through the relevant source code (saveload.c,
>> serialize.c) for ideas.
>>
>> Here's a demo of the problem I'm looking to solve:
>>
>> # create a sample data frame
>> ds <- data.frame(row1=c(1,2,3),row2=c('a','b','c'))
>> # save into an uncompressed binary R dataset
>> save(ds,file="ds.rdata",compress=FALSE)
>> rm(ds)
>>
>> # Then load() can be simulated like this:
>>
>> # create and open a file connection
>> con <- file("ds.rdata",open="rb")
>> # read the first 5 characters
>> readChar(con,5)
>> # unserialize the remainder and restore to the environment
>> ds <- unserialize(con,NULL)[["ds"]]
>> close(con)
>>
>> But this takes up too much memory if the data set is too big. I
>> can read in the file character-by-character, i.e. using readChar(),
>> but it's obvious that the file format is not trivial.
>> readChar(con,10000) for this demo yields:
>>
>> RDX2\nX\n\0\0\0\002\0\002\004\001\0\002\003\0\0\0\004\002\0\0\0\001
>> \0\0\020\t\0\0\0\002ds\0\0\003\023\0\0\0\002\0\0\0\016\0\0\0\003?
>> \0\0\0\0\0\0@\0\0\0\0\0\0\0@\b\0\0\0\0\0\0\0\0\003\r\0\0\0\003\0\0
>> \0\001\0\0\0\002\0\0\0\003\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0
>> \006levels\0\0\0\020\0\0\0\003\0\0\0\t\0\0\0\001a\0\0\0\t\0\0\0
>> \001b\0\0\0\t\0\0\0\001c\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0
>> \005class\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\006factor\0\0\0\0\0
>> \004\002\0\0\0\001\0\0\020\t\0\0\0\005names\0\0\0\020\0\0\0\002\0\0
>> \0\t\0\0\0\004row1\0\0\0\t\0\0\0\004row2\0\0\004\002\0\0\0\001\0\0
>> \020\t\0\0\0\trow.names\0\0\0\r\0\0\0\002\0\0\0\0\0\0\003\0\0\004
>> \002\0\0\003\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\ndata.frame\0\0\0
>> \0\0\0
>>
>> This would be parse-able if I had a file spec. Thanks.
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
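
For readers following the dump quoted above, here is a rough Python sketch of how its first bytes appear to break down, based on a reading of serialize.c and saveload.c rather than on any official spec. The names DEMO_PREFIX, decode_r_version, decode_flags and read_header are made up for this illustration; the bit positions for the object, attribute and tag flags (bits 8, 9 and 10) are assumptions taken from that reading of the source.

```python
import struct
from io import BytesIO

# First 23 bytes of the quoted readChar() output, written out as bytes.
DEMO_PREFIX = (
    b"RDX2\n"             # magic written by save()
    b"X\n"                # serialization format marker: XDR-style binary
    b"\x00\x00\x00\x02"   # serialization format version
    b"\x00\x02\x04\x01"   # R version that wrote the file
    b"\x00\x02\x03\x00"   # minimal R version able to read it
    b"\x00\x00\x04\x02"   # flags word of the first item
)


def decode_r_version(packed):
    """Split a packed version integer into (major, minor, patch)."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


def decode_flags(flags):
    """Split an item's flags word into its type code and marker bits."""
    return {
        "type": flags & 0xFF,                # low byte: type code
        "is_object": bool(flags & 0x100),    # assumed bit 8
        "has_attr": bool(flags & 0x200),     # assumed bit 9
        "has_tag": bool(flags & 0x400),      # assumed bit 10
    }


def read_header(stream):
    """Check the magic and format marker, then read the three version ints."""
    if stream.read(5) != b"RDX2\n":
        raise ValueError("not an RDX2 file")
    if stream.read(2) != b"X\n":
        raise ValueError("not the XDR-style binary format")
    raw = stream.read(12)
    if len(raw) != 12:
        raise EOFError("truncated header")
    version, writer, min_reader = struct.unpack(">3i", raw)
    return {
        "format_version": version,
        "writer_version": decode_r_version(writer),
        "min_reader_version": decode_r_version(min_reader),
    }


if __name__ == "__main__":
    stream = BytesIO(DEMO_PREFIX)
    print(read_header(stream))
    print(decode_flags(struct.unpack(">i", stream.read(4))[0]))
```

Under these assumptions the demo file was written by R 2.4.1, and its first item is a tagged pairlist node (type code 2), which fits save() storing each object under its name, here "ds". A streaming converter would presumably continue from this point one item at a time, writing values out instead of building the whole object in memory.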



Received on Sat 25 Aug 2007 - 02:12:01 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 27 Aug 2007 - 13:39:47 GMT.
