Re: [R] Relational Databases or XML?

From: Martin Morgan
Date: Thu, 10 Apr 2008 14:31:00 -0700

Harold -- you'll really want to check out the XML package. xmlTreeParse + xpathApply provides a very flexible solution. As a recent example, parsing 189 XML files to extract 4 attributes from deeply nested elements into a data frame:

library(XML)   # provides xmlTreeParse, xpathApply, xmlValue

fls <- list.files('~/runBrowser', pattern=".*xml", full=TRUE)
f <- function(fl) {
     ## run one XPath query and flatten the result to a vector
     xq <- function(xml, q)
         unlist(xpathApply(xml, q, xmlValue, namespaces="xsi"))
     xml <- xmlTreeParse(fl, useInternal=TRUE)
     ## 4 images per tile, so tile attributes are repeated 4 times
     data.frame(idx=rep(as.numeric(xq(xml, "//xsi:tile/@idx")), each=4),
         lane=rep(as.numeric(xq(xml, "//xsi:tile/@lane")), each=4),
         base=xq(xml, '//xsi:image/@base'),
         medSigInt=as.numeric(xq(xml, "//xsi:sgnInt/@median")))
}
res <- do.call('rbind', lapply(fls, f))

'res' has 54800 rows and 4 columns. The XML stays in C, so this is fast. The data can be effectively (your mileage may vary) visualized with lattice, e.g.,

xyplot(log(medSigInt)~idx|lane*base, res, strip=FALSE, pch=".", cex=2)
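
For anyone without those run files, here is a minimal self-contained sketch of the same parse-then-query pattern. The XML text, its element and attribute names, and the two-images-per-tile layout are made up for illustration and only mimic the queries above. It also assumes a document without namespaces, so no 'namespaces' argument is needed, and it pulls attributes with xmlGetAttr on the matching nodes rather than querying the attributes directly.

```r
library(XML)

## Hypothetical document: 2 tiles, 2 images per tile (not the real run files)
txt <- '<run>
  <tile idx="1" lane="1">
    <image base="A"><sgnInt median="10.5"/></image>
    <image base="C"><sgnInt median="12.0"/></image>
  </tile>
  <tile idx="2" lane="1">
    <image base="A"><sgnInt median="9.5"/></image>
    <image base="C"><sgnInt median="11.0"/></image>
  </tile>
</run>'

## Parse from a string; the document is kept as internal (C-level) nodes
xml <- xmlTreeParse(txt, asText=TRUE, useInternalNodes=TRUE)

## Pull one attribute from every node matched by an XPath query
xa <- function(q, attr) xpathSApply(xml, q, xmlGetAttr, attr)

df <- data.frame(idx=rep(as.numeric(xa("//tile", "idx")), each=2),
                 lane=rep(as.numeric(xa("//tile", "lane")), each=2),
                 base=xa("//image", "base"),
                 medSigInt=as.numeric(xa("//sgnInt", "median")))
```

'df' has one row per image element. The rep(..., each=2) step relies on every tile holding exactly two images, just as each=4 does above; a file with a varying number of images per tile would need a different way to pair tiles with images.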


Doran, Harold wrote:

> I'm not sure it is possible to parse an XML file in R directly. Well, I
> guess it's *possible*, but may not be the best way to do it. ElementTree
> in Python is an easy-to-use parser that you might use to first parse
> your XML file (or other hierarchically structured data), organize it
> any way you want, and then bring those data into R for subsequent
> analysis.
> In fact, I have recently done just this. I have another statistical
> program that outputs data as an XML file. So, I wrote a python program
> that parses that XML file, pulls out the data of interest into a text
> file, and then I bring those data into R for analysis.

>> -----Original Message-----
>> From: Keith Alan Chamberlain
>> Sent: Thursday, April 10, 2008 4:14 PM
>> Subject: [R] Relational Databases or XML?
>> Dear R-Help,
>> I am working on a paper in an R course for large file support
>> in R using scan(), relational databases, and XML. I have
>> never used SQL or hierarchical document formats such as XML
>> (except where it occurs without user interaction), and
>> knowledge in RDBs and XML is lacking in my program. I have
>> tried finding a working example for the novice's novice on the
>> topic, read many postings, the R Data Import/Export manual several
>> times, and descriptions of packages RODBC, DBI, XML, among
>> others. I understand that RDBs are (assumed at least) used
>> widely among the R community. I have not been able to put all
>> of the pieces together, but assuming that RDB use is actually
>> quite widespread, it should be quite easy to fill me in
>> and/or correct my understanding where necessary.
>> For a cross-platform solution (PC/OSX at least, or in part)
>> my questions/problems are about what preliminary steps are
>> needed to get an SQL or XML query "to work" in R to begin
>> with, what the appropriate data-file formats are, and how to
>> convert to them if starting out with data in, say, a
>> delimited ASCII text file. Very basic examples should
>> suffice, say, a table with 20 random observations, a grouping
>> variable with 2 levels, and a factor with 2 levels.
>> ## untested code
>> set.seed(1024)
>> ## write.table takes the data first, then the file name
>> write.table(data.frame(Subj=c(rep(1,10),rep(2,10)),
>>     block=rep(c(rep(-1,5),rep(1,5)),2), obs=rnorm(20,0,1)),
>>     "junk.txt")
>> Specifically,
>> 1- what are the minimum required non-R components that are
>> needed to support SQL or XML functionality, which may or may
>> not need to be installed?
>> 2- what R packages need to be installed, at a minimum (also
>> as a cross-PC/Mac solution if possible or at least as much as
>> possible)
>> 3- I keep seeing references to connections of a given name "if
>> previously setup". What kind of setup is needed outside of R, if any?
>> 4- what steps are needed in R to then connect to a file and
>> import a subset based on a query?
>> 5- Do I then use standard R routines (e.g. write()) to export
>> as a DB, or an RDB/XML specific function?
>> Sincerely,
>> KeithC. [U.S]
>> 1/k^c
>> ______________________________________________
>> mailing list
>> PLEASE do read the posting guide
>> and provide commented, minimal, self-contained, reproducible code.

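On the relational side of questions 2, 4 and 5 above, a minimal sketch, assuming the DBI and RSQLite packages are installed (RSQLite is not named in the thread; it is chosen here only because SQLite needs no server or connection set up outside R). The table name 'junk' and the query are illustrative.

```r
library(DBI)

## Same shape as the 'junk' data in the question: 2 subjects, 2 blocks
set.seed(1024)
junk <- data.frame(Subj=rep(1:2, each=10),
                   block=rep(rep(c(-1, 1), each=5), 2),
                   obs=rnorm(20))

## In-memory SQLite database (assumption: RSQLite as the DBI back end)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

## Export the data frame as a table, then import a subset with a query
dbWriteTable(con, "junk", junk)
sub <- dbGetQuery(con, "SELECT Subj, obs FROM junk WHERE block = 1")

dbDisconnect(con)
```

'sub' is an ordinary data frame holding only the rows with block equal to 1. Replacing ":memory:" with a file name would keep the database on disk between sessions.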
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

Received on Thu 10 Apr 2008 - 21:36:29 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 11 Apr 2008 - 13:30:28 GMT.
