Re: [R] How to parse XML

From: Bos, Roger <roger.bos_at_us.rothschild.com>
Date: Fri, 02 May 2008 15:57:14 -0400

Martin,

I can't thank you enough for taking the time to help and providing the detailed examples of how to get started. Now I know exactly how to proceed.

Thanks again,

Roger

-----Original Message-----
From: Martin Morgan [mailto:mtmorgan_at_fhcrc.org]
Sent: Friday, May 02, 2008 12:02 PM
To: Bos, Roger
Cc: r-help_at_r-project.org
Subject: Re: [R] How to parse XML

Hi Roger --

"Bos, Roger" <roger.bos_at_us.rothschild.com> writes:

> I would like to learn how to parse a mixed text/xml document I
> downloaded from the sec.gov website (see example below). I would like

I'm not sure of a more robust way to extract the XML, but from inspection I wrote

> ftp <- "ftp://anonymous:test@ftp.sec.gov/edgar/data/1317493/0001144204-08-021221.txt"
> txt <- readLines(ftp)
> xmlInside <- grep("</*XML", txt)
> xmlTxt <- txt[seq(xmlInside[1]+1, xmlInside[2]-1)]

so that xmlTxt contains the part of the message that is XML
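
A rough sketch of the same extraction that runs without network access is given here. The contents of txt are invented stand-ins for the downloaded filing; only the <XML> / </XML> wrapper lines and the grep pattern come from the code above. The check that exactly two marker lines were found is an added assumption about the file layout, not something the original code does.

```r
# Stand-in for readLines(ftp): a few header lines, then XML wrapped in
# <XML> ... </XML> marker lines (contents are invented for illustration).
txt <- c(
  "<SEC-DOCUMENT>example header text",
  "<TEXT>",
  "<XML>",
  "<?xml version=\"1.0\"?>",
  "<ownershipDocument>",
  "  <documentType>4</documentType>",
  "  <notSubjectToSection16>0</notSubjectToSection16>",
  "</ownershipDocument>",
  "</XML>",
  "</TEXT>"
)

# Indices of the lines that open or close the wrapper: "<XML" or "</XML"
xmlInside <- grep("</*XML", txt)

# Assumption: the file holds exactly one opening and one closing marker
stopifnot(length(xmlInside) == 2)

# Keep only the lines strictly between the two markers
xmlTxt <- txt[seq(xmlInside[1] + 1, xmlInside[2] - 1)]
print(xmlTxt)
```

If a filing held more than one <XML> block, the stopifnot would fail and the index arithmetic would need to pair up the markers instead.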

> to parse this to get the value for each xml tag and then access it
> within R, but I don't know much about xml so I don't even know where to

There are several ways to proceed. I personally like the XPath query language. To do this, one might

> library(XML)
> xml <- xmlTreeParse(xmlTxt, useInternalNodes=TRUE)
> head(unlist(xpathApply(xml, "//*", xmlName)))
[1] "ownershipDocument"     "schemaVersion"         "documentType"
[4] "periodOfReport"        "notSubjectToSection16" "issuer"

xpathApply takes an XML document and performs a query. The query '//*' says find all nodes matching any character string (that's the *) that are located anywhere (that's the //) below the current (in this case root) node. This gives a list of nodes; xmlName extracts the name of each node. If I wanted all nodes not subject to section 16 (sounds ominous) I'd extract all the nodes (a list)

> node <- xpathApply(xml, "//notSubjectToSection16")

and then do something with them, e.g., look at them

> lapply(node, saveXML)
[[1]]
[1] "<notSubjectToSection16>0</notSubjectToSection16>"

(not so bad, looks like nothing is not subject to section 16, that's a relief) and extract their value

> lapply(node, xmlValue)

In one step:

> xpathApply(xml, "//notSubjectToSection16", xmlValue)
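
Pulled together, the steps above might look like the sketch below. It uses a tiny inline document so it can be run on its own; the element names are taken from the output shown earlier, but the values are invented. xpathSApply (also in the XML package) is used in place of lapply plus unlist to simplify the result to a vector.

```r
library(XML)

# Small stand-in document; element names come from the output above,
# the values are invented for illustration.
xmlTxt <- c(
  "<ownershipDocument>",
  "  <schemaVersion>X0000</schemaVersion>",
  "  <documentType>4</documentType>",
  "  <notSubjectToSection16>0</notSubjectToSection16>",
  "</ownershipDocument>"
)

# Collapse the lines into one string and parse it as text
xml <- xmlTreeParse(paste(xmlTxt, collapse = "\n"), asText = TRUE,
                    useInternalNodes = TRUE)

# Names of every element in the document
allNames <- unlist(xpathApply(xml, "//*", xmlName))

# Value of one kind of node, simplified from a list to a character vector
flag <- xpathSApply(xml, "//notSubjectToSection16", xmlValue)

print(allNames)
print(flag)
```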

?xpathApply is a good starting place, as is http://www.w3.org/TR/xpath, especially

http://www.w3.org/TR/xpath#path-abbrev
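
To give a flavour of the abbreviated syntax described at that link, here is a small sketch. The document is hypothetical: issuer appears in the filing above, but owner, name and the type attribute are made up for illustration.

```r
library(XML)

# Hypothetical document; only "ownershipDocument" and "issuer" are
# element names seen in the real filing.
doc <- xmlParse(
  '<ownershipDocument>
     <issuer><name type="company">Example Co</name></issuer>
     <owner><name type="person">A. Person</name></owner>
   </ownershipDocument>',
  asText = TRUE)

# Full path from the root, one step per "/"
issuerName <- xpathSApply(doc, "/ownershipDocument/issuer/name", xmlValue)

# "//" matches at any depth
allNames <- xpathSApply(doc, "//name", xmlValue)

# "*" matches any element: here, every child of the root
children <- xpathSApply(doc, "/ownershipDocument/*", xmlName)

# "[@type='person']" keeps only nodes whose attribute has that value
person <- xpathSApply(doc, "//name[@type='person']", xmlValue)

# ".." steps up to the parent of each matched node
parents <- xpathSApply(doc, "//name/..", xmlName)

print(list(issuerName, allNames, children, person, parents))
```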

Martin

> start debugging the errors I am getting in this example code. Can
> anyone help me get started?
>
> Thanks, Roger
>
> ftp <-
> "ftp://anonymous:test@ftp.sec.gov/edgar/data/1317493/0001144204-08-021221.txt"
> download.file(url=ftp, destfile="test2.txt")
> xmlTreeParse("test2.txt")
>
>
> **********************************************************************
> * This message is for the named person's use only. It may contain
> confidential, proprietary or legally privileged information. No right
> to confidential or privileged treatment of this message is waived or
> lost by any error in transmission. If you have received this message
> in error, please immediately notify the sender by e-mail, delete the
> message and all copies from your system and destroy any hard copies.
> You must not, directly or indirectly, use, disclose, distribute, print
> or copy any part of this message if you are not the intended
> recipient.
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

Received on Fri 02 May 2008 - 20:51:14 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 02 May 2008 - 21:30:47 GMT.
