Re: [R] Analyzing Publications from Pubmed via XML

From: David Winsemius <dwinsemius_at_comcast.net>
Date: Sun, 16 Dec 2007 19:53:49 +0000 (UTC)

On 15 Dec 2007, you wrote in gmane.comp.lang.r.general:

> If we can assume that the abstract is always the 4th paragraph then we
> can try something like this:
>
> library(XML)
> doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>     isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)
>
> out <- cbind(
>     Author = unlist(xpathApply(doc, "//author", xmlValue)),
>     PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
>     Abstract = unlist(xpathApply(doc, "//description",
>         function(x) {
>             on.exit(free(doc2))
>             doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
>                 useInternalNodes = TRUE, trim = TRUE)
>             xpathApply(doc2, "//p[4]", xmlValue)
>         }
>     )))
> free(doc)
> substring(out, 1, 25) # display first 25 chars of each field
>
>
> The last line produces (it may look messed up in this email):
>
>> substring(out, 1, 25) # display it
>      Author                      PMID       Abstract
> [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
> [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
> [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
> [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
[snip]
>
>

It looked beautifully regular in my newsreader. It is helpful to see an example showing indexed access to nodes, and the use of substring() for column display was helpful as well. Thank you for this and all of your other contributions.
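For anyone reading the archive later, the two ideas mentioned above (positional node access and substring() for display) can be tried offline with a small sketch. The HTML string below is a made-up stand-in for one RSS description field, not actual PubMed output, and the name `html` is only illustrative.

```r
library(XML)

# Hypothetical stand-in for the HTML stored in one RSS <description> element
html <- paste0(
  "<html><body>",
  "<p>Journal line</p><p>Title line</p><p>Authors line</p>",
  "<p>The abstract text is in the fourth paragraph.</p>",
  "</body></html>")

doc2 <- htmlParse(html, asText = TRUE)

# //p[4] selects a <p> that is the 4th <p> child of its parent
# (XPath positions count from 1, not 0)
abstract <- xpathSApply(doc2, "//p[4]", xmlValue)

# substring() is vectorised, so the same call truncates every cell of a matrix
short <- substring(abstract, 1, 25)
print(short)
```

If the abstract is not always the 4th paragraph, the index would need to change or be replaced by a test on the paragraph content.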

Upon further browsing I find that the pmfetch access point is obsolete. Experimenting with the PubMed eFetch access point instead returns fully XML-tagged results:

library(XML)

e.fetch.doc <- function() {
   fetch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
   src.mode <- "db=pubmed&retmode=xml&"
   request <- "id=11045395"
   # build the full eFetch URL and parse the response into an internal document
   xmlTreeParse(paste(fetch.stem, src.mode, request, sep = ""),
                isURL = TRUE, useInternalNodes = TRUE)
}

# In the debugging phase I needed to set useInternalNodes = FALSE to see the tags. I never did find a way to "print" them when internal.
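One thing that may help when inspecting an internal document is saveXML() from the XML package, which serialises the tree back to text. The sketch below uses a made-up miniature record (the element values are placeholders, and real eFetch output has many more elements), so treat it as an illustration rather than a description of the actual response.

```r
library(XML)

# Hypothetical miniature record, not real eFetch output
xml <- paste0(
  "<PubmedArticle>",
  "<PMID>1</PMID>",
  "<Journal><Title>Example Journal</Title></Journal>",
  "</PubmedArticle>")

# xmlParse() returns an internal (C-level) document,
# like xmlTreeParse(..., useInternalNodes = TRUE)
doc <- xmlParse(xml, asText = TRUE)

# With no file argument, saveXML() returns the serialised document as a string
txt <- saveXML(doc)
cat(txt)

# The names of the root's children give a quick overview of the tags present
tags <- names(xmlChildren(xmlRoot(doc)))
print(tags)
```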

doc <- e.fetch.doc()

get.info <- function(doc) {
   # one column per field; the columns line up only if each XPath
   # matches exactly once per article
   df <- cbind(
      Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
      Journal  = unlist(xpathApply(doc, "//Title", xmlValue)),
      Pmid     = unlist(xpathApply(doc, "//PMID", xmlValue))
   )
   return(df)
}

# this works
> substring(get.info(doc), 1, 25)

     Abstract                    Journal                     Pmid      
[1,] "We studied the prevalence" "Pediatric nephrology (Ber" "11045395"
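
A caveat with get.info: the three columns are collected independently, so if several records are fetched and one of them lacks an abstract, cbind() would be handed vectors of different lengths and the rows would no longer correspond. A possible variant, sketched here on a made-up two-record sample, walks the articles one at a time and fills in NA for missing fields. The helper names first.value and get.info.by.article are hypothetical, and the assumption that each record sits in a PubmedArticle element is taken from the sample, not verified against the live service.

```r
library(XML)

# Hypothetical two-record sample; the second article has no abstract
xml <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><PMID>1</PMID><Journal><Title>Journal A</Title></Journal>",
  "<Abstract><AbstractText>First abstract.</AbstractText></Abstract></PubmedArticle>",
  "<PubmedArticle><PMID>2</PMID><Journal><Title>Journal B</Title></Journal></PubmedArticle>",
  "</PubmedArticleSet>")

doc <- xmlParse(xml, asText = TRUE)

# Return the first match of a relative XPath under `node`, or NA if none
first.value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# Build one row per article so that fields from the same record stay together
get.info.by.article <- function(doc) {
  articles <- getNodeSet(doc, "//PubmedArticle")
  rows <- lapply(articles, function(a) c(
    Abstract = first.value(a, ".//AbstractText"),
    Journal  = first.value(a, ".//Title"),
    Pmid     = first.value(a, ".//PMID")))
  do.call(rbind, rows)
}

info <- get.info.by.article(doc)
print(substring(info, 1, 25))
```

The leading "." in each path keeps the search relative to the current article node; without it, "//PMID" would search the whole document again.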
-- 
David Winsemius

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sun 16 Dec 2007 - 19:58:46 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 17 Dec 2007 - 01:30:18 GMT.
