Re: [R] [Possible SPAM] Reading selected lines in an .html file

From: Martin Morgan <mtmorgan_at_fhcrc.org>
Date: Thu, 05 Jun 2008 14:07:52 -0700

Staying in R, the XML package combined with the XPath query language is likely to be your friend.

> library(XML)
> html=htmlTreeParse("http://www.wunderground.com/global/stations/16239.html", useInternal=TRUE)
> xpathApply(html, "//span[@pwsvariable='tempf' and
+ @pwsid='LIRA']/@value", xmlValue)
[[1]]
[1] "63"

See http://www.w3.org/TR/xpath, especially http://www.w3.org/TR/xpath#path-abbrev, for XPath hints.

Martin

Daniel Folkinshteyn <dfolkins_at_gmail.com> writes:

> i know this is an R mailing list :) but... i'll recommend you try
> python with the beautifulsoup module - makes html processing a cinch.
>
> another thing to note is that wunderground provides very handy RSS
> feeds for every location, so rather than parsing the html page (with
> its associated bundles of gunk), you'd have a better time parsing the
> RSS feed. (there are some rss parsing libraries for python, too, but
> in your simple case it may be simpler to just extract stuff manually
> with some well-placed regexps)
>
> so use python to pull that out, and append to a nice tab-delimited
> file, and then in your R process just read from that file.
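
For what it's worth, the Python route Daniel describes above could look roughly like the sketch below. It pulls the temperature and humidity out of a feed string with two regexps and appends them to a tab-delimited file. The sample text, both patterns, and the names extract_reading and append_reading are assumptions made up for illustration; the real feed layout is not shown in this thread, so check it before relying on the patterns.

```python
import csv
import re
from datetime import datetime, timezone

# Hypothetical feed text: the real wunderground RSS layout is not shown in
# this thread, so this sample string and both patterns are assumptions.
SAMPLE = "Temperature: 63.0&#176;F / 17.2&#176;C | Humidity: 72% | Pressure: 29.92in"

# A number immediately followed by a degree sign (entity or literal) and "C".
TEMP_C = re.compile(r"(-?\d+(?:\.\d+)?)\s*(?:&#176;|°)C")
# An integer percentage after the word "Humidity:".
HUMIDITY = re.compile(r"Humidity:\s*(\d+)%")


def extract_reading(text):
    """Return (temp_c, humidity_pct), or None if either field is missing."""
    temp = TEMP_C.search(text)
    hum = HUMIDITY.search(text)
    if temp is None or hum is None:
        return None
    return float(temp.group(1)), int(hum.group(1))


def append_reading(path, station, reading, when=None):
    """Append one tab-delimited row: timestamp, station, temp_c, humidity."""
    when = when or datetime.now(timezone.utc)
    with open(path, "a", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerow([when.isoformat(), station, reading[0], reading[1]])
```

Calling append_reading("readings.tsv", "LIRA", extract_reading(SAMPLE)) once an hour would grow the file one row at a time (readings.tsv is a made-up file name), and the R process could then pick it up with read.delim("readings.tsv", header = FALSE).
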
>
> on 06/05/2008 04:45 PM Nutter, Benjamin said the following:
>> I've tried to tackle a similar question at the request of a coworker.
>> Unfortunately, it is difficult to read in HTML code because it lacks a
>> character that can consistently be used as a delimiter. The only
>> guideline I can offer is that any text you're interested in is going to
>> be between a ">" and a "<". So the goal is to eliminate anything
>> between < and >.
>> What's more, if you really want to read in HTML code, you'll need a good
>> grasp of HTML itself, and some familiarity with how the code you're
>> reading in is structured. For instance, I'm attaching code that I wrote
>> to read in HTML tables that were generated by other functions commonly
>> used in my work place. But my code assumes that the tables are written
>> by row (using the <tr> tag).
>> Essentially, after studying the code I was going to read in, I hand
>> picked the markers that I could use to isolate the text I wanted. I
>> then proceeded to play a game of Simon Says to break down the code to
>> smaller and smaller pieces until I got what I wanted. Unless you're
>> going to be doing this a lot, I wouldn't recommend taking
>> the time to try and write a function like this. In most cases it's
>> probably faster just to copy the data by hand. But if you are
>> determined to make it work, I hope the ideas help.
>> Benjamin
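
The idea Benjamin describes (drop whatever sits between "<" and ">", and lean on <tr> to find table rows) can be sketched in a few lines. This is not his attached code, which is not reproduced in the archive; it is a rough illustration in Python, and the function names and the sample markup are invented. It assumes simple, non-nested tables and will trip over comments, scripts, or a ">" inside an attribute value, which is where a real parser such as the XML package earns its keep.

```python
import re

# Anything from "<" up to the next ">" is treated as a tag.
TAG = re.compile(r"<[^>]*>")
# One table row, and one header or data cell inside a row.
ROW = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)


def strip_tags(fragment):
    """Drop everything between '<' and '>' and collapse leftover whitespace."""
    return " ".join(TAG.sub(" ", fragment).split())


def table_rows(html):
    """Split a simple, non-nested table into rows of cell text."""
    return [
        [strip_tags(cell) for cell in CELL.findall(row)]
        for row in ROW.findall(html)
    ]


# Made-up markup for illustration only, not the wunderground page.
SAMPLE = """
<table>
  <tr><th>Station</th><th>Temp</th></tr>
  <tr><td><b>LIRA</b></td><td>63</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in table_rows(SAMPLE):
        print("\t".join(row))
```
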
>> -----Original Message-----
>> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org]
>> On Behalf Of vittorio
>> Sent: Wednesday, June 04, 2008 3:50 PM
>> To: r-help_at_stat.math.ethz.ch
>> Subject: [Possible SPAM] [R] Reading selected lines in an .html file
>> Dear friend,
>> In an R program running permanently on a server, I would like to read,
>> hour by hour, the temperature in °C and the humidity from a site like
>> this (actually, from many such sites):
>> http://www.wunderground.com/global/stations/16239.html
>> How can I read the content of the site and select the info I need?
>> Ciao
>> Vittorio
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793

Received on Thu 05 Jun 2008 - 22:26:01 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 05 Jun 2008 - 22:30:41 GMT.
