Re: [R] [Possible SPAM] Reading selected lines in an .html file

From: Daniel Folkinshteyn <dfolkins_at_gmail.com>
Date: Thu, 05 Jun 2008 16:57:26 -0400

i know this is an R mailing list :) but... i'll recommend you try python with the beautifulsoup module - makes html processing a cinch.

another thing to note is that wunderground provides very handy RSS feeds for every location, so rather than parsing the html page (with its associated bundles of gunk), you'd have a better time parsing the RSS feed. (there are some rss parsing libraries for python, too, but in your case it may be simpler to just extract stuff manually with some well-placed regexps)
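
A rough sketch of the "well-placed regexps" idea, in Python. The feed fragment here is made up for illustration: the real wunderground feed may word and lay out its fields differently, so SAMPLE_RSS, the two patterns, and the name extract_conditions are all assumptions to adapt after looking at the actual feed text.

```python
import re

# Hypothetical RSS fragment; the real feed's wording and layout may differ.
SAMPLE_RSS = """<item>
  <title>Current Conditions</title>
  <description>Temperature: 21.5 C | Humidity: 64% | Pressure: 1013 hPa</description>
</item>"""

# Patterns written against the hypothetical fragment above.
TEMP_RE = re.compile(r"Temperature:\s*(-?\d+(?:\.\d+)?)\s*C")
HUMIDITY_RE = re.compile(r"Humidity:\s*(\d+(?:\.\d+)?)\s*%")


def extract_conditions(rss_text):
    """Return (temperature_c, humidity_pct); a field that is not found is None."""
    temp = TEMP_RE.search(rss_text)
    hum = HUMIDITY_RE.search(rss_text)
    return (
        float(temp.group(1)) if temp else None,
        float(hum.group(1)) if hum else None,
    )


print(extract_conditions(SAMPLE_RSS))
```

Returning None for a missing field, rather than raising, lets an hourly job log the gap and carry on if the feed changes shape.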

so use python to pull that out, and append to a nice tab-delimited file, and then in your R process just read from that file.
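
The "append to a tab-delimited file" step could look something like the following. This is only a sketch: the column order (timestamp, station, temperature, humidity) and the function name append_reading are my own choices, not anything the site or R requires.

```python
import csv
from datetime import datetime, timezone


def append_reading(path, station, temperature_c, humidity_pct, when=None):
    """Append one reading as a tab-separated row.

    Columns (an assumed layout): ISO timestamp, station id, temperature, humidity.
    """
    when = when or datetime.now(timezone.utc)
    # Mode "a" creates the file on first use and appends afterwards.
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([when.isoformat(), station, temperature_c, humidity_pct])
```

A call such as append_reading("readings.tsv", "16239", 21.5, 64.0) adds one row per run, and on the R side read.delim("readings.tsv", header = FALSE) should read the file back, since read.delim uses tab as its default separator.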

on 06/05/2008 04:45 PM Nutter, Benjamin said the following:
> I've tried to tackle a similar question at the request of a coworker.
> Unfortunately, it is difficult to read in HTML code because it lacks
> a character that can consistently be used as a delimiter. The only
> guideline I can offer is that any text you're interested in is going to
> be between a ">" and a "<". So the goal is to eliminate anything
> between < and >.
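
As an illustration of "eliminate anything between < and >", here is a deliberately naive sketch in Python. It assumes simple markup: a ">" inside an attribute value, or the contents of script and style blocks, will trip it up. The sample string and the name strip_tags are made up for the example.

```python
import html
import re

# Matches one tag: "<", then anything that is not ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Replace each <...> run with a space, decode entities, collapse whitespace."""
    text = TAG_RE.sub(" ", markup)
    text = html.unescape(text)      # e.g. &amp; becomes &
    return " ".join(text.split())   # squeeze runs of whitespace to one space


sample = "<p>Temperature: <b>21.5</b> C</p><p>Humidity: <b>64%</b></p>"
print(strip_tags(sample))
```

Replacing tags with a space rather than an empty string keeps the text of adjacent cells or paragraphs from running together.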
>
> What's more, if you really want to read in HTML code, you'll need a good
> grasp on HTML itself, and some familiarity with how the code you're
> reading in is structured. For instance, I'm attaching code that I wrote
> to read in HTML tables that were generated by other functions commonly
> used in my work place. But my code assumes that the tables are written
> by row (using the <tr> tag).
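
The attached code is not reproduced in this archive, so the following is not that code. It is a separate sketch, in Python with the standard library's html.parser, of the same assumption: that every table row is wrapped in <tr> and every cell in <td> or <th>. The class name, function name, and sample table are invented for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text row by row, assuming each row is wrapped in <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the current row, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def read_table(markup):
    """Return the table contents as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    return parser.rows


sample = """<table>
<tr><th>Hour</th><th>Temp (C)</th><th>Humidity (%)</th></tr>
<tr><td>14:00</td><td>21.5</td><td>64</td></tr>
<tr><td>15:00</td><td>22.0</td><td>61</td></tr>
</table>"""
print(read_table(sample))
```

Nested tables or cells without closing tags would need more care than this sketch takes.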
>
> Essentially, after studying the code I was going to read in, I hand
> picked the markers that I could use to isolate the text I wanted. I
> then proceeded to play a game of Simon Says to break down the code to
> smaller and smaller pieces until I got what I wanted.
>
> Unless you're going to be doing this a lot, I wouldn't recommend taking
> the time to try and write a function like this. In most cases it's
> probably faster just to copy the data by hand. But if you are
> determined to make it work, I hope the ideas help.
>
> Benjamin
>
> -----Original Message-----
> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org]
> On Behalf Of vittorio
> Sent: Wednesday, June 04, 2008 3:50 PM
> To: r-help_at_stat.math.ethz.ch
> Subject: [Possible SPAM] [R] Reading selected lines in an .html file
>
> Dear friend,
>
> In an R program running permanently on a server I would like to read,
> hour by hour, the temperature in °C and the humidity from a site like
> this (actually, from many such sites):
>
> http://www.wunderground.com/global/stations/16239.html
>
> How can I read the content of the site and select the info I need?
>
> Ciao
> Vittorio
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Received on Thu 05 Jun 2008 - 22:22:37 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 05 Jun 2008 - 23:30:51 GMT.
