[R] Extracting a data.frame from HTML code

From: Ethan Pew <ethanpew+rlist_at_gmail.com>
Date: Sat, 12 Apr 2008 16:47:18 -0600

Dear all,

I'd like to use R to read in data from the web. I need some help finding an efficient way to strip the HTML tags and reformat the data as a data.frame to analyze in R.

I'm currently using readLines() to read in the HTML code and then grep() to isolate the block of HTML code I want from each page, but this may not be the best approach.

A short example:
x1 <- readLines("

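# Line indices of the opening and closing table tags
grep1 <- grep("<table", x1, value = FALSE)
grep2 <- grep("</table>", x1, value = FALSE)

# Keep the first table only; grep() may return several matches
block1 <- x1[grep1[1]:grep2[1]]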
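
If a page holds more than one table, grep() returns several indices and a single start:end range no longer describes the block. A minimal sketch of pairing each opening tag with its closing tag, assuming each tag sits on its own line and the tables are not nested (the `page` vector is made-up sample input, not a real page):

```r
# Made-up sample input standing in for the output of readLines()
page <- c("<p>intro</p>",
          "<table>", "<tr><td>a</td></tr>", "</table>",
          "<p>middle</p>",
          "<table>", "<tr><td>b</td></tr>", "</table>")

# Line indices of every opening and closing table tag
starts <- grep("<table", page, fixed = TRUE)
ends   <- grep("</table>", page, fixed = TRUE)

# One character vector per table; assumes starts and ends pair up in order
blocks <- Map(function(s, e) page[s:e], starts, ends)
length(blocks)
```

Each element of `blocks` can then be processed separately, which avoids relying on the first match only.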

It seems like there should be a straightforward solution to extract a data.frame from the HTML code (especially since the data is already formatted as a table) but I haven't had any luck in my searches so far. Ultimately I'd like to compile several datasets from multiple webpages and websites, and I'm optimistic that I can use R to automate the process. If someone could point me in the right direction, that would be fantastic.
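
To make the goal concrete, here is a rough base-R sketch of the kind of conversion I have in mind. It assumes a simple, well-formed table whose first row holds the column names. The sample lines and the helper name `html_table_to_df` are made up for illustration and do not come from any package; real pages with nested tables, row spans, or entities would need a proper HTML parser.

```r
# Made-up sample input standing in for the lines returned by readLines()
x1 <- c(
  "<html><body>",
  "<table>",
  "<tr><th>id</th><th>score</th></tr>",
  "<tr><td>1</td><td>3.5</td></tr>",
  "<tr><td>2</td><td>4.0</td></tr>",
  "</table>",
  "</body></html>"
)

# Hypothetical helper (not from any package): regex-based, simple tables only
html_table_to_df <- function(lines) {
  html <- paste(lines, collapse = " ")
  # Pull out each <tr>...</tr> row (non-greedy, case-insensitive)
  rows <- regmatches(html, gregexpr("(?i)<tr[^>]*>.*?</tr>", html, perl = TRUE))[[1]]
  # Within each row, pull out the <th>/<td> cells and strip the tags
  cells <- lapply(rows, function(r) {
    m <- regmatches(r, gregexpr("(?i)<t[hd][^>]*>.*?</t[hd]>", r, perl = TRUE))[[1]]
    trimws(gsub("<[^>]+>", "", m))
  })
  # Treat the first row as the header and the remaining rows as data
  body <- do.call(rbind, cells[-1])
  df <- as.data.frame(body, stringsAsFactors = FALSE)
  names(df) <- cells[[1]]
  # Convert each column to a numeric or integer type where possible
  df[] <- lapply(df, type.convert, as.is = TRUE)
  df
}

df1 <- html_table_to_df(x1)
str(df1)
```

For several pages, the same helper could be applied to each page with lapply() and the results combined with do.call(rbind, ...), provided the tables share the same columns.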

Many thanks in advance,

Ethan Pew
Doctoral Candidate, Marketing
Leeds School of Business
University of Colorado at Boulder

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Sat 12 Apr 2008 - 22:51:50 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 13 Apr 2008 - 00:30:27 GMT.
