Re: [R] Developing a web crawler / R "webkit" or something similar?

From: Mike Marchywka <marchywka_at_hotmail.com>
Date: Thu, 03 Mar 2011 09:07:19 -0500

> Date: Thu, 3 Mar 2011 01:22:44 -0800
> From: antujsrv_at_gmail.com
> To: r-help_at_r-project.org
> Subject: [R] Developing a web crawler
>
> Hi,
>
> I wish to develop a web crawler in R. I have been using the functionalities
> available under the RCurl package.
> I am able to extract the html content of the site but i don't know how to go

In general this can be a big effort, but there may be things in text-processing packages you could adapt to parse HTML and execute JavaScript. What I would look for, though, is something like a "webkit" package, or another open-source browser with or without an R interface. This may actually be an ideal solution for a lot of things, since you get all the content handlers of at least one browser.

Now that you mention it, I wonder if there are browser plugins to handle R content. (I'd have to give this some thought: put a script up as a web page with MIME type "text/R" and have the browser execute it in R.)

> about analyzing the html formatted document.
> I wish to know the frequency of a word in the document. I am only acquainted
> with analyzing data sets.
> So how should i go about analyzing data that is not available in table
> format.
>
> Few chunks of code that i wrote:
> w <-
> getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
> write.table(w,"test.txt")
> t <- readLines(w)
>
> readLines also didnt prove out to be of any help.
>
> Any help would be highly appreciated. Thanks in advance.
>
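
For the word-frequency part, here is a minimal sketch in base R. It assumes that crude regex tag stripping is good enough for a first look; real pages usually need a proper HTML parser. word_freq() is a hypothetical helper name of my own, not something from RCurl. Note that getURL() already returns the page as a character string, so readLines(w) ends up treating the page contents as a file name; the string can be processed directly instead.

```r
# Crude word-frequency count for HTML text, base R only.
# word_freq() is a hypothetical helper, not part of RCurl or any package.
word_freq <- function(html) {
  # Accept either one string (as from getURL) or a vector of lines.
  text <- paste(html, collapse = " ")
  # Drop <script> and <style> blocks first, then any remaining tags.
  text <- gsub("(?is)<(script|style)\\b.*?</\\1>", " ", text, perl = TRUE)
  text <- gsub("<[^>]+>", " ", text)
  # Lower-case, then split on anything that is not a letter or apostrophe.
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  # Named counts, most frequent first.
  sort(table(words), decreasing = TRUE)
}

# Small inline page so the example runs without network access.
# With RCurl one would instead use: w <- getURL(url); freq <- word_freq(w)
html <- "<html><body><p>The Kindle is light. The screen is sharp.</p></body></html>"
freq <- word_freq(html)
print(head(freq, 5))
```

To keep a copy of the page on disk, writeLines(w, "test.txt") is probably a better fit than write.table(), which is meant for tabular data.
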
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Developing-a-web-crawler-tp3332993p3332993.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
                                               



Received on Thu 03 Mar 2011 - 15:16:36 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.