Re: [R] Developing a web crawler

From: <rex.dwyer_at_syngenta.com>
Date: Thu, 03 Mar 2011 08:58:07 -0500

Perl seems like a 10x better choice for the task, but try looking at the examples in ?strsplit to get started.
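
A minimal sketch of the strsplit approach, not part of the original reply. It assumes the page is already held in a single character string, which is what getURL() returns. Note that readLines() expects a file name or connection, not the page content itself, so the string can be split directly. The helper name word_freq and the sample string are hypothetical, and the regular expression is only a crude way to drop markup; a real HTML parser would be more robust.

```r
# Hypothetical helper: count word frequencies in an HTML string.
word_freq <- function(html) {
  # Crudely drop anything that looks like a tag; a regex is not a full HTML parser.
  text <- gsub("<[^>]*>", " ", html)
  # Lower-case, then split on runs of non-letter characters.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Drop the empty string left behind by a leading separator.
  words <- words[nzchar(words)]
  # Tabulate and sort from most to least frequent.
  sort(table(words), decreasing = TRUE)
}

# Stand-in for the string returned by getURL(); not real page content.
html <- "<html><body><p>Kindle is light. The Kindle screen is sharp.</p></body></html>"
freq <- word_freq(html)
freq["kindle"]  # frequency of a single word
```

Indexing the resulting table by name gives the count for one word, and head(freq) lists the most frequent ones.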

-----Original Message-----
From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of antujsrv
Sent: Thursday, March 03, 2011 4:23 AM
To: r-help_at_r-project.org
Subject: [R] Developing a web crawler

Hi,

I wish to develop a web crawler in R. I have been using the functionalities available under the RCurl package.
I am able to extract the HTML content of the site, but I don't know how to go about analyzing the HTML-formatted document. I wish to know the frequency of a word in the document. I am only acquainted with analyzing data sets.
So how should I go about analyzing data that is not available in table format?

A few chunks of code that I wrote:
w <- getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
write.table(w,"test.txt")
t <- readLines(w)

readLines also didn't prove to be of any help.

Any help would be highly appreciated. Thanks in advance.

--
View this message in context: http://r.789695.n4.nabble.com/Developing-a-web-crawler-tp3332993p3332993.html
Sent from the R help mailing list archive at Nabble.com.


______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

message may contain confidential information. If you are not the designated recipient, please notify the sender immediately, and delete the original and any copies. Any use of the message by you is prohibited.
Received on Thu 03 Mar 2011 - 14:06:54 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 03 Mar 2011 - 14:30:18 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.
