Re: [R] Developing a web crawler / R "webkit" or something similar? [off topic]

From: Matt Shotwell <Matt.Shotwell_at_vanderbilt.edu>
Date: Thu, 03 Mar 2011 13:04:11 -0600

On 03/03/2011 08:07 AM, Mike Marchywka wrote:
>
>> Date: Thu, 3 Mar 2011 01:22:44 -0800
>> From: antujsrv_at_gmail.com
>> To: r-help_at_r-project.org
>> Subject: [R] Developing a web crawler
>>
>> Hi,
>>
>> I wish to develop a web crawler in R. I have been using the functionality
>> available in the RCurl package.
>> I am able to extract the HTML content of the site, but I don't know how to go
>
> In general this can be a big effort, but there may be things in
> text-processing packages you could adapt to handle HTML and JavaScript.
> However, I guess what I'd be looking for is something like a "webkit"
> package, or another open-source browser with or without an "R" interface.
> This actually may be an ideal solution for a lot of things, as you get
> all the content handlers of at least some browser.
>
> Now that you mention it, I wonder if there are browser plugins to handle
> "R" content. (I'd have to give this some thought: put a script up as
> a web page with mime type "text/R" and have it execute in R.)

There are server-side solutions for this sort of thing; see http://rapache.net/. There was also a thread on R-devel some years ago addressing the MIME type issue, beginning here: http://tolstoy.newcastle.edu.au/R/devel/05/11/3054.html. I don't know whether it was ever resolved; among the suggestions were text/x-R, text/x-Rd, and application/x-RData.
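As for the word-frequency part of the original question, here is a minimal sketch using the RCurl and XML packages (the URL below is illustrative, not the Amazon page from the original post):

```r
library(RCurl)
library(XML)

# Fetch the page as a single character string
html <- getURL("http://www.r-project.org/", followlocation = TRUE)

# Parse the HTML and pull out the visible text nodes
doc  <- htmlParse(html, asText = TRUE)
txt  <- xpathSApply(doc, "//body//text()", xmlValue)

# Split on non-letters, drop empties, and tabulate word frequencies
words <- tolower(unlist(strsplit(txt, "[^[:alpha:]]+")))
words <- words[nchar(words) > 0]
freq  <- sort(table(words), decreasing = TRUE)
head(freq)
```

Note that getURL() returns the whole page as one character string, which is why calling readLines() on that value fails; it is not a file or connection. Parsing the string with htmlParse(..., asText = TRUE) is the more direct route.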

-Matt

>
>> about analyzing the HTML-formatted document.
>> I wish to know the frequency of a word in the document. I am only acquainted
>> with analyzing data sets, so how should I go about analyzing data that is not
>> available in table format?
>>
>> A few chunks of code that I wrote:
>> w<-
>> getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
>> write.table(w,"test.txt")
>> t<- readLines(w)
>>
>> readLines also didn't prove to be of any help.
>>
>> Any help would be highly appreciated. Thanks in advance.
>>
>>
>> --
>> View this message in context: http://r.789695.n4.nabble.com/Developing-a-web-crawler-tp3332993p3332993.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Matthew S Shotwell   Assistant Professor           School of Medicine
                      Department of Biostatistics   Vanderbilt University

Received on Thu 03 Mar 2011 - 22:35:45 GMT
