Re: [R] Rapache ( was Developing a web crawler )

From: Matt Shotwell <matt_at_biostatmatt.com>
Date: Sun, 06 Mar 2011 13:51:53 -0500

On Sun, 2011-03-06 at 08:06 -0500, Mike Marchywka wrote:
>
> ----------------------------------------
> > Date: Thu, 3 Mar 2011 13:04:11 -0600
> > From: Matt.Shotwell_at_vanderbilt.edu
> > To: r-help_at_r-project.org
> > Subject: Re: [R] Developing a web crawler / R "webkit" or something similar? [off topic]
> >
> > On 03/03/2011 08:07 AM, Mike Marchywka wrote:
> > >
> > >> Date: Thu, 3 Mar 2011 01:22:44 -0800
> > >> From: antujsrv_at_gmail.com
> > >> To: r-help_at_r-project.org
> > >> Subject: [R] Developing a web crawler
> > >>
> > >> Hi,
> > >>
> > >> I wish to develop a web crawler in R. I have been using the functionalities
> > >> available under the RCurl package.
> > >> I am able to extract the html content of the site but i don't know how to go
> > >
> > > In general this can be a big effort but there may be things in
> > > text processing packages you could adapt to execute html and javascript.
> > > However, I guess what I'd be looking for is something like a "webkit"
> > > package or other open source browser with or without an "R" interface.
> > > This actually may be an ideal solution for a lot of things as you get
> > > all the content handlers of at least some browser.
> > >
> > >
> > > Now that you mention it, I wonder if there are browser plugins to handle
> > > "R" content ( I'd have to give this some thought, put a script up as
> > > a web page with mime type "text/R" and have it execute it in R. )
> >
> > There are server-side solutions for this sort of thing. See
> > http://rapache.net/ . Also, there was a string of messages on R-devel
> > some years ago addressing the mime type issue; beginning here:
> > http://tolstoy.newcastle.edu.au/R/devel/05/11/3054.html . Though I don't
> > know whether there was a resolution. Some suggestions were text/x-R,
> > text/x-Rd, application/x-RData.
> >
> The rapache demo looks like something I could use right away,
> but I haven't looked into the handlers yet. I have installed rapache now
> on my Debian system (still have config issues, but I did get apache2 to restart LOL).
> Before I plow into this too far, how would this compare/compete with something
> like a PHP library for Rserve? That is the approach I had been pursuing.
>
> Thanks.

Hi Mike,

If you've built and configured RApache, then the difficult "plowing" is over :). RApache operates at the application (HTTP) layer of the OSI model, whereas Rserve speaks its own protocol directly over TCP/IP sockets, nearer the transport layer. Hence, the scope of Rserve applications is far more general, but extending Rserve to operate at the HTTP layer (via PHP) will mean more work.

RApache offers high-level functionality; for example, it can replace PHP with R in web pages. No interface code is necessary. Here's a simple "What's The Time?" webpage using RApache and yarr [1] to handle the code:

<< setContentType("text/html\n\n") >>
<html>
<head><title>What's The Time?</title></head>
<body><pre><</= cat(format(Sys.time(), usetz=TRUE)) >></pre></body>
</html>

Here's a live version: [2]. Interfacing PHP with Rserve in this context would be useful if installation of R and/or RApache on the web host were prohibited. A PHP/Rserve framework might also be useful in other contexts, for example, to extend PHP applications (e.g. WordPress, MediaWiki).
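
To make concrete what the template above boils down to, here is a minimal plain-R sketch that builds the same page as a string. It runs without Apache, RApache, or yarr. Note that render_time_page is a hypothetical helper name used only for illustration; it is not part of either package, and the real template is evaluated by yarr rather than by code like this.

```r
# Hypothetical helper (not part of RApache or yarr): assemble the
# "What's The Time?" page as one string using base R only.
render_time_page <- function(now = Sys.time()) {
  paste0(
    "<html>\n",
    "<head><title>What's The Time?</title></head>\n",
    # Same expression as in the template: the time with its zone label.
    "<body><pre>", format(now, usetz = TRUE), "</pre></body>\n",
    "</html>\n"
  )
}

# Print the page to stdout; under RApache the output would go to the
# HTTP response instead, after the content type has been set.
cat(render_time_page())
```

The only dynamic piece is the format(Sys.time(), usetz=TRUE) call; everything else is static HTML, which is why a template is a natural fit here.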

Best,
Matt

[1] http://biostatmatt.com/archives/1000
[2] http://biostatmatt.com/yarr/time.yarr

>
> > -Matt
> >
> > >
>
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Received on Sun 06 Mar 2011 - 18:56:08 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 23 Mar 2011 - 00:20:24 GMT.
