Re: [R] Scrap java scripts and styles from an html document

From: Mike Marchywka <marchywka_at_hotmail.com>
Date: Thu, 07 Apr 2011 08:18:52 -0400



> Date: Thu, 7 Apr 2011 04:15:50 -0700
> From: antujsrv@gmail.com
> To: r-help_at_r-project.org
> Subject: Re: [R] Scrap java scripts and styles from an html document
>
> Hi ,
>
> I am working on developing a web crawler.

Questions like this come up on the list every few weeks, and I keep suggesting that someone (other than me, of course, LOL) investigate an R interface to WebKit for any effort that needs to mimic large parts of a browser's function. Perhaps just make a debug or custom build of WebKit that dumps whatever you want into a structured text file. (I've actually done this for what would amount to a crawler: I modified maybe one or two classes to write the links being fetched to stdout, and I think there are ways to dump a DOM or other data in a format usable by R.)

For valid pages you can just parse the HTML as XML and get what you want in this case, but usually people are looking for information that only becomes apparent after large pieces of JS have executed. If you want comments only, these may be easy to isolate yourself. If you google "CRAN HTML parser", some hits do come up, for example

http://cran.r-project.org/web/packages/scrapeR/scrapeR.pdf

http://r.789695.n4.nabble.com/How-to-import-HTML-and-SQL-files-td879480.html
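
For the static case (no JS needs to run), the cleaning step can be sketched roughly as below. This is only an illustration of the parse-and-filter idea, written with Python's standard-library parser to keep it self-contained rather than with any of the R packages above; the names Cleaner and strip_scripts_and_styles are made up for the example. It drops <script> and <style> elements and HTML comments and re-emits everything else.

```python
from html import escape
from html.parser import HTMLParser


class Cleaner(HTMLParser):
    """Re-emit markup, dropping <script>/<style> elements and comments."""

    SKIP = {"script", "style"}  # elements whose content is discarded

    def __init__(self):
        super().__init__()  # character references are decoded by default
        self.out = []
        self.depth = 0  # > 0 while inside a skipped element

    def handle_decl(self, decl):
        # Keep declarations such as <!DOCTYPE html>.
        self.out.append(f"<!{decl}>")

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
        elif self.depth == 0:
            # Reuse the original tag text so attributes survive untouched.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>.
        if self.depth == 0 and tag not in self.SKIP:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            # Text arrives decoded, so re-escape &, < and > on output.
            self.out.append(escape(data, quote=False))

    # handle_comment is left at its default (do nothing), so comments vanish.


def strip_scripts_and_styles(html_text):
    cleaner = Cleaner()
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.out)
```

Note that this only removes what is literally in the fetched page. Anything a script would have inserted at run time is simply absent, which is the case where a real browser engine is needed.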

> Removing javascripts and styles is a part of the cleaning of the html
> document.
> What I want is a cleaned html document with only the html tags and textual
> information,
> so that I can figure out the pattern of the web page. This is being done to
> extract relevant
> information from the webpage like comments for a particular product.
>
> For e.g the amazon.com has all such comments within the
> and tags,
> with regular
> occurring for breaks. So tags which appear the most help us in
> locating the required information. Different websites have different
> patterns,
> but it's more likely that tags that will occur the most will have the
> relevant information enclosed in them.
>
> So, once the html page is cleaned, it would be easy to roll up the tags and
> knowing their frequency of occurrence, we can target the information.
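
The tag roll-up described above could look something like the following. Again this is only a sketch in Python's standard library, and TagCounter and tag_frequencies are names invented for the example. It counts how often each start tag occurs and lists the tags from most to least frequent.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag is opened in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> are routed here as well.
        self.counts[tag] += 1


def tag_frequencies(html_text):
    """Return (tag, count) pairs, most frequent first."""
    counter = TagCounter()
    counter.feed(html_text)
    counter.close()
    return counter.counts.most_common()
```

Counting the bare tag name is the crudest version of this. Whether the most frequent tag really encloses the product comments will depend on the site, so treat the ranking as a starting point for inspection rather than an answer.
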
>
> Should there be any suggestions to help, please let me know. I would be more
> than pleased.
>
> Regards,
> Antuj
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Scrap-java-scripts-and-styles-from-an-html-document-tp3413894p3433052.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
                                               



Received on Thu 07 Apr 2011 - 15:23:34 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 07 Apr 2011 - 15:30:27 GMT.
