Re: [R] Scrap java scripts and styles from an html document

From: antujsrv <antujsrv_at_gmail.com>
Date: Thu, 07 Apr 2011 04:15:50 -0700 (PDT)

Hi,

I am developing a web crawler. Removing JavaScript and styles is part of cleaning the HTML document.
What I want is a cleaned HTML document containing only the HTML tags and textual information,
so that I can figure out the pattern of the web page. The goal is to extract relevant
information from the page, such as the comments for a particular product.
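
A minimal sketch of the cleaning step described above, written in Python with only the standard-library html.parser module rather than an R package (that choice is an assumption, not something from this thread). The names ScriptStyleStripper and strip_scripts_and_styles are hypothetical. The sketch re-emits tags and text as they are parsed and skips everything inside script and style elements; comments and the doctype are dropped as a side effect.

```python
from html.parser import HTMLParser


class ScriptStyleStripper(HTMLParser):
    """Re-emit an HTML document while dropping <script> and <style> elements.

    Hypothetical helper for illustration. Comments and the doctype are not
    re-emitted because their handlers are left at the no-op defaults.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        # Keep entities such as &amp; unconverted so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # get_starttag_text() returns the start tag exactly as written.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; a self-closed script has no body to skip.
        if self.skip_depth == 0 and tag not in self.SKIP:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.parts.append(f"&#{name};")


def strip_scripts_and_styles(html):
    """Return the document text without script and style elements."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style>"
        "<script>var a = 1;</script></head>"
        "<body><p>Nice &amp; cheap<br>ok</p></body></html>"
    )
    print(strip_scripts_and_styles(page))
```

Real pages are often malformed, so a tolerant parser that builds a full document tree may behave better than this event-based sketch on difficult input.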

For example, amazon.com has all such comments within the  and tags, with regular  occurring for breaks. So the tags that appear most often help us locate the required information. Different websites have different patterns,
but it is more likely that the tags that occur the most will have the relevant information enclosed in them.

So, once the HTML page is cleaned, it would be easy to roll up the tags and, knowing their frequency of occurrence, target the information.
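
The tag roll-up described above could be sketched as follows, again in Python with the standard library only (an assumption; the same count could be produced with other tools). TagCounter and tag_frequencies are hypothetical names. The sketch counts each opening tag and returns the tags ordered from most to least frequent.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag name opens in a document (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased. Self-closing tags such as <br/> also
        # reach this method through the default handle_startendtag.
        self.counts[tag] += 1


def tag_frequencies(html):
    """Return (tag, count) pairs sorted from most to least frequent."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.most_common()


if __name__ == "__main__":
    cleaned = "<div><p>a<br>b<br/>c<br>d</p><p>e</p></div>"
    for tag, count in tag_frequencies(cleaned):
        print(tag, count)
```

Counting tag names alone can be too coarse on pages where the same tag is used for unrelated content, so counting tag and class attribute pairs may be a useful refinement.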

Should there be any suggestions to help, please let me know. I would be more than pleased.

Regards,
Antuj

--
View this message in context: http://r.789695.n4.nabble.com/Scrap-java-scripts-and-styles-from-an-html-document-tp3413894p3433052.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 07 Apr 2011 - 11:48:01 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 07 Apr 2011 - 15:30:27 GMT.
