[R] newbie xml parsing question

From: eric <ericstrom_at_aol.com>
Date: Sat, 28 May 2011 14:02:30 -0700 (PDT)


I am trying to read some data off the Zillow site. I am new to XML, HTML, parsing, and the XML package. I've been able to load the web page I'm interested in with the following code, but I'm not sure of the next step to get the information I want into R:

library(XML)
url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
doc <- htmlTreeParse(url, isURL=TRUE)
doc

I'd like to be able to pull the following information into R

href home details string :

/homedetails/236-Arundel-Ave-Horsham-PA-19044/9933810_zpid/#{scid=hdp-site-map-bubble-address}

value for Zestimate / Price: $239,000

Beds: 3
Baths: 1.0
Sqft: 1,630

I noticed that all of this information is in "doc". The section of doc that contains it is shown below. How do I go about extracting this information and getting it into R in the general case, where the address in url will change?

LatLong.createFromDegrees(40.187567, -75.125861), "<div class=\"map-bubble property-bubble\"> <div class=\"search-result\">
<div class=\"plisting\"> <div id=\"bubble-photoex-up\" class=\"photoex
hide\"> <div class=\"photoex-photos\"> </div> <div class=\"mapsViews hide\">
</div> </div> <div id=\"property-zpid\" class=\"hide\">9933810</div> <div
id=\"property-home-info\"> <div id=\"pinfo-block\" class=\"property-info\">
<div class=\"adr\">

\"/homedetails/236-Arundel-Ave-Horsham-PA-19044/9933810_zpid/#{scid=hdp-site-map-bubble-address}\" 236 Arundel Ave, Horsham, PA </div> <ul class=\"value-info\"> <li class=\"type-allHomes\"> &nbsp; Zestimate<sup>&reg;</sup>: $239,000 \"#\"
<div id=\"zest-tip-bubble_toggleArea\" class=\"tooltip hide\"> Close <dl>
<dt>Zestimate</dt> <dd> A <strong>Zestimate&reg;</strong> home valuation is
Zillow's estimated market value. It is not an appraisal. Use it as a starting point to determine a home's value. &lt;a href=\&quot;/wikipages/What-is-a-Zestimate/\&quot; href=\&quot;#\&quot;&gt;Learn more </dd> </dl> </div> </li> <li class=\"secondary monthly-payment\"> Mortgage payment: $963/mo <ul class=\"carrot view-rates-aftertext\"> <li> \"/mortgage-rates/#{scid=mor-site-mapbubrates}\" See rates </li></ul> </li>
</ul> <ul class=\"attributes\"> <li class=\"prop-cola\">Beds: 3<br /> Baths:
1.0</li> <li class=\"prop-colb\">Sqft: 1,630<br /> Lot: 21,745</li> </ul>
</div> <ul class=\"has-photo actions clearfix\"> <li class=\"hinfo ztsa\">
\"/homedetails/236-Arundel-Ave-Horsham-PA-19044/9933810_zpid/#{scid=hdp-site-map-bubble-details}\" Details </li> <li class=\"mapHome ztsa\" zpid=\"9933810\"> \"#\" Views
</li> <li class=\"faves ztsa\"> &lt;a onclick=\&quot;trackLink(this, 'Save',
{ 'events': 'event18', 'eVar4': 'Map Bubble' }); return favoriteManager.addFavorite(9933810, favoriteManager.doneSaving(this), event, true);\&quot; class=\&quot;not-saved\&quot; rel=\&quot;nofollow\&quot;&gt;Save </li> </ul> </div> Close <div id=\"bubble-photoex-down\" class=\"photoex hide\"> <div

class=\"photoex-photos\"> </div>	<div class=\"mapsViews hide\"> </div>

</div> </div> </div> <div class=\"bubble-beak\">&nbsp;</div></div>"
)
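
One possible direction, offered only as a rough sketch: the section above sits inside a JavaScript string rather than in ordinary HTML nodes, so it may be simpler to treat it as plain text and pick the fields out with regular expressions in base R. The function name extract_bubble, the sample string, and the patterns below are assumptions made for illustration; they are written against the snippet shown above and would need adjusting if the page layout differs.

```r
# Hypothetical helper (not part of the XML package): pull the fields of
# interest out of the map-bubble text with base R regular expressions.
# `bubble` is assumed to be a single character string containing the snippet.
extract_bubble <- function(bubble) {
  # Return the first capture group of `pattern`, or NA if nothing matches.
  grab <- function(pattern) {
    m <- regmatches(bubble, regexec(pattern, bubble))[[1]]
    if (length(m) >= 2) m[2] else NA_character_
  }
  # Drop "$" and thousands separators before converting to numeric.
  to_num <- function(x) as.numeric(gsub("[$,]", "", x))

  data.frame(
    # Everything from /homedetails/ up to the next quote, backslash or space.
    href      = grab("(/homedetails/[^\"\\\\ ]+)"),
    zestimate = to_num(grab("Zestimate[^$]*\\$([0-9,]+)")),
    beds      = to_num(grab("Beds:[[:space:]]*([0-9.]+)")),
    baths     = to_num(grab("Baths:[[:space:]]*([0-9.]+)")),
    sqft      = to_num(grab("Sqft:[[:space:]]*([0-9,]+)")),
    stringsAsFactors = FALSE
  )
}

# Shortened copy of the snippet above, used as sample input.
bubble <- paste(
  '<div class=\\"adr\\">',
  '\\"/homedetails/236-Arundel-Ave-Horsham-PA-19044/9933810_zpid/#{scid=hdp-site-map-bubble-address}\\"',
  '236 Arundel Ave, Horsham, PA </div>',
  'Zestimate<sup>&reg;</sup>: $239,000',
  '<li class=\\"prop-cola\\">Beds: 3<br /> Baths:',
  '1.0</li> <li class=\\"prop-colb\\">Sqft: 1,630<br /> Lot: 21,745</li>',
  sep = "\n"
)

print(extract_bubble(bubble))
```

For the general case, the same function could presumably be applied to the raw page text for each address (for example the lines of the page pasted into one string), on the assumption that the page still embeds the bubble markup in this form.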
--
View this message in context: http://r.789695.n4.nabble.com/newbie-xml-parsing-question-tp3558067p3558067.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sun 29 May 2011 - 06:39:00 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 31 May 2011 - 15:30:11 GMT.