Re: [Rd] Retrieving data from aspx pages

From: Paul Gilbert <pgilbert902_at_gmail.com>
Date: Wed, 31 Oct 2012 13:45:55 -0400

I must be really dense. I know RCurl provides a POST capability, but I don't see how this allows "interaction". Suppose the example actually worked, which it does not. (Unfortunately, many of the examples in RCurl seem to be marked \dontrun{} or disabled with some if() condition.) When you post to a page like this, you will often get something back that has a dynamically generated URI, and you will need to post more information to that page. But how do you find out the URI of that next dynamically generated page? Even when you know what you need to post, you need the URI to do it. If RCurl provided interaction, you would be able to get the URI so you could post to the next page. Maybe you can do that, but I have not discovered how. If you know how, I would appreciate a real working example.

Paul
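
One possible answer to the "where is the next URI" question, as a rough sketch only: in plain HTML the target of the next POST is the action attribute of the <form> in the page that came back, interpreted relative to the URL that was just fetched. The sketch below assumes the XML package is installed and uses a made-up response body; the "?step=2" query string and the helper names form_actions() and resolve_action() are hypothetical, and the resolver only handles absolute URLs and paths relative to the page's directory.

```r
library(XML)  # htmlParse(), xpathSApply(), xmlGetAttr()

# Hypothetical response body; in practice this would be the text returned by
# RCurl::postForm() or RCurl::getURL().
html <- '<html><body>
  <form method="post" action="psdResult.aspx?step=2">
    <input type="hidden" name="__VIEWSTATE" value="abc123" />
  </form>
</body></html>'

# Return the action attribute of every <form> in the page (NA if missing).
form_actions <- function(page_text) {
  doc <- htmlParse(page_text, asText = TRUE)
  as.character(xpathSApply(doc, "//form", xmlGetAttr, "action", NA_character_))
}

# Simplified resolver: keep absolute URLs, otherwise append the action to the
# directory part of the URL of the page that was fetched.
resolve_action <- function(action, page_url) {
  if (grepl("^https?://", action)) return(action)
  paste0(sub("[^/]*$", "", page_url), action)
}

page_url <- "http://www.fas.usda.gov/psdonline/psdResult.aspx"
next_uri <- resolve_action(form_actions(html)[1], page_url)
print(next_uri)
```

Whether this is enough for a given site depends on the site: pages that build the next URI in JavaScript, or that rely on cookies set by an earlier request, would need more than this.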

On 12-10-31 12:14 PM, jose ramon mazaira wrote:
> I'd like to point out that I've discovered that package RCurl
> already provides a utility that allows interaction via POST requests
> with servers. In fact, the FAQ for RCurl contains specifically an
> example with an aspx page:
>
> x = postForm("http://www.fas.usda.gov/psdonline/psdResult.aspx",
> style = "post",
> .params = list(visited="1",
> lstGroup = "all",
> lstCommodity="2631000",
> lstAttribute="88",
> lstCountry="**",
> lstDate="2011",
> lstColumn="Year",
> lstOrder="Commodity%2FAttribute%2FCountry"))
>
> Check this link: http://www.omegahat.org/RCurl
> However, I think it would be more useful to automate the interaction
> with servers by automatically retrieving the name-value pairs required
> by the server (by parsing the page source code), instead of examining
> the appropriate fields in each web page by hand.
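
The idea of pulling the name-value pairs out of the page source could look roughly like the following. This is a sketch under assumptions, not a tested recipe for any real site: it assumes the XML package is installed, the page source and the field name txtIssuer are invented for illustration, and only <input> elements are handled (<select> and <textarea> would need similar treatment). ASP.NET pages usually carry hidden fields such as __VIEWSTATE that have to be sent back unchanged, which is why the current values are collected along with the names.

```r
library(XML)  # htmlParse(), getNodeSet(), xmlGetAttr()

# Hypothetical page source standing in for a real ASPX form.
html <- '<html><body><form method="post" action="Search.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTA4" />
  <input type="text" name="txtIssuer" />
</form></body></html>'

# Collect the named <input> fields of all forms as a named character vector.
# Inputs without a value attribute get the empty string.
form_fields <- function(page_text) {
  doc <- htmlParse(page_text, asText = TRUE)
  inputs <- getNodeSet(doc, "//form//input[@name]")
  vals <- vapply(inputs, xmlGetAttr, character(1), "value", "")
  names(vals) <- vapply(inputs, xmlGetAttr, character(1), "name")
  vals
}

fields <- form_fields(html)
print(fields)

# The user-supplied values would then overwrite the defaults before posting,
# for example (not run here, since it needs a live server and a real URI):
# fields["txtIssuer"] <- "AT&T"
# x <- RCurl::postForm(uri, .params = as.list(fields), style = "post")
```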
>
> 2012/10/30, Paul Gilbert <pgilbert902_at_gmail.com>:
>> Jose
>>
>> As far as getting to the data, I think the best way to do this sort of
>> thing would be if the site supports a SOAP or REST interface. When they
>> don't (yet) then one is faced with clicking through some pages. Python
>> or Java is one way to automate the process of clicking through the
>> pages. I don't know how to do that in R, but would like to know if it is
>> possible.
>>
>> But, I guess I was confused about the part you want to improve. What I
>> have works fairly smoothly parsing and passing back JSON data, converted
>> from a csv file, into R. The downside is that this approach requires
>> more than R to be installed on the client machine. But if the object you
>> get back is ASPX, then you either need to parse it directly, or convert
>> it to JSON, or something else you can deal with. I suspect that will be
>> fairly specific to a particular web site, but I don't really know enough
>> about ASPX to be sure.
>>
>> Paul
>>
>> On 12-10-30 01:12 PM, jose ramon mazaira wrote:
>>> Thanks for your interest, Paul.
>>> I've checked the source code of TSjson and I've seen that what it does
>>> is to call a Python script to retrieve the data. In fact, I've already
>>> done this with Java using the URLConnection class and sending the
>>> requested values to fill the form.
>>> However, I think it would be more useful to open a connection with R
>>> and to send the requested values within R, and not through an external
>>> program.
>>> The application I've designed, like yours, is also page-specific
>>> (i.e., designed for
>>> http://cxa.gtm.idmanagedsolutions.com/finra/BondCenter/AdvancedScreener.aspx),
>>> but I think that our applications would be more powerful if they were
>>> able to parse the name-value pairs generated from ASPX (or of any
>>> other dynamically generated web page) and ask the user to select the
>>> appropriate values.
>>>
>>> 2012/10/30, Paul Gilbert <pgilbert902_at_gmail.com>:
>>>> I think RHTMLForms works if you have a single form, but I have not been
>>>> able to see how to use it when you need to go through a sequence of
>>>> dynamically generated forms (like you can do with Python mechanize).
>>>>
>>>> Paul
>>>>
>>>> On 12-10-30 09:08 AM, Gabriel Becker wrote:
>>>>> I haven't used it extensively myself, and can't speak to its current
>>>>> state, but on quick inspection RHTMLForms seems worth a look for what
>>>>> you want.
>>>>>
>>>>> http://www.omegahat.org/RHTMLForms/
>>>>>
>>>>> ~G
>>>>>
>>>>> On Tue, Oct 30, 2012 at 5:38 AM, Paul Gilbert
>>>>> <pgilbert902_at_gmail.com> wrote:
>>>>>
>>>>> I don't know of an easy way to do this in R. I've been doing
>>>>> something similar with Python scripts called from R. If anyone knows
>>>>> how to do this with just R, I would appreciate hearing too.
>>>>>
>>>>> Paul
>>>>>
>>>>>
>>>>> On 12-10-29 04:11 PM, jose ramon mazaira wrote:
>>>>>
>>>>> Hi. I'm trying to write an application to retrieve financial data
>>>>> (especially bond data) from FINRA. The web page is served
>>>>> dynamically from an asp.net application:
>>>>>
>>>>> http://cxa.gtm.idmanagedsolutions.com/finra/BondCenter/AdvancedScreener.aspx
>>>>>
>>>>> I'd like to know if it's possible to fill in the web page form
>>>>> dynamically from R and, after filling it in (with the issuer name),
>>>>> retrieve the web page, parse the data, and convert it to appropriate
>>>>> R objects. For example, suppose I want to search for data on AT&T
>>>>> bonds. I'd like to know if it's possible, within R, to fill in the
>>>>> page served from:
>>>>>
>>>>> http://cxa.gtm.idmanagedsolutions.com/finra/BondCenter/AdvancedScreener.aspx
>>>>>
>>>>> select the "corporate" option, fill in the "Issuer name" field with
>>>>> AT&T, ask the page to display the results, and retrieve the results
>>>>> for each of the bonds issued by AT&T (for example:
>>>>>
>>>>> http://cxa.gtm.idmanagedsolutions.com/finra/BondCenter/BondDetail.aspx?ID=MDAxOTU3Qko3)
>>>>>
>>>>> and parse the data from the web page.
>>>>>
>>>>> Thanks in advance.
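
For the last step in this request (parsing the retrieved page and converting it to R objects), a minimal sketch, assuming the XML package is installed and that the results arrive as an ordinary HTML table. The page source, column names, and values below are all invented for illustration and are not taken from the FINRA site.

```r
library(XML)  # htmlParse(), readHTMLTable()

# Hypothetical results page; the columns and values are made up.
html <- '<html><body><table id="results">
  <tr><th>Issuer</th><th>Coupon</th><th>Maturity</th></tr>
  <tr><td>Example Corp</td><td>5.50</td><td>2018-02-01</td></tr>
  <tr><td>Example Corp</td><td>6.30</td><td>2038-01-15</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Read the first table in the page into a data frame of character columns.
bonds <- readHTMLTable(doc, which = 1, header = TRUE, stringsAsFactors = FALSE)

# Convert the columns to appropriate R types.
bonds$Coupon <- as.numeric(bonds$Coupon)
bonds$Maturity <- as.Date(bonds$Maturity)

str(bonds)
```

Real pages often wrap the data in nested tables or fill it in with JavaScript, in which case the table index or the whole approach would have to change.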
>>>>>
>>>>> ______________________________________________
>>>>> R-devel_at_r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>>
>>>>>
>>>>> --
>>>>> Gabriel Becker
>>>>> Graduate Student
>>>>> Statistics Department
>>>>> University of California, Davis
>>>>>
>>>>
>>



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Wed 31 Oct 2012 - 17:51:29 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 01 Nov 2012 - 13:20:49 GMT.