Re: [R] Find String Between Characters

From: William Dunlap <wdunlap_at_tibco.com>
Date: Sun, 15 May 2011 12:46:50 -0700

It looks like you can get the text of the document with
  as(mmm[[1]], "character")
and you can use grep, strsplit, gsub, etc. on that text.
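
As a minimal sketch of that approach: sub() coerces its input with as.character(), which is presumably why calling it on the scraped object itself returned only the "<pointer: ...>" string. Once you have the document as real text, pattern matching works. The page_text value below is a hypothetical stand-in for the result of as(mmm[[1]], "character"); the actual page content will differ.

```r
# Hypothetical stand-in for as(mmm[[1]], "character"); not the real page.
page_text <- paste0(
  "<html><body>",
  "<a href=\"/cgi-bin/browse-edgar?action=getcompany",
  "&amp;CIK=0000320193&amp;owner=exclude&amp;count=40\">filings</a>",
  "</body></html>"
)

# Find the first "CIK=" followed by a run of digits.
m <- regexpr("CIK=[0-9]+", page_text)

# Extract the matched substring, then drop the literal "CIK=" prefix.
cik <- sub("CIK=", "", regmatches(page_text, m), fixed = TRUE)
cik
# [1] "0000320193"
```

The result is kept as a character string so that the leading zeros are not lost.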

Look at the functions in the XML package for ways to use the XML structure of the data, instead of pattern matching, to extract meaningful parts of the document.

class?HTMLInternalDocument
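
A rough sketch of the structure-based route, assuming the XML package is installed. The html_text value is a hypothetical stand-in for the scraped page, and the XPath expression assumes the CIK appears in the href of a link, which may not hold for the real page. With the scraped object you would pass mmm[[1]] to xpathSApply() directly instead of re-parsing text.

```r
library(XML)

# Hypothetical stand-in for the scraped page; the real markup will differ.
html_text <- paste0(
  "<html><body>",
  "<a href=\"/cgi-bin/browse-edgar?action=getcompany",
  "&amp;CIK=0000320193&amp;owner=exclude&amp;count=40\">filings</a>",
  "<a href=\"/index.html\">home</a>",
  "</body></html>"
)

# Parse the text into an HTMLInternalDocument.
doc <- htmlParse(html_text, asText = TRUE)

# Collect the href attribute of every link whose href mentions "CIK=".
hrefs <- xpathSApply(doc, "//a[contains(@href, 'CIK=')]", xmlGetAttr, "href")

# Pull the digits after "CIK=" out of each matching href.
cik <- sub("^.*CIK=([0-9]+).*$", "\\1", hrefs)
cik
```

Selecting the nodes first limits the pattern matching to a few short attribute values rather than the whole page.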

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces_at_r-project.org
> [mailto:r-help-bounces_at_r-project.org] On Behalf Of Sparks, John James
> Sent: Saturday, May 14, 2011 7:14 PM
> To: jim holtman
> Cc: r-help_at_r-project.org
> Subject: Re: [R] Find String Between Characters
>
> Hi Jim,
>
> Thanks for your note.
>
> Unfortunately, when I attempt your solution in my exact setting, I get a
> weird and slightly different answer.
>
> First, let me be more clear. What I am attempting to do is pull the CIK
> number out of the information from the web page itself after it has loaded
> to R (this may not be optimal, but I am new at this), not from the web
> page reference (as you have done).
>
> So, when I execute the following as per your suggestion:
>
> require(scrapeR)
> mmm<-scrape(url="http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&owner=exclude&count=40")
>
> num <- sub("^.*CIK=([0-9]+).*", "\\1", mmm)
>
> I get
> [1] "<pointer: 0x00000000001265c0>"
>
> Is this just a hex representation of the same number, or is something else
> going on here?
>
> Comments from any and all would be much appreciated.
>
> --John J. Sparks, Ph.D.
>
> On Sat, May 14, 2011 7:57 pm, jim holtman wrote:
> > Is this what you want:
> >
> >> mmm<-"http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&owner=exclude&count=40"
> >> num <- sub("^.*CIK=([0-9]+).*", "\\1", mmm)
> >> num
> > [1] "0000320193"
> >>
> >
> >
> > On Sat, May 14, 2011 at 8:20 PM, Sparks, John James <jspark4_at_uic.edu> wrote:
> >> Dear R Helpers,
> >>
> >> I am trying to isolate a set of characters between two other characters
> >> in a long string file.  I tried some of the examples on the R help pages
> >> and elsewhere, but I am not able to get it.  Your help would be much
> >> appreciated.
> >>
> >> require(scrapeR)
> >> mmm<-scrape(url="http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&owner=exclude&count=40")
> >> str(mmm)
> >>
> >> I want to get the number 0000320193 that is between the CIK= and the &.
> >> I have tried
> >>
> >> g <- grep( "CIK=|&", mmm )
> >> and
> >> temp<-grep(mmm,\CIK=\&)
> >>
> >> and variations on these themes, but all won't run or come back as an
> >> empty object.  How can I grab this number?
> >>
> >> Best wishes,
> >> --John J. Sparks, Ph.D.
> >>
> >> ______________________________________________
> >> R-help_at_r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
> >
> > --
> > Jim Holtman
> > Data Munger Guru
> >
> > What is the problem that you are trying to solve?
> >
> >
>
>



Received on Sun 15 May 2011 - 19:49:36 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sun 15 May 2011 - 19:50:06 GMT.
