Re: [Rd] extending strsplit(): supply pattern to keep, not to split by

From: Bill Dunlap <bill_at_insightful.com>
Date: Tue 04 Apr 2006 - 17:10:16 GMT

On Tue, 4 Apr 2006, Gabor Grothendieck wrote:

> gsubfn in package gsubfn can do this. See the examples
> in ?gsubfn

Thanks. gsubfn looks useful, but it may be overkill for this, and it isn't vectorized. To do what strsplit(keep=T) would do, I think you need something like:

   > findMatches<-function(strings, pattern){
        lapply(strings, function(string){
               v <- character()
               # use the pattern argument, not the global number.pattern
               gsubfn(pattern, function(x,...)v<<-c(v,x), string)
               v})
     }

   > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
   > findMatches(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
   [[1]]
   [1] "12" "34" "56" "89" "12"

   [[2]]
   [1] "1.2" ".4" "1." "1e3"

Is this worth encapsulating in a standard R function? If so, is doing it via an extra argument to strsplit() a reasonable way to go?

   > strsplit(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern, keep=T)
   [[1]]:
   [1] "12" "34" "56" "89" "12"

   [[2]]:
   [1] "1.2" ".4" "1." "1e3"
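
For comparison, here is a minimal sketch of the same "keep the matches" behavior that is vectorized and needs no extra package. It assumes an R version that provides regmatches() (added to base R well after this message was written), and the name findMatches2 is only a placeholder for illustration. It is not the proposed strsplit(keep=TRUE) implementation.

```r
# Hypothetical helper (the name findMatches2 is an assumption, not base R):
# return the pieces of each string that match `pattern`, rather than the
# pieces that lie between matches.
findMatches2 <- function(strings, pattern) {
    # gregexpr() locates every match in each element of `strings`;
    # regmatches() extracts the matched substrings, one vector per string.
    regmatches(strings, gregexpr(pattern, strings))
}

number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

res <- findMatches2(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
print(res)

# A string with no numbers yields a zero-length character vector.
print(findMatches2("no numbers here", number.pattern))
```

Because both gregexpr() and regmatches() accept a character vector, no explicit lapply() over the strings is needed.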

> On 4/4/06, Bill Dunlap <bill@insightful.com> wrote:
> > strsplit() is a convenient way to get a
> > list of items from a string when you
> > have a regular expression for what is not
> > an item. E.g.,
> >
> > > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
> > [[1]]:
> > [1] "1.2" "34" "1.7e-2"
> >
> > However, sometimes it is more convenient to
> > give a pattern for the items you do want.
> > E.g., suppose you want to pull all the numbers
> > out of a string which contains a mix of numbers
> > and words. Making a pattern for what a
> > number is is simpler than making a pattern
> > for what may come between the numbers.
> > > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
> >
> > I propose adding a keep=FALSE argument to
> > strsplit() to do this. If keep is FALSE,
> > then the split argument matches the stuff to
> > omit from the output; if keep is TRUE then
> > split matches the stuff to put into the
> > output. Then we could do the following to
> > get a list of all the numbers in a string
> > (done in a version of strsplit() I'm working on
> > for S-PLUS):
> >
> > > strsplit("1.2, 34, 1.7e-2", split=number.pattern,keep=TRUE)
> > [[1]]:
> > [1] "1.2" "34" "1.7e-2"
> >
> > > strsplit("Ibuprofin 200mg", split=number.pattern,keep=TRUE)
> > [[1]]:
> > [1] "200"
> >
> > Is this a reasonable thing to want strsplit to do?
> > Is this a reasonable parameterization of it?



Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do  not necessarily reflect Insightful Corporation policy or position."



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed Apr 05 03:41:00 2006
