Re: [Rd] extending strsplit(): supply pattern to keep, not to split by

From: Gabor Grothendieck <ggrothendieck_at_gmail.com>
Date: Thu 06 Apr 2006 - 13:11:57 GMT

To follow up, strapply has been added to the gsubfn package (gsubfn 0.1-1) which should make it easier to address this problem.

It's basically just a sapply call around gsubfn which returns the transformed matches rather than performing substitution. It's analogous to apply:

	apply(object, margin, function)
	strapply(object, pattern, function)

(The arguments shown above are not a complete list, nor are they the actual argument names; they are simply intended to show the close parallel between strapply and apply.)
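
As a rough illustration of that parallel, here is a minimal sketch in base R. It is not the gsubfn implementation: strapply_sketch is a hypothetical name, and it assumes a version of R that provides regmatches(). It pulls every match of a pattern out of each string and applies a function to each match, much as apply applies a function over a margin.

```r
# Hypothetical stand-in for strapply(), using base R only.
# NOT the gsubfn code; assumes regmatches() is available.
strapply_sketch <- function(x, pattern, FUN = function(m, ...) m, ...) {
  # gregexpr() locates every match; regmatches() extracts the matched text
  matches <- regmatches(x, gregexpr(pattern, x))
  # apply FUN to each match, one input string at a time
  lapply(matches, function(m) unlist(lapply(m, FUN, ...)))
}

# Convert each run of digits to a number
res <- strapply_sketch(c("a1b22", "c333"), "[0-9]+", as.numeric)
print(res)
```

Leaving FUN at its default in this sketch returns the matches unchanged, which mirrors the default described for strapply.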

The default function in strapply returns its first argument so for this problem we could omit the function altogether and write:

  library(gsubfn) # ver 0.1-1 needed
  # number.pattern is as defined in the quoted message below
  x <- c("12;34:56,89,,12", "1.2, .4, 1., 1e3")
  strapply(x, number.pattern)

See ?strapply for more info.
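
For readers without gsubfn at hand, the result expected for this problem can be checked with base R alone. The sketch below assumes an R version that has regmatches() and reuses number.pattern exactly as defined in the quoted message below; it is a stand-in for the strapply call, not its implementation.

```r
# number.pattern copied from the quoted message in this thread
number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
x <- c("12;34:56,89,,12", "1.2, .4, 1., 1e3")

# Keep the text that matches the pattern instead of splitting on it
found <- regmatches(x, gregexpr(number.pattern, x))
print(found)

# The kept strings can then be converted to numbers if desired
nums <- lapply(found, as.numeric)
print(nums)
```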

On 4/4/06, Gabor Grothendieck <ggrothendieck@gmail.com> wrote:
> On 4/4/06, Bill Dunlap <bill@insightful.com> wrote:
> > On Tue, 4 Apr 2006, Gabor Grothendieck wrote:
> >
> > > gsubfn in package gsubfn can do this. See the examples
> > > in ?gsubfn
> >
> > Thanks. gsubfn looks useful, but may be overkill
> > for this, and it isn't vectorized. To do what
>
> gsubfn is vectorized. It's just that you are not using the output of
> gsubfn in this case.
>
> > strsplit(keep=T) would do, I think you need to do something like:
> >
> > > findMatches<-function(strings, pattern){
> >     lapply(strings, function(string){
> >         v <- character()
> >         gsubfn(pattern, function(x,...)v<<-c(v,x), string)
> >         v})
> > }
> > > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
> > > findMatches(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern)
> > [[1]]
> > [1] "12" "34" "56" "89" "12"
> >
> > [[2]]
> > [1] "1.2" ".4" "1." "1e3"
> >
> > Is this worth encapsulating in a standard R function?
>
> I will likely add a wrapper to the gsubfn package for this.
>
> > If so, is doing via an extra argument to strsplit()
> > a reasonable way to do it?
>
> My current thought was to create a strapply function to do that.
>
> >
> > > strsplit(c("12;34:56,89,,12", "1.2, .4, 1., 1e3"), number.pattern, keep=T)
> > [[1]]:
> > [1] "12" "34" "56" "89" "12"
> >
> > [[2]]:
> > [1] "1.2" ".4" "1." "1e3"
> >
> >
> > > On 4/4/06, Bill Dunlap <bill@insightful.com> wrote:
> > > > strsplit() is a convenient way to get a
> > > > list of items from a string when you
> > > > have a regular expression for what is not
> > > > an item. E.g.,
> > > >
> > > > > strsplit("1.2, 34, 1.7e-2", split="[ ,] *")
> > > > [[1]]:
> > > > [1] "1.2" "34" "1.7e-2"
> > > >
> > > > However, sometimes it is more convenient to
> > > > give a pattern for the items you do want.
> > > > E.g., suppose you want to pull all the numbers
> > > > out of a string which contains a mix of numbers
> > > > and words. Making a pattern for what a
> > > > number is is simpler than making a pattern
> > > > for what may come between the numbers.
> > > > > number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"
> > > >
> > > > I propose adding a keep=FALSE argument to
> > > > strsplit() to do this. If keep is FALSE,
> > > > then the split argument matches the stuff to
> > > > omit from the output; if keep is TRUE then
> > > > split matches the stuff to put into the
> > > > output. Then we could do the following to
> > > > get a list of all the numbers in a string
> > > > (done in a version of strsplit() I'm working on
> > > > for S-PLUS):
> > > >
> > > > > strsplit("1.2, 34, 1.7e-2", split=number.pattern,keep=TRUE)
> > > > [[1]]:
> > > > [1] "1.2" "34" "1.7e-2"
> > > >
> > > > > strsplit("Ibuprofin 200mg", split=number.pattern,keep=TRUE)
> > > > [[1]]:
> > > > [1] "200"
> > > >
> > > > Is this a reasonable thing to want strsplit to do?
> > > > Is this a reasonable parameterization of it?
> >
> > ----------------------------------------------------------------------------
> > Bill Dunlap
> > Insightful Corporation
> > bill at insightful dot com
> > 360-428-8146
> >
> > "All statements in this message represent the opinions of the author and do
> > not necessarily reflect Insightful Corporation policy or position."
> >
>
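
For reference, the keep= behaviour proposed in the quoted thread can be approximated with a small wrapper in base R. This is only a sketch under stated assumptions: strsplit_keep is a hypothetical name, it is not the S-PLUS version mentioned above nor anything in base R's strsplit(), and it assumes an R version that provides regmatches().

```r
# Hypothetical wrapper illustrating the proposed keep= argument.
# keep = FALSE: 'split' matches what to drop (ordinary strsplit).
# keep = TRUE:  'split' matches what to return.
strsplit_keep <- function(x, split, keep = FALSE) {
  if (keep) {
    regmatches(x, gregexpr(split, x))
  } else {
    strsplit(x, split)
  }
}

# number.pattern copied from the quoted thread
number.pattern <- "[-+]?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?"

kept    <- strsplit_keep("1.2, 34, 1.7e-2", number.pattern, keep = TRUE)
dropped <- strsplit_keep("1.2, 34, 1.7e-2", "[ ,] *")
print(kept)
print(dropped)
```

Both calls give the same list of items here; they differ only in whether the pattern describes the items or the separators.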



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Thu Apr 06 23:36:51 2006
