From: Charles C. Berry <cberry_at_tajo.ucsd.edu>

Date: Wed, 17 Nov 2010 12:24:54 -0800

R-devel_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-devel Received on Wed 17 Nov 2010 - 20:33:08 GMT

On Wed, 17 Nov 2010, Dickison, Daniel wrote:

> I downloaded and compiled the standalone TRE agrep command line program,
> and I think I have a slightly better idea of what's going on. Basically
> R's agrep, like the command line tool, is matching all strings that
> *contain* the pattern. So, essentially, insertions before and after the
> pattern are "free".
>
> As far as I can tell, there isn't an option to require full-string matches
> using the TRE library. It should be possible to not use REG_LITERAL and
> surround the pattern with ^ and $, but that would require escaping all
> special characters in the original pattern.
>
> Is this something worth pursuing? (For my immediate needs I'll probably
> create a separate function that passes the regex directly to TRE without
> REG_LITERAL).
>
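The "containment" reading above can be checked from within R. What follows is only a minimal sketch, assuming an R version whose utils package provides adist(), a function that is not used anywhere in this thread. As I read ?adist, partial = FALSE requires the pattern to account for the whole target string, while partial = TRUE scores only the best-matching substring, which is the kind of distance agrep works with.

```r
# Assumption: utils::adist() is available; it does not appear in this thread.
pattern <- "abcd"
target  <- "abcxyz"

# Full-string (generalized Levenshtein) distance: every character of the
# target has to be accounted for by an edit or a match.
full_dist <- as.vector(adist(pattern, target))

# Partial distance: the pattern only has to match some substring of the
# target, so characters outside that substring cost nothing.
partial_dist <- as.vector(adist(pattern, target, partial = TRUE))

full_dist     # 3 for this pair, the value worked out by hand in the original post
partial_dist  # compare with max.distance = 1 in the original post
```

If the partial distance comes out no larger than the max.distance that was used, the match reported in the original post is what substring matching would be expected to give.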

I am joining this thread late, but I wonder if reversing agrep's 'pattern' and 'x' args serves the OP's need.

viz.

> sapply( c("x","xy","xyz","xyza"),
+         function(y) any( agrep( y, "x", max=list(all=1))))
    x    xy   xyz  xyza
 TRUE  TRUE FALSE FALSE
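
If a true full-string test is what is wanted, another route is to compute the edit distance directly and threshold it. This is a sketch only, assuming utils::adist() is available in the R version at hand; full_match is a hypothetical helper name, not a function that ships with R.

```r
# Hypothetical helper: indices of the elements of x whose full-string
# Levenshtein distance to pattern is at most max_dist.
# Assumption: utils::adist() is available.
full_match <- function(pattern, x, max_dist = 1) {
  d <- as.vector(adist(pattern, x))  # one distance per element of x
  which(d <= max_dist)
}

# The strings from the sapply() call above.
full_match("x", c("x", "xy", "xyz", "xyza"), max_dist = 1)

# The pair from the original report.
full_match("abcd", "abcxyz", max_dist = 1)
```

The first call should pick out "x" and "xy" only, in line with the sapply() result above, and the second should return an empty index vector because the full-string distance for that pair is 3.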

HTH,

Chuck

> Daniel
>
> On 11/17/10 11:47 AM, "Joris Meys" <jorismeys_at_gmail.com> wrote:
>
>> It might have something to do with spaces and the interpretation of
>> insertions, as far as I understand the following examples:
>>
>> > agrep("x",c("x","xy","xyz","xyza"),max=list(all=1))
>> [1] 1 2 3 4
>> > agrep("x ",c("x ","xy ","xyz ","xyza"),max=list(all=1))
>> [1] 1
>> > agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(all=1)))
>> [1] 1 2 3 4
>> > agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(ins=1)))
>> [1] 1 2 3 4
>> > agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=2)))
>> [1] 1
>> > agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=3)))
>> [1] 1
>>
>> If the sequences are padded with spaces to the same length, this function
>> gives the expected result in the second example, but it definitely
>> doesn't do that any more when you start playing around with
>> insertions. If not a bug, it definitely behaves pretty weirdly...
>>
>> Cheers
>> Joris
>>
>> On Wed, Nov 17, 2010 at 4:49 PM, Dickison, Daniel
>> <ddickison_at_carnegielearning.com> wrote:
>>> I posted this yesterday to r-help and Ben Bolker suggested reposting it
>>> here...
>>>
>>> Dickison, Daniel <ddickison <at> carnegielearning.com> writes:
>>>
>>>>
>>>> The documentation for agrep says it uses the Levenshtein edit distance,
>>>> but it seems to get this wrong in certain cases when there is a
>>>> combination of deletions and substitutions. For example:
>>>>
>>>> > agrep("abcd", "abcxyz", max.distance=1)
>>>> [1] 1
>>>>
>>>> That should've been a no-match. The edit distance between those
>>>> strings is 3 (1 substitution, 2 deletions), but agrep matches with
>>>> max.distance = 1.
>>>>
>>>> I didn't find anything in the bug database, so I was wondering if
>>>> somehow I'm misinterpreting how agrep works. If not, should I file
>>>> this in Bugzilla?
>>>>
>>>
>>> Could you re-post this on r-devel? It definitely sounds like
>>> this is worth following up. Based on a little bit of playing around,
>>> it's quite clear that I don't understand what's going on. The examples
>>> show things like
>>>
>>> agrep("lasy","lazy",max=list(sub=0))
>>>
>>> which makes sense, but
>>>
>>> agrep("lasy","lazybc",max=1)
>>> agrep("lasy","lazybc",max=0.001)
>>> agrep("lasy","layt",max=list(all=1))
>>>
>>> and
>>>
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))
>>>
>>> all give "1 2 3 4" ??
>>>
>>> this makes it clear that I really don't understand what's going on
>>> based on the documentation. I tried to trace into the C code
>>> (which calls functions from the TRE regexp library) but that didn't
>>> help much ...
>>>
>>> Daniel Dickison
>>> Research Programmer
>>> ddickison_at_carnegielearning.com
>>> Toll Free: (888) 851-7094 x103
>>> FAX: (412) 690-2444
>>>
>>> Revolutionary Math Curricula. Revolutionary Results.
>>>
>>> Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA 15219
>>> www.carnegielearning.com
>>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Applied mathematics, biometrics and process control
>>
>> tel : +32 9 264 59 87
>> Joris.Meys_at_Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>

Charles C. Berry                            Dept of Family/Preventive Medicine
cberry_at_tajo.ucsd.edu                     UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


