Re: [Rd] Bug in agrep computing edit distance?

From: Charles C. Berry <cberry_at_tajo.ucsd.edu>
Date: Wed, 17 Nov 2010 12:24:54 -0800

On Wed, 17 Nov 2010, Dickison, Daniel wrote:

> I downloaded and compiled the standalone TRE agrep command line program,
> and I think I have a slightly better idea of what's going on. Basically
> R's agrep, like the command line tool, is matching all strings that
> *contain* the pattern. So, essentially, insertions before and after the
> pattern is "free".
>
> As far as I can tell, there isn't an option to require full-string matches
> using the TRE library. It should be possible to not use REG_LITERAL and
> surround the pattern with ^ and $, but that would require escaping all
> special characters in the original pattern.
>
> Is this something worth pursuing? (For my immediate needs I'll probably
> create a separate function that passes the regex directly to TRE without
> REG_LITERAL).
>

I am joining this thread late, but I wonder if reversing agrep's 'pattern' and 'x' args serves the OP's need.

viz.

> sapply( c("x","xy","xyz","xyza"),

+ function(y) any( agrep( y, "x", max=list(all=1))))

     x xy xyz xyza
  TRUE TRUE FALSE FALSE HTH, Chuck

> Daniel
>
> On 11/17/10 11:47 AM, "Joris Meys" <jorismeys_at_gmail.com> wrote:
>
>> It might have to do something with spaces and the interpretation of
>> insertions, as far as I understand the following examples :
>>
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=1))
>> [1] 1 2 3 4
>>> agrep("x ",c("x ","xy ","xyz ","xyza"),max=list(all=1))
>> [1] 1
>>> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(all=1)))
>> [1] 1 2 3 4
>>> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(ins=1)))
>> [1] 1 2 3 4
>>> agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=2)))
>> [1] 1
>>> agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=3)))
>> [1] 1
>>
>> If the sequences are made the same length in spaces, this function
>> gives the expected result in the second example, but it definitely
>> doesn't do that any more when you start playing around with
>> insertions. If not a bug, it definitely behaves pretty weird...
>>
>> Cheers
>> Joris
>>
>> On Wed, Nov 17, 2010 at 4:49 PM, Dickison, Daniel
>> <ddickison_at_carnegielearning.com> wrote:
>>> I posted this yesterday to r-help and Ben Bolker suggested reposting it
>>> here...
>>>
>>> Dickison, Daniel <ddickison <at> carnegielearning.com> writes:
>>>
>>>>
>>>> The documentation for agrep says it uses the Levenshtein edit distance,
>>>> but it seems to get this wrong in certain cases when there is a
>>>> combination of deletions and substitutions. For example:
>>>>
>>>>> agrep("abcd", "abcxyz", max.distance=1)
>>>> [1] 1
>>>>
>>>> That should've been a no-match. The edit distance between those
>>>> strings
>>>> is 3 (1 substitution, 2 deletions), but agrep matches with max.distance
>>>>> =
>>>> 1.
>>>>
>>>> I didn't find anything in the bug database, so I was wondering if
>>>> somehow
>>>> I'm misinterpreting how agrep works. If not, should I file this in
>>>> Bugzilla?
>>>>
>>>
>>> Could you re-post this on r-devel? It definitely sounds like
>>> this is worth following up. Based on a little bit of playing around,
>>> it's quite clear that I don't understand what's going on. The examples
>>> show things like
>>>
>>> agrep("lasy","lazy",max=list(sub=0))
>>>
>>> which makes sense, but
>>>
>>> agrep("lasy","lazybc",max=1)
>>> agrep("lasy","lazybc",max=0.001)
>>> agrep("lasy","layt",max=list(all=1))
>>>
>>> and
>>>
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
>>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))
>>>
>>> all give "1 2 3 4" ??
>>>
>>> this makes it clear that I really don't understand what's going on
>>> based on the documentation. I tried to trace into the C code
>>> (which calls functions from the TRE regexp library) but that didn't
>>> help much ...
>>>
>>>
>>>
>>> Daniel Dickison
>>> Research Programmer
>>> ddickison_at_carnegielearning.com
>>> Toll Free: (888) 851-7094 x103
>>> FAX: (412) 690-2444
>>>
>>> Revolutionary Math Curricula. Revolutionary Results.
>>>
>>> Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA
>>> 15219
>>> www.carnegielearning.com
>>>
>>> ______________________________________________
>>> R-devel_at_r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Applied mathematics, biometrics and process control
>>
>> tel : +32 9 264 59 87
>> Joris.Meys_at_Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Charles C. Berry                            Dept of Family/Preventive Medicine
cberry_at_tajo.ucsd.edu			    UC San Diego

http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901

R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Wed 17 Nov 2010 - 20:33:08 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 18 Nov 2010 - 01:10:23 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive