Re: [Rd] Bug in agrep computing edit distance?

From: Dickison, Daniel <ddickison_at_carnegielearning.com>
Date: Wed, 17 Nov 2010 14:54:44 -0500


I downloaded and compiled the standalone TRE agrep command line program, and I think I have a slightly better idea of what's going on. Basically R's agrep, like the command line tool, is matching all strings that *contain* the pattern. So, essentially, insertions before and after the pattern is "free".

As far as I can tell, there isn't an option to require full-string matches using the TRE library. It should be possible to not use REG_LITERAL and surround the pattern with ^ and $, but that would require escaping all special characters in the original pattern.

Is this something worth pursuing? (For my immediate needs I'll probably create a separate function that passes the regex directly to TRE without REG_LITERAL). Daniel

On 11/17/10 11:47 AM, "Joris Meys" <jorismeys_at_gmail.com> wrote:

>It might have to do something with spaces and the interpretation of
>insertions, as far as I understand the following examples :
>
>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=1))
>[1] 1 2 3 4
>> agrep("x ",c("x ","xy ","xyz ","xyza"),max=list(all=1))
>[1] 1
>> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(all=1)))
>[1] 1 2 3 4
>> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(ins=1)))
>[1] 1 2 3 4
>> agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=2)))
>[1] 1
>> agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=3)))
>[1] 1
>
>If the sequences are made the same length in spaces, this function
>gives the expected result in the second example, but it definitely
>doesn't do that any more when you start playing around with
>insertions. If not a bug, it definitely behaves pretty weird...
>
>Cheers
>Joris
>
>On Wed, Nov 17, 2010 at 4:49 PM, Dickison, Daniel
><ddickison_at_carnegielearning.com> wrote:
>> I posted this yesterday to r-help and Ben Bolker suggested reposting it
>> here...
>>
>> Dickison, Daniel <ddickison <at> carnegielearning.com> writes:
>>
>>>
>>> The documentation for agrep says it uses the Levenshtein edit distance,
>>> but it seems to get this wrong in certain cases when there is a
>>> combination of deletions and substitutions. For example:
>>>
>>> > agrep("abcd", "abcxyz", max.distance=1)
>>> [1] 1
>>>
>>> That should've been a no-match. The edit distance between those
>>>strings
>>> is 3 (1 substitution, 2 deletions), but agrep matches with max.distance
>>>>=
>>> 1.
>>>
>>> I didn't find anything in the bug database, so I was wondering if
>>>somehow
>>> I'm misinterpreting how agrep works. If not, should I file this in
>>> Bugzilla?
>>>
>>
>> Could you re-post this on r-devel? It definitely sounds like
>> this is worth following up. Based on a little bit of playing around,
>> it's quite clear that I don't understand what's going on. The examples
>> show things like
>>
>> agrep("lasy","lazy",max=list(sub=0))
>>
>> which makes sense, but
>>
>> agrep("lasy","lazybc",max=1)
>> agrep("lasy","lazybc",max=0.001)
>> agrep("lasy","layt",max=list(all=1))
>>
>> and
>>
>> agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
>> agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
>> agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))
>>
>> all give "1 2 3 4" ??
>>
>> this makes it clear that I really don't understand what's going on
>> based on the documentation. I tried to trace into the C code
>> (which calls functions from the TRE regexp library) but that didn't
>> help much ...
>>
>>
>>
>> Daniel Dickison
>> Research Programmer
>> ddickison_at_carnegielearning.com
>> Toll Free: (888) 851-7094 x103
>> FAX: (412) 690-2444
>>
>> Revolutionary Math Curricula. Revolutionary Results.
>>
>> Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA
>>15219
>> www.carnegielearning.com
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
>
>--
>Joris Meys
>Statistical consultant
>
>Ghent University
>Faculty of Bioscience Engineering
>Department of Applied mathematics, biometrics and process control
>
>tel : +32 9 264 59 87
>Joris.Meys_at_Ugent.be
>-------------------------------
>Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Wed 17 Nov 2010 - 19:57:41 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 17 Nov 2010 - 23:30:22 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive