Re: [Rd] Bug in agrep computing edit distance?

From: Joris Meys <jorismeys_at_gmail.com>
Date: Wed, 17 Nov 2010 17:47:51 +0100

It might have something to do with spaces and the interpretation of insertions, as far as I understand the following examples:

> agrep("x",c("x","xy","xyz","xyza"),max=list(all=1))
[1] 1 2 3 4
> agrep("x ",c("x ","xy ","xyz ","xyza"),max=list(all=1))
[1] 1
> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(all=1)))
[1] 1 2 3 4
> agrep("xx",c("xx","xyx","xyzx","xyzax",max=list(ins=1)))
[1] 1 2 3 4
> agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=2)))
[1] 1
> agrep("xx ",c("xx ","xyx ","xyzx ","xyzax",max=list(all=3)))
[1] 1
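
A side note on the last four calls above: the closing parenthesis of c() comes after the max argument, so max=list(...) ends up as an element of the vector being searched rather than as an argument to agrep, and agrep would then run with its default max.distance. A minimal sketch of how to see this (the names x_wrong and x_right are only illustrative, not from the original session):

```r
# max= sits inside c(), so it is combined into x instead of being passed to agrep
x_wrong <- c("xx", "xyx", "xyzx", "xyzax", max = list(all = 1))
length(x_wrong)   # 5: four strings plus the value from the list
is.list(x_wrong)  # TRUE: combining with a list turns the result into a list
names(x_wrong)    # "" "" "" "" "max.all"

# Closing c() before max= keeps x a plain character vector of length 4
x_right <- c("xx", "xyx", "xyzx", "xyzax")
# The intended call would then be:
# agrep("xx", x_right, max = list(all = 1))
```

With the parenthesis moved, the max argument actually reaches agrep; what it returns may still depend on the R version.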

If the sequences are made the same length in spaces, this function gives the expected result in the second example, but it definitely doesn't do that anymore once you start playing around with insertions. If this is not a bug, the behaviour is certainly pretty weird...
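
For reference, the whole-string Levenshtein distances in these examples can be checked independently of agrep. The sketch below is a plain dynamic-programming implementation; lev_dist is a hypothetical helper written for illustration, not a function in base R, and it compares complete strings, which may not be what agrep does internally.

```r
# Hypothetical helper (not part of base R): whole-string Levenshtein distance
lev_dist <- function(a, b) {
  s <- strsplit(a, "")[[1]]
  t <- strsplit(b, "")[[1]]
  n <- length(s)
  m <- length(t)
  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (s[i] == t[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # deletion
                             d[i + 1, j] + 1L,  # insertion
                             d[i, j] + cost)    # substitution or match
    }
  }
  d[n + 1, m + 1]
}

lev_dist("abcd", "abcxyz")  # 3
lev_dist("x", "xyza")       # 3
lev_dist("lasy", "lazybc")  # 3
```

All three pairs are at distance 3 as whole strings, so a match with a maximum distance of 1 cannot be explained by whole-string Levenshtein distance alone.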

Cheers
Joris

On Wed, Nov 17, 2010 at 4:49 PM, Dickison, Daniel <ddickison_at_carnegielearning.com> wrote:
> I posted this yesterday to r-help and Ben Bolker suggested reposting it
> here...
>
> Dickison, Daniel <ddickison <at> carnegielearning.com> writes:
>
>>
>> The documentation for agrep says it uses the Levenshtein edit distance,
>> but it seems to get this wrong in certain cases when there is a
>> combination of deletions and substitutions.  For example:
>>
>> > agrep("abcd", "abcxyz", max.distance=1)
>> [1] 1
>>
>> That should've been a no-match.  The edit distance between those strings
>> is 3 (1 substitution, 2 deletions), but agrep matches with
>> max.distance = 1.
>>
>> I didn't find anything in the bug database, so I was wondering if somehow
>> I'm misinterpreting how agrep works.  If not, should I file this in
>> Bugzilla?
>>
>
>  Could you re-post this on r-devel?  It definitely sounds like
> this is worth following up.  Based on a little bit of playing around,
> it's quite clear that I don't understand what's going on.  The examples
> show things like
>
> agrep("lasy","lazy",max=list(sub=0))
>
>  which makes sense, but
>
> agrep("lasy","lazybc",max=1)
> agrep("lasy","lazybc",max=0.001)
> agrep("lasy","layt",max=list(all=1))
>
> and
>
> agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
> agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
> agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))
>
>  all give "1 2 3 4" ??
>
>  this makes it clear that I really don't understand what's going on
> based on the documentation.  I tried to trace into the C code
> (which calls functions from the TRE regexp library) but that didn't
> help much ...
>
>
>
> Daniel  Dickison
> Research Programmer
> ddickison_at_carnegielearning.com
> Toll Free: (888) 851-7094 x103
> FAX: (412) 690-2444
>
> Revolutionary Math Curricula. Revolutionary Results.
>
> Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA 15219
> www.carnegielearning.com
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys_at_Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

Received on Wed 17 Nov 2010 - 16:51:43 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 17 Nov 2010 - 20:40:23 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.
