[Rd] Bug in agrep computing edit distance?

From: Dickison, Daniel <ddickison_at_carnegielearning.com>
Date: Wed, 17 Nov 2010 10:49:24 -0500


I posted this yesterday to r-help and Ben Bolker suggested reposting it here...

Dickison, Daniel <ddickison <at> carnegielearning.com> writes:

>
> The documentation for agrep says it uses the Levenshtein edit distance,
> but it seems to get this wrong in certain cases when there is a
> combination of deletions and substitutions. For example:
>
> > agrep("abcd", "abcxyz", max.distance=1)
> [1] 1
>
> That should've been a no-match. The edit distance between those strings
> is 3 (1 substitution, 2 deletions), but agrep matches with max.distance
> >= 1.
>
> I didn't find anything in the bug database, so I was wondering if somehow
> I'm misinterpreting how agrep works. If not, should I file this in
> Bugzilla?
>
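
For reference, the distance of 3 quoted above can be checked by hand with a small dynamic-programming sketch. The helper name lev_dist is hypothetical (it is not part of base R or of agrep), and the sketch assumes unit costs for insertions, deletions, and substitutions:

```r
# Whole-string Levenshtein distance, unit costs assumed.
# lev_dist is a hypothetical helper written for this illustration.
lev_dist <- function(a, b) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds the distance between the first i characters
  # of a and the first j characters of b.
  d <- matrix(0L, nrow = n + 1L, ncol = m + 1L)
  d[, 1L] <- 0:n
  d[1L, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1L, j + 1L] <- min(
        d[i, j + 1L] + 1L,  # delete a character of a
        d[i + 1L, j] + 1L,  # insert a character of b
        d[i, j] + cost      # match or substitute
      )
    }
  }
  d[n + 1L, m + 1L]
}

lev_dist("abcd", "abcxyz")  # 3
```

Under this whole-string definition, a match at max.distance=1 would indeed be unexpected.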

Ben Bolker's reply:

  Could you re-post this on r-devel? It definitely sounds like this is worth following up. Based on a little bit of playing around, it's quite clear that I don't understand what's going on. The examples show things like

agrep("lasy","lazy",max=list(sub=0))

 which makes sense, but

agrep("lasy","lazybc",max=1)
agrep("lasy","lazybc",max=0.001)
agrep("lasy","layt",max=list(all=1))

and

agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))

  all give "1 2 3 4"?

  This makes it clear that I really don't understand what's going on from the documentation alone. I tried to trace into the C code (which calls functions from the TRE regexp library), but that didn't help much ...
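
One reading that would be consistent with these results, offered here as an assumption rather than something confirmed in this thread, is that agrep looks for an approximate match to the pattern anywhere inside each string, instead of comparing the two whole strings. A minimal sketch of that idea follows. The helper name substr_dist is hypothetical, unit costs are assumed, and this is not a description of what the TRE code actually does:

```r
# Smallest edit distance between `pattern` and ANY substring of `text`.
# substr_dist is a hypothetical helper; unit edit costs are assumed.
substr_dist <- function(pattern, text) {
  p <- strsplit(pattern, "", fixed = TRUE)[[1]]
  s <- strsplit(text, "", fixed = TRUE)[[1]]
  n <- length(p)
  m <- length(s)
  # d[i + 1, j + 1]: best distance between the first i pattern characters
  # and some substring of text ending at position j.
  d <- matrix(0L, nrow = n + 1L, ncol = m + 1L)
  d[, 1L] <- 0:n  # pattern prefix against empty text: i deletions
  # The first row stays 0: a match may start at any position for free.
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (p[i] == s[j]) 0L else 1L
      d[i + 1L, j + 1L] <- min(
        d[i, j + 1L] + 1L,  # drop a pattern character
        d[i + 1L, j] + 1L,  # skip a text character inside the match
        d[i, j] + cost      # match or substitute
      )
    }
  }
  min(d[n + 1L, ])  # a match may also end at any position
}

substr_dist("abcd", "abcxyz")  # 1
substr_dist("lasy", "lazybc")  # 1
sapply(c("x", "xy", "xyz", "xyza"), function(s) substr_dist("x", s))  # all 0
```

Under that assumption, "abcd" is within distance 1 of the substring "abcx", and "x" occurs exactly in every element of the test vector, which would account for the matches listed above.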

Daniel Dickison
Research Programmer
ddickison_at_carnegielearning.com
Toll Free: (888) 851-7094 x103
FAX: (412) 690-2444

Revolutionary Math Curricula. Revolutionary Results.
Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA 15219
www.carnegielearning.com



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Wed 17 Nov 2010 - 15:53:15 GMT

