Re: [Rd] Bug in agrep computing edit distance?

From: Dickison, Daniel <ddickison_at_carnegielearning.com>
Date: Wed, 17 Nov 2010 18:23:39 -0500


On 11/17/10 6:06 PM, "Joris Meys" <jorismeys_at_gmail.com> wrote:

>Indeed, I get it. If the pattern is "xx", it is only matched against 2
>letters at the same time. All the rest doesn't matter. But still that
>doesn't explain
>
>>agrep("ANNTCG", "ANNXXTCG", max = list(ins=3))
>integer(0)
>>agrep("ANNTCG", "ANNXTCG", max = list(ins=3))
>[1] 1
>>agrep("ANNTCG", "ANTCG", max = list(del=3))
>[1] 1
>>agrep("ANNTCG", "ATCG", max = list(del=3))
>integer(0)

It looks like R's agrep defaults max.distance$all to 0.1 when the argument doesn't specify it, which explains these examples: the first and last ones have a net distance of 2, which is greater than ceiling(0.1 * nchar(pattern)), i.e. 1 for the 6-character pattern "ANNTCG".
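To make the arithmetic concrete, here is a rough sketch that computes that default limit and compares it with the whole-string edit distance for the four quoted targets. It assumes an R version that provides utils::adist, and it is only an approximation of what agrep does internally, since agrep matches against substrings rather than the full string. The names limit, dists and targets are just for illustration.

```r
pattern <- "ANNTCG"
targets <- c("ANNXXTCG", "ANNXTCG", "ANTCG", "ATCG")

# Cap on total edits when max.distance$all is left at its 0.1 default:
# 10% of the pattern length, rounded up.
limit <- ceiling(0.1 * nchar(pattern))

# Generalized Levenshtein distance from the pattern to each full target
# (whole-string distance, an assumption; agrep itself matches substrings).
dists <- as.vector(adist(pattern, targets))

# One row per target, flagging whether it stays within the default limit.
data.frame(target = targets, distance = dists, within_limit = dists <= limit)
```

Only the two targets at distance 1 fall within the limit, which lines up with the quoted results.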

The attachment is a completely untested fix that turns the pattern into a regex (I haven't yet succeeded in setting up an environment to compile R from source). Since TRE defaults to Basic POSIX regex syntax, in theory only backslashes in the user-provided pattern need to be escaped, and \^ and \$ added to the pattern. Hopefully somebody can review this to see if it looks correct.
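The idea can be sketched at the R level as follows. This is not the attached patch, just an illustration under the same premise: backslashes are doubled and the pattern is wrapped in the basic-regex anchors ^ and $ so that it has to match the whole string. The helper name anchor_pattern is hypothetical, and other regex metacharacters (such as . * and [) are deliberately left unhandled here.

```r
# Hypothetical helper: turn a literal pattern into an anchored regex.
anchor_pattern <- function(pattern) {
  # Double every backslash; fixed = TRUE treats both arguments literally.
  escaped <- gsub("\\", "\\\\", pattern, fixed = TRUE)
  # Anchor at both ends so the match must span the entire string.
  paste0("^", escaped, "$")
}

anchored <- anchor_pattern("ANNTCG")
```

In R versions where agrep accepts fixed = FALSE, a pattern built this way could presumably be passed as a regex, but I haven't tested that either.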

Daniel

Daniel Dickison
Research Programmer
ddickison_at_carnegielearning.com
Toll Free: (888) 851-7094 x103
FAX: (412) 690-2444

Revolutionary Math Curricula. Revolutionary Results.

Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA 15219
www.carnegielearning.com



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Wed 17 Nov 2010 - 23:26:35 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 18 Nov 2010 - 16:20:23 GMT.

