[R] glob2rx() {was: no bug in R2.1.0's list.files()}

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Fri 13 May 2005 - 04:19:13 EST

>>>>> "BaRow" == Barry Rowlingson <B.Rowlingson@lancaster.ac.uk>
>>>>>     on Thu, 12 May 2005 11:05:43 +0100 writes:

    BaRow> Uwe Ligges wrote:
>> Please read about regular expressions (!!!) and try to
>> understand that ".txt" also finds "Not_a_txt_file.xls"
>> ....

    BaRow>   The confusion here is between regular expressions
    BaRow> and wildcard expansion known as 'globbing'. The two
    BaRow> things are very different, and use characters such as
    BaRow> '*' '.' and '?' in different ways.
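
A small sketch of the point above, using made-up file names (the vector `files` is purely illustrative): as a regular expression, ".txt" means "any one character followed by txt", so it matches more than a DOS user would expect.

```r
## Hypothetical file names, for illustration only
files <- c("report.txt", "Not_a_txt_file.xls", "txt")

## As a regexp, '.' matches any single character, anywhere in the string
as_regexp <- grepl(".txt", files)

## A literal dot, anchored at the end, is closer to what the glob "*.txt" means
anchored <- grepl("\\.txt$", files)

print(data.frame(files, as_regexp, anchored))
```

The second pattern is roughly what a glob-to-regexp translation has to produce.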

Exactly. I devised a "glob" to "regexp" function many years ago to help newbies make the transition.

That function, nowadays called 'glob2rx', has been part of our (CRAN) package "sfsmisc" and is hence available to all via

       install.packages("sfsmisc")
       library("sfsmisc")

But it's quite simple (though not trivial to read for the inexperienced, because of the many escapes ("\") needed), and it may be helpful to see its code on R-help, below. Moreover, this topic has led me to add two (obvious in hindsight) optional logical arguments to the function, so that it now looks like

glob2rx <- function(pattern, trim.head = FALSE, trim.tail = TRUE) {

    ## Purpose: Change "ls" aka "wildcard" aka "globbing" _pattern_ to
    ##	      Regular Expression (as in grep, perl, emacs, ...)
    ## -------------------------------------------------------------------------
    ## Author: Martin Maechler ETH Zurich, ~ 1991
    ##	       New version using [g]sub() : 2004
    p <- gsub('\\.','\\\\.', paste('^', pattern, '$', sep=''))
    p <- gsub('\\?',	 '.',  gsub('\\*',  '.*', p))
    ## these are trimming '.*$' and '^.*' - in most cases only for esthetics
    if(trim.tail) p <- sub("\\.\\*\\$$", '', p)
    if(trim.head) p <- sub("\\^\\.\\*", '', p)
    p
}
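
To make the escapes easier to follow, here is a step-by-step trace of the same substitutions for one pattern. This is only an illustrative sketch: the intermediate names s1 .. s5 are hypothetical and not part of the function.

```r
pattern <- "a?b.*"

s1 <- paste("^", pattern, "$", sep = "")  # anchor both ends:   ^a?b.*$
s2 <- gsub("\\.", "\\\\.", s1)            # literal dots:       ^a?b\.*$
s3 <- gsub("\\*", ".*", s2)               # glob '*' -> '.*':   ^a?b\..*$
s4 <- gsub("\\?", ".", s3)                # glob '?' -> '.':    ^a.b\..*$
s5 <- sub("\\.\\*\\$$", "", s4)           # trim trailing .*$:  ^a.b\.

## cat() shows the strings without R's extra backslash escaping
cat(s1, s2, s3, s4, s5, sep = "\n")
```

The dots are escaped first because the later steps introduce new dots ('.*' and '.') that must keep their regexp meaning.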

So those confused newbies (and DOS long timers!) could use

      list.files(myloc, glob2rx("*.zip"), full=TRUE)

            ## (yes, make a habit of using 'TRUE', not 'T' ..)
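
A self-contained sketch of that call, run on a scratch directory. The directory and file names are made up for illustration, and glob2rx() is assumed to be on the search path (from sfsmisc as above; later R versions also ship a function of that name in package utils).

```r
## Hypothetical scratch directory with three empty files
d <- file.path(tempdir(), "glob2rx-demo")
dir.create(d, showWarnings = FALSE)
invisible(file.create(file.path(d, c("a.zip", "b.zip", "unzip_notes.txt"))))

## Plain ".zip" is a regexp: any character followed by "zip", anywhere
loose <- list.files(d, pattern = ".zip")

## The translated glob only matches names that end in ".zip"
found <- list.files(d, pattern = glob2rx("*.zip"))

print(loose)
print(found)
```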

The current example code, BTW, has

    stopifnot(glob2rx("abc.*") == "^abc\\.",
               glob2rx("a?b.*") == "^a.b\\.",
               glob2rx("a?b.*", trim.tail=FALSE) == "^a.b\\..*$",
               glob2rx("*.doc") == "^.*\\.doc$",
               glob2rx("*.doc", trim.head=TRUE) == "\\.doc$",
               glob2rx("*.t*")  == "^.*\\.t",
               glob2rx("*.t??") == "^.*\\.t..$"
     )


Martin Maechler,
ETH Zurich

    BaRow>   There's added confusion when people come from a DOS
    BaRow> background, where commands did their own thing when
    BaRow> given '*' as parameter. The DOS command:

    BaRow> RENAME *.FOO *.BAR

    BaRow>   did what seems obvious, renaming all the .FOO files
    BaRow> to .BAR, but on a unix machine doing this with 'mv'
    BaRow> can be destructive!
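
For comparison, a hedged sketch of doing the same rename from within R, where the glob expansion and the name substitution are two explicit steps. The scratch directory and file names are hypothetical.

```r
## Hypothetical scratch directory with two .FOO files and one bystander
d <- file.path(tempdir(), "rename-demo")
dir.create(d, showWarnings = FALSE)
invisible(file.create(file.path(d, c("one.FOO", "two.FOO", "keep.txt"))))

old <- Sys.glob(file.path(d, "*.FOO"))  # glob expansion, as a shell would do it
new <- sub("\\.FOO$", ".BAR", old)      # regexp: a literal ".FOO" at the end
ok  <- file.rename(old, new)            # one old/new pair per file

renamed <- sort(list.files(d))
print(renamed)
```

Here each file gets its own explicit target name, so nothing depends on how a shell would expand the second pattern.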

    BaRow>   In short (and slightly simplified), a '*' when
    BaRow> expanded as a wildcard in a glob matches any string,
    BaRow> whereas a '*' in a regular expression (regexp),
    BaRow> matches the previous character 0 or more times. This
    BaRow> is why "*.zip" is flagged as invalid now - there's no
    BaRow> character before the "*".
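
A minimal sketch of the two meanings of '*', on made-up strings. The second regexp is written out by hand as the translation of the glob "ab*c".

```r
x <- c("ac", "abc", "abbbc", "abXc")

## Regexp reading of "ab*c": 'a', then zero or more 'b', then 'c'
as_regexp <- grepl("^ab*c$", x)

## Glob reading of "ab*c": "ab", then any string, then 'c'
as_glob <- grepl("^ab.*c$", x)

print(data.frame(x, as_regexp, as_glob))
```

The two readings disagree on "ac" and on "abXc".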

    BaRow> That should be enough clues to send you on your
    BaRow> way.

    BaRow> Baz



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Fri May 13 04:31:46 2005
