Re: [R] Regex query (Apache logs)

From: Saptarshi Guha <saptarshi.guha_at_gmail.com>
Date: Thu, 17 Mar 2011 09:08:06 -0700

Hello Allan,

Thanks for the response; it gives me hope. I appreciate suggestion [3] and might even go with that.
And for posterity, here's the code (assuming pastebin never expires):

[1] Test string: http://pastebin.com/FyAFzmTv
[2] Pattern (modified as per your suggestion): http://pastebin.com/s7VT0r5K

pattern <- readLines(url("http://pastebin.com/raw.php?i=s7VT0r5K"), warn=FALSE)
test <- readLines(url("http://pastebin.com/raw.php?i=rbAvR2dK"), warn=FALSE)
regexpr(pattern, test, perl=TRUE)
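
For anyone reading this after the pastebin links are gone, here is a rough self-contained sketch of the same idea. The pattern is the one Allan printed (lazy quantifiers at the end); the log line is a short made-up one in the same layout, not the real test string, and the names line, m and fields are only for illustration. It assumes an R version where regexec() accepts perl=TRUE (3.3.0 or later, as far as I know).

```r
# Pattern from [1] with the lazy (.*?) groups, split over three pieces for readability.
pat <- paste0(
  "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s([^\\s]*)\\s([^\\s]*)",
  "\\s\\[([^\\]]+)\\]\\s\"([A-Z]*)\\s([^\\s]*)\\s([^\\s]*)\"",
  "\\s([^\\s]+)\\s(\\d+)\\s\"(.*?)\"\\s\"(.*?)\"\\s\"(.*?)\"$"
)

# Hypothetical log line in the same layout as [2] (NOT the original test string).
line <- '192.0.2.1 example.org - [10/Jan/2001:01:55:07 -0800] "GET /index.html HTTP/1.1" 200 3243 "-" "Mozilla/5.0 (Android)" "k=v"'

# regexec() returns the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(line, regexec(pat, line, perl = TRUE))[[1]]

# Drop the first element (the full match) to keep only the 12 capture groups.
fields <- m[-1]
print(fields)
```

Unlike grep() or regexpr(), this returns the groups themselves, which is closer to what re.match(....).groups() gives in Python.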

Thanks
Saptarshi

On Thu, Mar 17, 2011 at 12:12 AM, Allan Engelhardt <allane_at_cybaea.com> wrote:

> Some comments:
>
> 1. [^\s] matches everything up to a literal 's', unless perl=TRUE.
> 2. The (.*) is greedy, so you'll need (.*?)"\s"(.*?)"\s"(.*?)"$ or similar
> at the end of the expression
>
> With those changes (and removing a space inserted by the newsgroup posting)
> the expression works for me.
>
> > (pat <- readLines("/tmp/b.txt")[1])
> [1]
> "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s([^\\s]*)\\s([^\\s]*)\\s\\[([^\\]]+)\\]\\s\"([A-Z]*)\\s([^\\s]*)\\s([^\\s]*)\"\\s([^\\s]+)\\s(\\d+)\\s\"(.*?)\"\\s\"(.*?)\"\\s\"(.*?)\"$"
> > regexpr(pat, test, perl=TRUE)
> [1] 1
> attr(,"match.length")
> [1] 436
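
A small sketch of comments 1 and 2 on toy strings, in case it helps someone see the difference. The strings x and y are made up for illustration, and regexec() with perl=TRUE is assumed to be available (R 3.3.0 or later, as far as I know). The default-engine result is printed rather than asserted, since it depends on how the default engine treats "\s" inside brackets.

```r
x <- "host name"

# Comment 1: inside [...] the default engine does not read \s as whitespace,
# so the match is expected to stop early (see Allan's comment 1).
default_match <- regmatches(x, regexpr("^[^\\s]*", x))
print(default_match)

# With perl=TRUE, [^\s] means "anything that is not whitespace".
perl_match <- regmatches(x, regexpr("^[^\\s]*", x, perl = TRUE))
print(perl_match)

# Comment 2: greedy (.*) versus lazy (.*?) on four quoted fields.
y <- '"a" "b" "c" "d"'
greedy <- regmatches(y, regexec('^"(.*)"\\s"(.*)"\\s"(.*)"', y, perl = TRUE))[[1]]
lazy   <- regmatches(y, regexec('^"(.*?)"\\s"(.*?)"\\s"(.*?)"', y, perl = TRUE))[[1]]

# Element 1 is the full match; elements 2..4 are the three groups.
print(greedy[-1])
print(lazy[-1])
```

The greedy version swallows the first two fields into group 1, while the lazy version takes one quoted field per group.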
>
> 3. Consider a different approach, e.g. scan(textConnection(test),
> what=character(0))
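
A minimal sketch of the scan() approach from comment 3, on a short made-up log line in the same layout as [2] (not the original test string). The quote= and quiet= settings are my own choices here, not something Allan specified; restricting quote to the double quote keeps apostrophes in user-agent strings from being read as quoting characters.

```r
# Hypothetical log line in the same layout as [2].
line <- '192.0.2.1 example.org - [10/Jan/2001:01:55:07 -0800] "GET /index.html HTTP/1.1" 200 3243 "-" "Mozilla/5.0 (Android)" "k=v"'

# scan() splits on whitespace but keeps double-quoted fields together.
con <- textConnection(line)
tokens <- scan(con, what = character(0), quote = "\"", quiet = TRUE)
close(con)

print(tokens)
```

Note that the bracketed timestamp is not quoted, so it comes back as two tokens ("[10/Jan/2001:01:55:07" and "-0800]") that would need to be pasted back together.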
>
> Hope this helps
>
> Allan
>
>
>
> On 16/03/11 22:18, Saptarshi Guha wrote:
>
>> Hello R users,
>>
>> I have this regex (see [1]) for Apache log lines. I tried using R to parse
>> some data (only because I wanted to stay in R).
>> A sample line is [2].
>>
>> (a) I saved the line in [1] into "~/tmp/a.txt" and [2] into "/tmp/a.txt"
>>
>> pat<- readLines("~/tmp/a.txt")
>> test<- readLines("/tmp/a.txt")
>> test
>> grep(pat,test)
>>
>> returns integer(0)
>>
>> The same query works in Python via re.match(....) (i.e. it does return groups)
>>
>> Using readLines, the regex is escaped for me. Do Python and R use
>> different regex styles?
>>
>> Cheers
>> Saptarshi
>>
>> [1]
>>
>> ^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s([^\s]*)\s([^\s]*)\s\[([^\]]+)\]\s"([A-Z]*)\s([^\s]*)\s([^\s]*)"\s([^\s]+)\s(\d+)\s"(.*)"\s"(.*)"\s"(.*)"$
>>
>> [2]
>> 220.213.119.925 addons.mozilla.org - [10/Jan/2001:01:55:07 -0800] "GET
>>
>> /blocklist/3/%8ce33983c0-fd0e-11dc-12aa-0800200c9a66%7D/4.0b5/Fennec/20110217140304/Android_arm-eabi-gcc3/chrome:%2F%2Fglobal%2Flocale%2Fintl.properties/beta/Linux%
>> 202.6.32.9/default/default/6/6/1/ HTTP/1.1" 200 3243 "-" "Mozilla/5.0
>> (Android; Linux armv7l; rv:2.0b12pre) Gecko/20110217 Firefox/4.0b12pre
>> Fennec/4.0b5" "BLOCKLIST_v3=110.163.217.169.1299218425.9706"
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

Received on Thu 17 Mar 2011 - 16:15:37 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 17 Mar 2011 - 17:10:22 GMT.
