Re: [R] Newbie: Using R to analyse Apache logs

From: Zembower, Kevin <kzembowe_at_jhuccp.org>
Date: Thu, 31 Jan 2008 09:21:43 -0500

Raj,

I've been experimenting with R to compute simple statistics from my web logs, somewhat similar to what you're describing. For instance, I'm trying to classify each unique requestor (IP address or domain name) as 'human' or 'robot' based on the number of seconds between its page requests. Given my (elementary) knowledge of R and my (professional) knowledge of Perl, I've found the easiest way to work is to pre-process the logs with a Perl program before passing the data to R. Running my Apache web log through that program produces tab-delimited output like this:
kevinz_at_cn2:~/weblogstats$ ./weblogtimediff.pl access_log.20071130.sorted |head

DateTime        Source  TimeDiff        Type
30/Nov/2007 00:00:47    54.100.68.58.sikkanet.com       15      unknown
30/Nov/2007 00:00:48    54.100.68.58.sikkanet.com       1       unknown
30/Nov/2007 00:01:19    54.100.68.58.sikkanet.com       31      unknown
30/Nov/2007 00:01:25    54.100.68.58.sikkanet.com       6       unknown
30/Nov/2007 00:01:29    ip-61-14-181-116.asianetcom.net 15      unknown
30/Nov/2007 00:01:40    54.100.68.58.sikkanet.com       15      unknown
30/Nov/2007 00:01:41    54.100.68.58.sikkanet.com       1       unknown
30/Nov/2007 00:01:44    llf520049.crawl.yahoo.net       14      robot
30/Nov/2007 00:01:46    ip-61-14-181-116.asianetcom.net 17      unknown
kevinz_at_cn2:~/weblogstats$

In this, I also make a preliminary classification into 'robot' (because it identified itself as such in the browser field), 'human' (because it submitted a text string to my internal search engine), or 'unknown'.
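
For readers without the Perl script, the TimeDiff column appears to be the gap in seconds between consecutive requests from the same source. Here is a minimal Python sketch of that idea, not the Perl program itself. It assumes records are already in chronological order, that timestamps use the format shown above, and that month names parse in an English locale; `time_diffs` is a hypothetical helper name.

```python
from datetime import datetime

def time_diffs(records):
    """Yield (timestamp, source, seconds since that source's previous request).

    `records` is an iterable of (timestamp_string, source) pairs in
    chronological order. A source's first request has no predecessor,
    so it is skipped rather than given a made-up difference.
    """
    last_seen = {}
    for stamp, source in records:
        when = datetime.strptime(stamp, "%d/%b/%Y %H:%M:%S")
        if source in last_seen:
            gap = int((when - last_seen[source]).total_seconds())
            yield stamp, source, gap
        last_seen[source] = when

# A few (DateTime, Source) pairs taken from the output above.
sample = [
    ("30/Nov/2007 00:00:47", "54.100.68.58.sikkanet.com"),
    ("30/Nov/2007 00:00:48", "54.100.68.58.sikkanet.com"),
    ("30/Nov/2007 00:01:19", "54.100.68.58.sikkanet.com"),
    ("30/Nov/2007 00:01:29", "ip-61-14-181-116.asianetcom.net"),
    ("30/Nov/2007 00:01:46", "ip-61-14-181-116.asianetcom.net"),
]
diffs = list(time_diffs(sample))
```

On these rows the computed gaps agree with the TimeDiff values listed for the same timestamps.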

Unfortunately, this approach doesn't seem to be working. By inspection, the distributions of time intervals for both the 'humans' and the 'robots' looked Poisson. I therefore created box plots of log(mean(time intervals)), but the 'humans' and the 'robots' were indistinguishable by inspection. This is not exactly what I'm paid to do, so I just play with it in my spare time and haven't tried anything else yet.
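
The quantity behind those box plots can be sketched roughly as follows: one log(mean interval) value per source, grouped by type. This is a Python illustration under my own assumptions (rows reduced to Source, TimeDiff, Type; sources with a zero mean dropped because the log is undefined), and `log_mean_intervals` is a hypothetical name, not something from my actual analysis.

```python
import math
from collections import defaultdict
from statistics import mean

def log_mean_intervals(rows):
    """Map each type to a list of log(mean TimeDiff), one value per source."""
    gaps = defaultdict(list)
    for source, seconds, kind in rows:
        gaps[(kind, source)].append(seconds)

    by_type = defaultdict(list)
    for (kind, _source), values in gaps.items():
        average = mean(values)
        if average > 0:  # log is undefined for a zero mean, so skip it
            by_type[kind].append(math.log(average))
    return dict(by_type)

# (Source, TimeDiff, Type) rows taken from the sample output earlier.
rows = [
    ("54.100.68.58.sikkanet.com", 15, "unknown"),
    ("54.100.68.58.sikkanet.com", 1, "unknown"),
    ("ip-61-14-181-116.asianetcom.net", 15, "unknown"),
    ("ip-61-14-181-116.asianetcom.net", 17, "unknown"),
    ("llf520049.crawl.yahoo.net", 14, "robot"),
]
result = log_mean_intervals(rows)
```

Each list in `result` is what one box in the plot would summarise; comparing the lists for 'human' and 'robot' is the comparison described above.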

If it's of general interest to this group, I'd be happy to publish my program for this. Otherwise, Raj, if you're interested, I'd be happy to send it to you privately.

One oddity I noted is that Apache logs are not always in chronological order. The date/time stamp records when the request was received, but the entry is written to the log only when the request completes. Thus, during a long download, several shorter subsequent requests may be made and completed, and so logged, before the earlier, long one. I was confused by negative time differences from my program until I discovered this. I now sort my Apache log chronologically before passing it through my program.
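
A sort of this kind might look like the Python sketch below. It assumes the standard Common Log Format timestamp in square brackets (for example `[30/Nov/2007:00:00:47 -0500]`) and an English locale for month names. The two log lines are made up for illustration, using documentation-range IP addresses, and `sort_chronologically` is a hypothetical helper rather than the tool I actually use.

```python
import re
from datetime import datetime

# The request timestamp is the first bracketed field of each line.
STAMP = re.compile(r"\[([^\]]+)\]")

def request_time(line):
    """Parse the bracketed request timestamp of a Common Log Format line."""
    return datetime.strptime(STAMP.search(line).group(1), "%d/%b/%Y:%H:%M:%S %z")

def sort_chronologically(lines):
    """Return log lines ordered by request time; ties keep their logged order."""
    return sorted(lines, key=request_time)

# Made-up example: the large download was requested first but logged last.
lines = [
    '192.0.2.1 - - [30/Nov/2007:00:00:50 -0500] "GET /small.html HTTP/1.0" 200 512',
    '192.0.2.2 - - [30/Nov/2007:00:00:47 -0500] "GET /large.pdf HTTP/1.0" 200 9000000',
]
ordered = sort_chronologically(lines)
```

Because `sorted` is stable, requests with identical timestamps stay in the order Apache wrote them.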

Hope this helps. Let me know if you have any other questions.

-Kevin

-----Original Message-----

From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of Raj Mathur
Sent: Thursday, January 31, 2008 8:31 AM
To: r-help
Subject: [R] Newbie: Using R to analyse Apache logs


Hi,

I have a requirement to scan Apache logs and discover ``exceptions''. Exceptions can be of two types:

  1. A single IP generating a large amount of traffic within a given time frame (for definable values of ``large'' and ``time frame'').
  2. A single IP hitting a wide set of URLs on the server (indicates a crawler), again for definable values of ``wide''.
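
The two exception types above could be prototyped along these lines. This is only a Python sketch under stated assumptions: requests are already parsed into (ip, time in seconds, url) tuples, and the parameters `window_seconds`, `max_hits` and `max_urls` are hypothetical stand-ins for the definable values of ``large'', ``time frame'' and ``wide''. The sample requests are made up for illustration.

```python
from collections import defaultdict

def heavy_hitters(requests, window_seconds, max_hits):
    """IPs with more than `max_hits` requests inside any `window_seconds` span."""
    times = defaultdict(list)
    for ip, seconds, _url in requests:
        times[ip].append(seconds)

    flagged = set()
    for ip, stamps in times.items():
        stamps.sort()
        start = 0
        for end, stamp in enumerate(stamps):
            # Slide the window start forward until it spans < window_seconds.
            while stamp - stamps[start] >= window_seconds:
                start += 1
            if end - start + 1 > max_hits:
                flagged.add(ip)
                break
    return flagged

def wide_crawlers(requests, max_urls):
    """IPs that requested more than `max_urls` distinct URLs."""
    urls = defaultdict(set)
    for ip, _seconds, url in requests:
        urls[ip].add(url)
    return {ip for ip, seen in urls.items() if len(seen) > max_urls}

# Made-up requests: one IP fetches four pages in four seconds,
# another fetches the same page twice, 100 seconds apart.
requests = [("192.0.2.1", t, f"/page{t}") for t in range(4)]
requests += [("192.0.2.2", 0, "/index"), ("192.0.2.2", 100, "/index")]

busy = heavy_hitters(requests, window_seconds=10, max_hits=3)
wide = wide_crawlers(requests, max_urls=3)
```

Both functions make a single pass over the data plus a per-IP sort, which should be workable for a few hundred thousand to a million records a day.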

I'm a complete newbie to R (and to statistics), so the questions are:

Data massaging, tuning, etc. are not an issue. We'd be dealing with a few
hundred thousand or a million records a day.

Regards,


R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu 31 Jan 2008 - 14:32:12 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 01 Feb 2008 - 05:30:11 GMT.
