[R] Histograms with strings, grouped by repeat count (w/ data)

From: Matthew Trunnell <trunnell_at_cognix.net>
Date: Mon, 18 Jun 2007 18:07:42 -0700


Hello R gurus,

I just spent my first weekend wrestling with R, but so far I have come up empty-handed.

I have a dataset that represents file downloads; it has four columns: date, filename, email, and country (sample data below).

My first goal is to get an idea of the frequency of repeated downloads. Let me explain. Some people download the same file multiple times, e.g. if the download fails they keep trying over and over. I'm trying to build a histogram that shows the repeat count along the x-axis, that is, how many people downloaded once, twice, three times, etc. I plan to compare the median repeat count before and after we switched ISPs.
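For the before/after comparison, here is a minimal sketch of one possible approach. It assumes the timestamps are month/day/year (the sample rows roll over from 3/4 to 3/5 at midnight), and the `cutoff` value is purely hypothetical, standing in for the real ISP switch date. Only a handful of the sample rows are used, with the country column left out.

```r
# A few rows copied from the sample data (country column omitted)
d <- read.csv(text = "filename,last_modified,email_addr
file1,3/4/2006 13:54,email1
file2,3/4/2006 14:33,email2
file2,3/4/2006 16:03,email2
file2,3/4/2006 16:17,email3
file2,3/4/2006 16:28,email3
file3,3/4/2006 19:13,email4
file2,3/4/2006 21:22,email5
file2,3/5/2006 4:02,email14
file2,3/5/2006 4:07,email14
file2,3/5/2006 4:14,email14
file2,3/5/2006 4:44,email15
file2,3/5/2006 6:06,email17
file2,3/5/2006 7:32,email17", stringsAsFactors = FALSE)

# Assumption: timestamps are month/day/year hour:minute
when <- as.POSIXct(d$last_modified, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Hypothetical cutoff standing in for the (unknown) ISP switch date
cutoff <- as.POSIXct("2006-03-05 00:00", tz = "UTC")
period <- ifelse(when < cutoff, "before", "after")

# Number of attempts per (email, filename) pair within each period
counts <- aggregate(list(attempts = rep(1, nrow(d))),
                    by = list(email = d$email_addr,
                              file = d$filename,
                              period = period),
                    FUN = sum)

# Median repeat count in each period
med <- tapply(counts$attempts, counts$period, median)
print(med)
```

A pair that straddles the cutoff is counted once in each period here, which may or may not be the right treatment for the real data.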

To accomplish this, I'm assuming that I'll first need to combine the email and filename columns, so that each combination represents one individual downloading one file and each row is a single attempt. Does that sound right? Later, it would be nice to limit the histogram to a single filename, country, or company. I can probably figure that out myself once I understand how to write this funky histogram expression.
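As a rough sketch of that idea (not necessarily the best way), the two columns can be pasted into one key, counted with `table()`, and the counts tabulated again. The `"|"` separator and the variable names are arbitrary choices for illustration; the rows are the first 14 of the sample data.

```r
# First 14 rows of the sample data from this post
d <- read.csv(text = "filename,last_modified,email_addr,country_residence
file1,3/4/2006 13:54,email1,Korea (South)
file2,3/4/2006 14:33,email2,United States
file2,3/4/2006 16:03,email2,United States
file2,3/4/2006 16:17,email3,United States
file2,3/4/2006 16:28,email3,United States
file3,3/4/2006 19:13,email4,United States
file2,3/4/2006 21:22,email5,India
file4,3/4/2006 21:46,email6,United States
file1,3/4/2006 22:04,email7,Japan
file2,3/4/2006 22:09,email8,Croatia
file1,3/4/2006 22:22,email7,Japan
file1,3/4/2006 22:29,email9,India
file1,3/4/2006 23:06,email6,United States
file1,3/4/2006 23:33,email6,United States", stringsAsFactors = FALSE)

# One key per (person, file) combination
key <- paste(d$email_addr, d$filename, sep = "|")

# Attempts per combination, then how many combinations had 1, 2, ... attempts
attempts    <- table(key)
repeat_dist <- table(attempts)
print(repeat_dist)

# The same thing limited to a single file (or country) is just a row subset
d2 <- d[d$filename == "file2", ]
repeat_dist_file2 <- table(table(paste(d2$email_addr, d2$filename, sep = "|")))
print(repeat_dist_file2)
```

The names of `repeat_dist` are the repeat counts and the values are the number of email/file combinations with that count.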

With the help of Verzani's introductory text, I've learned how to read in the CSV data and make some simple tables and histograms, like this:

hist(table(d$filename))
hist(table(d$filename[substring(d$filename, 1, 5)=="file1"]))
hist(sort(table(d$filename[substring(d$filename, 1, 5)=="file1"])))

Obviously, these commands count the frequency of the files. What I'd like to see are the repeats grouped along the x-axis; I'd like to find, for all files, the distribution of retries. I hope that makes sense. :)
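To make the target concrete, here is a hedged sketch of the kind of plot meant above, using a few rows from the end of the sample data. Because repeat counts are whole numbers, a bar plot of the tabulated counts may read better than `hist()`. The variable names are illustrative only.

```r
# A few (filename, email) rows copied from the sample data
d <- data.frame(
  filename   = c("file2", "file2", "file2", "file2", "file2",
                 "file10", "file10", "file10", "file2", "file2", "file2"),
  email_addr = c("email13", "email14", "email14", "email14", "email5",
                 "email29", "email29", "email29", "email29", "email29", "email29"),
  stringsAsFactors = FALSE
)

# Attempts per (email, filename) combination; drop = TRUE discards
# combinations that never occur in the data
per_pair <- table(interaction(d$email_addr, d$filename, drop = TRUE))

# Tabulate the repeat counts, keeping empty bins (e.g. nobody tried twice)
repeat_dist <- table(factor(as.vector(per_pair), levels = 1:max(per_pair)))
print(repeat_dist)

# Repeat count along the x-axis, number of email/file combinations on the y-axis
barplot(repeat_dist,
        xlab = "Number of attempts per email/file combination",
        ylab = "Number of combinations")
```

If a true histogram is preferred, `hist(as.vector(per_pair), breaks = seq(0.5, max(per_pair) + 0.5, by = 1))` centers one bin on each integer.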

Can someone point me in the right direction? I'm very new to R and to statistics, but I write code for a living. At this point I'd almost be better off writing a program to do this kind of simple counting... but I have a feeling R would be so useful if I could just get past the initial learning curve.

Thank you in advance,
Matt

Here's some real data, with the private info replaced :)

d <- read.table(file="C:\\users\\trunnellm\\downloads\\statistics\\downloads.csv", sep=",", quote="\"", header=TRUE)

filename,last_modified,email_addr,country_residence
file1,3/4/2006 13:54,email1,Korea (South)
file2,3/4/2006 14:33,email2,United States
file2,3/4/2006 16:03,email2,United States
file2,3/4/2006 16:17,email3,United States
file2,3/4/2006 16:28,email3,United States
file3,3/4/2006 19:13,email4,United States
file2,3/4/2006 21:22,email5,India
file4,3/4/2006 21:46,email6,United States
file1,3/4/2006 22:04,email7,Japan
file2,3/4/2006 22:09,email8,Croatia
file1,3/4/2006 22:22,email7,Japan
file1,3/4/2006 22:29,email9,India
file1,3/4/2006 23:06,email6,United States
file1,3/4/2006 23:33,email6,United States
file5,3/4/2006 23:44,email10,China
file1,3/5/2006 0:13,email9,India
file2,3/5/2006 0:52,email8,Croatia
file2,3/5/2006 0:54,email8,Croatia
file2,3/5/2006 1:10,email5,India
file6,3/5/2006 2:17,email9,India
file2,3/5/2006 2:24,email11,Italy
file7,3/5/2006 2:36,email12,Italy
file8,3/5/2006 2:52,email12,Italy
file2,3/5/2006 3:09,email13,United Kingdom
file2,3/5/2006 4:02,email14,India
file2,3/5/2006 4:07,email14,India
file2,3/5/2006 4:14,email14,India
file2,3/5/2006 4:37,email5,India
file2,3/5/2006 4:44,email15,Belgium
file1,3/5/2006 5:02,email9,India
file1,3/5/2006 5:24,email16,Taiwan
file2,3/5/2006 6:06,email17,Saudi Arabia
file2,3/5/2006 7:32,email17,Saudi Arabia
file2,3/5/2006 8:12,email18,Brazil
file2,3/5/2006 8:26,email18,Brazil
file2,3/5/2006 9:49,email19,United Kingdom
file1,3/5/2006 10:49,email11,Italy
file1,3/5/2006 11:16,email13,United Kingdom
file1,3/5/2006 11:16,email13,United Kingdom
file1,3/5/2006 11:45,email13,United Kingdom
file1,3/5/2006 14:34,email20,Australia
file9,3/5/2006 14:56,email20,Australia
file9,3/5/2006 14:56,email20,Australia
file5,3/5/2006 16:43,email21,United States
file1,3/5/2006 17:17,email7,Japan
file2,3/5/2006 17:26,email22,Japan
file2,3/5/2006 17:27,email22,Japan
file2,3/5/2006 17:33,email23,China
file1,3/5/2006 17:45,email22,Japan
file2,3/5/2006 17:45,email22,Japan
file2,3/5/2006 17:59,email23,China
file1,3/5/2006 18:27,email24,Japan
file1,3/5/2006 18:47,email25,Taiwan
file2,3/5/2006 18:48,email26,New Zealand
file2,3/5/2006 19:15,email27,Canada
file2,3/5/2006 19:23,email28,Canada
file2,3/5/2006 19:24,email28,Canada
file10,3/5/2006 19:49,email29,Japan
file10,3/5/2006 19:52,email29,Japan
file10,3/5/2006 19:57,email29,Japan
file2,3/5/2006 20:01,email29,Japan
file2,3/5/2006 20:02,email29,Japan
file2,3/5/2006 20:06,email29,Japan

______________________________________________
R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue 19 Jun 2007 - 01:12:42 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 19 Jun 2007 - 02:32:13 GMT.
