Re: [Rd] CRAN Server download statistics (Was: R Usage Statistics)

From: Fellows, Ian <ifellows_at_ucsd.edu>
Date: Mon, 23 Nov 2009 12:51:08 -0800

Thank you all for the interesting discussion.

I'm sensitive to the privacy issue and am happy to work with any administrators to come up with a solution that works for them. We do not plan to make the raw logs publicly available, so the IP addresses of the users would be known only to the originating CRAN server and UCLA. If there are concerns about UCLA having access to that information, the logs can be preprocessed on the CRAN server and the IP addresses hashed (though this would be a little more work for the administrator). Each month, non-identifiable summary statistics will be made available for public download and analysis.
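
A minimal sketch of the preprocessing step mentioned above, assuming the logs are in Apache's common or combined format, where the client address is the first space-separated field. The secret key, function names, and sample line are hypothetical and not part of any existing setup. A keyed hash (HMAC) is used here as an assumption, since a plain hash of an IPv4 address can be reversed by trying all roughly four billion addresses.

```python
import hashlib
import hmac

# Hypothetical secret that stays on the mirror; name and value are assumptions.
MIRROR_SECRET = b"replace-with-a-random-secret-kept-on-the-mirror"


def hash_ip(ip: str, secret: bytes = MIRROR_SECRET) -> str:
    """Return a keyed SHA-256 digest of an IP address, truncated to 16 hex characters."""
    return hmac.new(secret, ip.encode("ascii"), hashlib.sha256).hexdigest()[:16]


def anonymize_line(line: str, secret: bytes = MIRROR_SECRET) -> str:
    """Replace the leading client address of an access-log line with its hash."""
    ip, sep, rest = line.partition(" ")
    return hash_ip(ip, secret) + sep + rest


if __name__ == "__main__":
    # Made-up log line; the package file name is a placeholder.
    sample = ('192.0.2.10 - - [23/Nov/2009:12:51:08 -0800] '
              '"GET /src/contrib/examplepkg_1.0.tar.gz HTTP/1.1" 200 1024')
    print(anonymize_line(sample))
```

The same address always maps to the same token under one key, so per-IP counts still work on the hashed logs, while the address itself is not shared.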

Hadley,

The site is just a bit of a sketch at the moment.

1. You are right, the continent information should not be on the front page.

2. Though the number of downloads is useful information, the number of IP addresses is probably more representative of package popularity. When I did a static analysis of a couple of months' worth of logs, I found that the package MetaMA had the most downloads, due to one IP address downloading it >1000 times. There is no perfect mapping between the logs and what we would really like to measure, which is the number of unique installations. I think it's best just to present both.

3. The site is sort of ugly right now; those are good suggestions for improvement.

4. The sorted % windows is a placeholder for the moment. It will be replaced by a measure of package hotness, i.e. the rate of change of the download rate.

5. The time plot is a histogram of date; it looks like a bar chart because there is currently only one week of data.

6. Regarding package dependencies, I was thinking about also counting the number of top-level downloads, as approximated by the number of downloads where a reverse dependency was not downloaded in the next 5 min by the same IP.
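
The measures in items 2, 4 and 6 can be sketched roughly as follows. This is an illustration under stated assumptions, not the site's actual code: each log record is taken to be an (IP or hashed IP, timestamp, package) tuple, "hotness" is read as the day-to-day change in download counts, and all function names, package names, and data are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def downloads_and_unique_ips(records):
    """Item 2: per package, return (total downloads, number of distinct IPs)."""
    totals = defaultdict(int)
    ips = defaultdict(set)
    for ip, _when, pkg in records:
        totals[pkg] += 1
        ips[pkg].add(ip)
    return {pkg: (totals[pkg], len(ips[pkg])) for pkg in totals}


def hotness(daily_counts):
    """Item 4: change in download rate between consecutive days."""
    return [later - earlier for earlier, later in zip(daily_counts, daily_counts[1:])]


def top_level_downloads(records, reverse_deps, window=timedelta(minutes=5)):
    """Item 6: count downloads that are not followed, within `window` and from
    the same IP, by a download of one of the package's reverse dependencies."""
    by_ip = defaultdict(list)
    for ip, when, pkg in records:
        by_ip[ip].append((when, pkg))
    counts = defaultdict(int)
    for events in by_ip.values():
        for when, pkg in events:
            dependents = reverse_deps.get(pkg, set())
            pulled_in = any(
                other in dependents and timedelta(0) <= later - when <= window
                for later, other in events
            )
            if not pulled_in:
                counts[pkg] += 1
    return dict(counts)


if __name__ == "__main__":
    t0 = datetime(2009, 11, 23, 12, 0)
    records = [
        ("ip1", t0, "pkgA"),
        ("ip1", t0 + timedelta(minutes=1), "pkgB"),  # pkgB depends on pkgA
        ("ip2", t0, "pkgA"),
        ("ip2", t0 + timedelta(minutes=30), "pkgA"),
    ]
    reverse_deps = {"pkgA": {"pkgB"}}  # pkgA is used by pkgB
    print(downloads_and_unique_ips(records))
    print(hotness([10, 15, 12]))
    print(top_level_downloads(records, reverse_deps))
```

In the made-up data, pkgA has three downloads from two IPs, but only two of them count as top-level, because the download from ip1 was followed a minute later by its dependent pkgB.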

There are a lot of interesting things that are not explored by the site as it stands. For example, one of the members of the group is working on creating an MDS representation of the packages, so that we can create a map where packages downloaded by similar IP addresses are clustered, creating a visual representation of the package space.
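
An MDS needs a package-by-package dissimilarity matrix as input. The message does not say which dissimilarity the group uses, so the sketch below assumes Jaccard distance between the sets of IPs that downloaded each package. The function name and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_distances(records):
    """Return {(pkg_a, pkg_b): d} with d = 1 - |A & B| / |A | B|, where A and B
    are the sets of IPs that downloaded pkg_a and pkg_b."""
    ips = defaultdict(set)
    for ip, pkg in records:
        ips[pkg].add(ip)
    distances = {}
    for a, b in combinations(sorted(ips), 2):
        shared = len(ips[a] & ips[b])
        either = len(ips[a] | ips[b])  # never zero: every package has at least one IP
        distances[(a, b)] = 1.0 - shared / either
    return distances


if __name__ == "__main__":
    # Made-up (ip, package) pairs.
    records = [
        ("ip1", "pkgA"), ("ip1", "pkgB"),
        ("ip2", "pkgA"), ("ip2", "pkgB"),
        ("ip3", "pkgC"),
    ]
    for pair, d in jaccard_distances(records).items():
        print(pair, d)
```

Packages downloaded by exactly the same IPs get distance 0 and packages with no IPs in common get distance 1. The resulting matrix could then be handed to an MDS routine such as R's cmdscale.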

Ian



From: hadley wickham [h.wickham_at_gmail.com]
Sent: Monday, November 23, 2009 6:12 AM
To: Fellows, Ian
Cc: R-devel; Stefan Theussl
Subject: Re: [Rd] CRAN Server download statistics (Was: R Usage Statistics)

Hi Ian,

I've spoken with Stefan Theussl (CRAN maintainer) about this, and he's concerned about the privacy implications of making the Apache access logs public. A compromise that he mentioned was having a script run on the CRAN mirror that processed the log files and output summary statistics. Then a central process could aggregate these and produce a single overall summary.
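
The compromise described above could look roughly like this: each mirror reduces its own logs to per-package counts, and a central process adds the counts together. This is only a sketch; the function names, mirror data, and package names are hypothetical.

```python
from collections import Counter


def summarize_mirror(downloaded_packages):
    """Run on each mirror: reduce a sequence of downloaded package names to counts."""
    return Counter(downloaded_packages)


def aggregate(summaries):
    """Run centrally: add the per-mirror counts into one overall summary."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


if __name__ == "__main__":
    # Made-up download lists for two mirrors.
    mirror_1 = summarize_mirror(["pkgA", "pkgA", "pkgB"])
    mirror_2 = summarize_mirror(["pkgA", "pkgC"])
    print(aggregate([mirror_1, mirror_2]))
```

Only the counts leave each mirror, so no IP addresses are shared. The trade-off is that download totals add up cleanly across mirrors, while unique-IP counts do not, since the same address may appear on more than one mirror.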

A few comments on your current site:

Hadley

On Sun, Nov 22, 2009 at 6:18 PM, Fellows, Ian <ifellows_at_ucsd.edu> wrote:
> Hi All,
>
> It seems that the question of how many people use (or download) R and its packages is one that comes up on a fairly regular basis in a variety of forums (there was also a recent thread on the subject on Stack Overflow). A couple of students at UCLA (including myself) wanted to address the issue, so we set up a system to get and parse the cran.stat.ucla.edu Apache logs every night and display some basic statistics. Right now, we have a working sketch of a site based on one week of observations.
>
> http://neolab.stat.ucla.edu/cranstats/
>
> We would very much like to incorporate data from all CRAN mirrors, including cran.r-project.org. We would also like to set this up in a way that is minimally invasive for the site administrators. Internally, our administrator has set up a protected directory with the last couple of days of CRAN activity. We then pull that down using curl.
>
> What would be the best and easiest way for the CRAN mirrors to share their data? Is the contact information for the administrators available anywhere?
>
>
> Thank you,
> Ian Fellows
>
>
>
> ________________________________________
> From: r-devel-bounces_at_r-project.org [r-devel-bounces_at_r-project.org] On Behalf Of Steven McKinney [smckinney_at_bccrc.ca]
> Sent: Thursday, November 19, 2009 2:21 PM
> To: Kevin R. Coombes; r-devel_at_r-project.org
> Subject: Re: [Rd] R Usage Statistics
>
> Hi Kevin,
>
> What a surprising comment from a reviewer for BMC Bioinformatics.
>
> I just did a PubMed search for "limma" and "aroma.affymetrix",
> just two methods for which I use R software regularly.
> "limma" yields 28 hits, several of which are published
> in BMC Bioinformatics. Bengtsson's aroma.affymetrix paper
> "Estimation and assessment of raw copy numbers at the single locus level."
> is already cited by 6 others.
>
> It almost seems too easy to work up lists of usage of R packages.
>
> Spotfire is an application built around S-Plus that has widespread use
> in the biopharmaceutical industry at a minimum. Vivek Ranadive's
> TIBCO company just purchased Insightful, the S-Plus company.
> (They bought Spotfire previously.)
> Mr. Ranadive does not spend money on environments that are
> not appropriate for deploying applications.
>
> You could easily cull a list of corporation names from the
> various R email listservs as well.
>
> Press back with the reviewer. Reviewers can learn new things
> and will respond to arguments with good evidence behind them.
> Good luck!
>
>
> Steven McKinney
>
>
> ________________________________________
> From: r-devel-bounces_at_r-project.org [r-devel-bounces_at_r-project.org] On Behalf Of Kevin R. Coombes [krcoombes_at_mdacc.tmc.edu]
> Sent: November 19, 2009 10:47 AM
> To: r-devel_at_r-project.org
> Subject: [Rd] R Usage Statistics
>
> Hi,
>
> I got the following comment from the reviewer of a paper (describing an
> algorithm implemented in R) that I submitted to BMC Bioinformatics:
>
> "Finally, while useful for exploratory work and some prototyping,
> neither R nor S-Plus are appropriate environments for deploying user
> applications that would receive much use."
>
> I can certainly respond by pointing out that CRAN contains more than
> 2000 packages and Bioconductor contains more than 350. However, does
> anyone have statistics on how often R (and possibly some R packages) are
> downloaded, or on how many people actually use R?
>
> Thanks,
> Kevin
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

--
http://had.co.nz/

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Mon 23 Nov 2009 - 21:19:35 GMT
