Re: [R] Fast R implementation of Gini mean difference

About this list Date view Thread view Subject view Author view Attachment view

From: Adelchi Azzalini (azzalini@stat.unipd.it)
Date: Thu 24 Apr 2003 - 23:41:42 EST


Message-id: <20030424134143.66C927CA824@tango.stat.unipd.it>

On Thursday 24 April 2003 05:17, you wrote:
> I have written the following function to calculate the weighted mean
> difference for univariate data (see
> http://www.xycoon.com/gini_mean_difference.htmáfor a related
> formula). Unsurprisingly, the function is slow (compared to sd or mad)
> for long vectors. I wonder if there's a way to make the function
> faster, short of creating an external C function. Thanks very much
> for your advice.
>
>
> gmd <- function(x, w) { # x=data vector, w=weights vector
> á án <- length(x)
> á átmp <- 0
> á áfor (i in 1:n) {
> á á á for (j in 1:n) {
> á á á á átmp <- tmp + w[i]*abs(x[i]-x[j])
> á á á }
> á á}
> á áretval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
> }

I have a few remarks on the "definition" of Gini index, and one on the most
suitable way to compute it.

(a) there no 0.5*sqrt(pi) neither in the standard definition of the index, nor
   in the quoted source, at http://www.xycoon.com/gini_mean_difference.htm
(b) when the values of x vector have frequencies, then the common
   definition knwon to me is not the above one, but one such that the nested line
   above should be replaced by
á á á á átmp <- tmp + w[i]*w[j]*abs(x[i]-x[j])
   and the final assignment should be
          0.5*sqrt(pi)*tmp/((sum(w)-1)*sum(w))
  In fact, the above function produces inconsident results; see this:
> gmd(c(1,2,2,4),c(1,1,1,1))
[1] 1.329
> gmd(c(1,2,4), c(1,2,1))
[1] 1.662

while the above modifications lead again to
> gmd(c(1,2,4), c(1,2,1))
[1] 1.329

(c) from the computational viewpoint, there is a much more convenient equivalent
   expression, based on the order statistics, leading to the following function:

gini.md<- function(x)
  { # x=data vector
    j <-order(x)
    n <-length(x)
    return(4*sum((1:length(x))*x[j])/(n*(n-1))
          -2*mean(x)*(n+1)/(n-1))
  }

This is intendend for the basic case of unreplicated data. With some algebraic work
one should be able to obtain a similar expression for the general case ...once an
agreement has been achieved on the definition of the index.

Clearly, to convert the output from gini.md() to gmd..() one must multiply by
0.5*sqrt(pi).

Here is a short comparison of timing, following the one prepared by Dan Nordlund.

> x<-rnorm(5000)
> n<-rpois(5000,5)
> w<-n/sum(n)

> system.time(gmd(x, w))
[1] 241.44 0.21 250.20 0.00 0.00
> system.time(gmd.1(x, w))
[1] 5.01 8.69 129.95 0.00 0.00
> system.time(gmd.2(x, w))
[1] 1.21 1.55 10.19 0.00 0.00
> system.time(gini.md(x))
[1] 0 0 0 0 0

also, the new function is requires little memory

regards

Adelchi Azzalini

-- 
Adelchi Azzalini  <azzalini@stat.unipd.it>
Dipart.Scienze Statistiche, UniversitÓ di Padova, Italia
http://azzalini.stat.unipd.it/

______________________________________________ R-help@stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help


About this list Date view Thread view Subject view Author view Attachment view

This archive was generated by hypermail 2.1.3 : Tue 01 Jul 2003 - 09:11:43 EST