**From:** Adelchi Azzalini (*azzalini@stat.unipd.it*)

**Date:** Thu 24 Apr 2003 - 23:41:42 EST

**Next message:**Philippe Grosjean: "RE: [R] Windows: Graphics appear only partially"**Previous message:**Christian Hennig: "Re: AW: [R] estimating number of clusters ("Null or more")"**In reply to:**Andrew C. Ward: "[R] Fast R implementation of Gini mean difference"**Next in thread:**Adelchi Azzalini: "Re: [R] Fast R implementation of Gini mean difference"

Message-id: <20030424134143.66C927CA824@tango.stat.unipd.it>

On Thursday 24 April 2003 05:17, you wrote:

*> I have written the following function to calculate the weighted mean
*

*> difference for univariate data (see
*

*> http://www.xycoon.com/gini_mean_difference.htm for a related
*

*> formula). Unsurprisingly, the function is slow (compared to sd or mad)
*

*> for long vectors. I wonder if there's a way to make the function
*

*> faster, short of creating an external C function. Thanks very much
*

*> for your advice.
*

*>
*

*>
*

*> gmd <- function(x, w) { # x=data vector, w=weights vector
*

*> n <- length(x)
*

*> tmp <- 0
*

*> for (i in 1:n) {
*

*> for (j in 1:n) {
*

*> tmp <- tmp + w[i]*abs(x[i]-x[j])
*

*> }
*

*> }
*

*> retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
*

*> }
*

I have a few remarks on the "definition" of Gini index, and one on the most

suitable way to compute it.

(a) there no 0.5*sqrt(pi) neither in the standard definition of the index, nor

in the quoted source, at http://www.xycoon.com/gini_mean_difference.htm

(b) when the values of x vector have frequencies, then the common

definition knwon to me is not the above one, but one such that the nested line

above should be replaced by

tmp <- tmp + w[i]*w[j]*abs(x[i]-x[j])

and the final assignment should be

0.5*sqrt(pi)*tmp/((sum(w)-1)*sum(w))

In fact, the above function produces inconsident results; see this:

*> gmd(c(1,2,2,4),c(1,1,1,1))
*

[1] 1.329

*> gmd(c(1,2,4), c(1,2,1))
*

[1] 1.662

while the above modifications lead again to

* > gmd(c(1,2,4), c(1,2,1))
*

[1] 1.329

(c) from the computational viewpoint, there is a much more convenient equivalent

expression, based on the order statistics, leading to the following function:

gini.md<- function(x)

{ # x=data vector

j <-order(x)

n <-length(x)

return(4*sum((1:length(x))*x[j])/(n*(n-1))

-2*mean(x)*(n+1)/(n-1))

}

This is intendend for the basic case of unreplicated data. With some algebraic work

one should be able to obtain a similar expression for the general case ...once an

agreement has been achieved on the definition of the index.

Clearly, to convert the output from gini.md() to gmd..() one must multiply by

0.5*sqrt(pi).

Here is a short comparison of timing, following the one prepared by Dan Nordlund.

*> x<-rnorm(5000)
*

*> n<-rpois(5000,5)
*

*> w<-n/sum(n)
*

*> system.time(gmd(x, w))
*

[1] 241.44 0.21 250.20 0.00 0.00

*> system.time(gmd.1(x, w))
*

[1] 5.01 8.69 129.95 0.00 0.00

*> system.time(gmd.2(x, w))
*

[1] 1.21 1.55 10.19 0.00 0.00

*> system.time(gini.md(x))
*

[1] 0 0 0 0 0

also, the new function is requires little memory

regards

Adelchi Azzalini

-- Adelchi Azzalini <azzalini@stat.unipd.it> Dipart.Scienze Statistiche, Universitą di Padova, Italia http://azzalini.stat.unipd.it/______________________________________________ R-help@stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help

**Next message:**Philippe Grosjean: "RE: [R] Windows: Graphics appear only partially"**Previous message:**Christian Hennig: "Re: AW: [R] estimating number of clusters ("Null or more")"**In reply to:**Andrew C. Ward: "[R] Fast R implementation of Gini mean difference"**Next in thread:**Adelchi Azzalini: "Re: [R] Fast R implementation of Gini mean difference"

*
This archive was generated by hypermail 2.1.3
: Tue 01 Jul 2003 - 09:11:43 EST
*