From: ferdinand principia <ferd_principia_at_yahoo.ca>

Date: Wed 29 Jun 2005 - 08:14:07 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Wed Jun 29 08:29:45 2005

Date: Wed 29 Jun 2005 - 08:14:07 EST

Sorry,

I should specifiy in more detail what my data looks like. The data vectors (simulations) are mostly composed of floats (for which it's pretty unlikely to produce ties), but there are integer values to be found as well (up to 10% of vector elements). As I undestand, Marc's algo is not suited for this case.

Is there another solution?

Thanks

Ferdinand

- "Marc Schwartz (via MN)" <mschwartz@mn.rr.com> wrote:

> On Tue, 2005-06-28 at 13:03 -0400, ferdinand

*> principia wrote:
**> > Hi,
**> >
**> > I need to calculate Kendall's tau for large data
**> > vectors (length > 100'000).
**> > Is somebody aware of a faster algorithm or package
**> > function than "cor(, method="kendall")"?
**> > There are ties in the data to be considered
**> (Kendall's
**> > tau-b).
**> >
**> > Any suggestions?
**> >
**> > Regards
**> > Ferdinand
**>
**>
**> The time intensive part of the process is typically
**> the ranking/ordering
**> of the vector pairs to calculate the numbers of
**> concordant and
**> discordant pairs.
**>
**> If the number of _unique pairs_ in your data is
**> substantially less than
**> the number of total pairs (in other words, creating
**> a smaller 2d
**> contingency table from a pair of your vectors makes
**> sense), then the
**> following may be of help.
**>
**> # Calculate CONcordant Pairs in a table
**> # cycle through x[r, c] and multiply by
**> # sum(x elements below and to the right of x[r, c])
**> # x = table
**> concordant <- function(x)
**> {
**> x <- matrix(as.numeric(x), dim(x))
**>
**> # get sum(matrix values > r AND > c)
**> # for each matrix[r, c]
**> mat.lr <- function(r, c)
**> {
**> lr <- x[(r.x > r) & (c.x > c)]
**> sum(lr)
**> }
**>
**> # get row and column index for each
**> # matrix element
**> r.x <- row(x)
**> c.x <- col(x)
**>
**> # return the sum of each matrix[r, c] * sums
**> # using mapply to sequence thru each matrix[r, c]
**> sum(x * mapply(mat.lr, r = r.x, c = c.x))
**> }
**>
**> # Calculate DIScordant Pairs in a table
**> # cycle through x[r, c] and multiply by
**> # sum(x elements below and to the left of x[r, c])
**> # x = table
**> discordant <- function(x)
**> {
**> x <- matrix(as.numeric(x), dim(x))
**>
**> # get sum(matrix values > r AND < c)
**> # for each matrix[r, c]
**> mat.ll <- function(r, c)
**> {
**> ll <- x[(r.x > r) & (c.x < c)]
**> sum(ll)
**> }
**>
**> # get row and column index for each
**> # matrix element
**> r.x <- row(x)
**> c.x <- col(x)
**>
**> # return the sum of each matrix[r, c] * sums
**> # using mapply to sequence thru each matrix[r, c]
**> sum(x * mapply(mat.ll, r = r.x, c = c.x))
**> }
**>
**>
**> # Calculate Kendall's Tau-b
**> # x = table
**> calc.KTb <- function(x)
**> {
**> x <- matrix(as.numeric(x), dim(x))
**>
**> c <- concordant(x)
**> d <- discordant(x)
**>
**> n <- sum(x)
**> SumR <- rowSums(x)
**> SumC <- colSums(x)
**>
**> KTb <- (2 * (c - d)) / sqrt(((n ^ 2) -
**> (sum(SumR ^ 2))) * ((n ^ 2) -
**> (sum(SumC ^ 2))))
**>
**> KTb
**> }
**>
**>
**> Note that I made some modifications of the above,
**> relative to prior
**> versions that I have posted to handle large numbers
**> of pairs to avoid
**> integer overflows in summations. Hence the:
**>
**> x <- matrix(as.numeric(x), dim(x))
**>
**> conversion in each function.
**>
**> Now, create some random test data, with 100,000
**> elements in each vector,
**> sampling from 'letters', which would yield a 26 x 26
**> table:
**>
**> a <- sample(letters, 100000, replace = TRUE)
**> b <- sample(letters, 100000, replace = TRUE)
**>
**> > dim(table(a, b))
**> [1] 26 26
**>
**> > system.time(print(calc.KTb(table(a, b))))
**> [1] 0.0006906088
**> [1] 0.77 0.02 0.83 0.00 0.00
**>
**> Note that in the above, the initial table takes most
**> of the time:
**>
**> > system.time(table(a, b))
**> [1] 0.55 0.00 0.56 0.00 0.00
**>
**> Hence:
**>
**> > tab.ab <- table(a, b)
**> > system.time(print(calc.KTb(tab.ab)))
**> [1] 0.0006906088
**> [1] 0.25 0.01 0.27 0.00 0.00
**>
**>
**> I should note that I also ran:
**>
**> > system.time(print(cor(a, b, method = "kendall")))
**> [1] 0.0006906088
**> [1] 694.80 7.72 931.89 0.00 0.00
**>
**> Nice to know the results work out at least... :-)
**>
**>
**> I have not tested with substantially larger 2d
**> matrices, but would
**> envision that as the dimensions of the resultant
**> tabulation increases,
**> my method probably approaches and may even become
**> less efficient than
**> the approach implemented in cor(). Some testing
**> would validate this and
**> perhaps point to coding the concordant() and
**> discordant() functions in C
**> for improvement in timing.
**>
**> HTH,
**>
**> Marc Schwartz
**>
**>
*

>

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Wed Jun 29 08:29:45 2005

*
This archive was generated by hypermail 2.1.8
: Fri 03 Mar 2006 - 03:33:06 EST
*