[Rd] 'partial' in sort() inefficient?

From: Deepayan Sarkar <deepayan.sarkar_at_gmail.com>
Date: Fri 25 Nov 2005 - 20:25:50 GMT


I often need to work with large vectors whose distribution I want to summarize by Q-Q plots. Since the vectors are large, I use a subset of quantiles, e.g.

quantile(x, probs = ppoints(1000))
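(For concreteness, the kind of use I have in mind looks roughly like the sketch below; the data and names are purely illustrative.)

## Illustrative only: compare two large samples through 1000 quantiles each,
## instead of sorting and plotting all of the raw values.
x <- rnorm(1e6)
y <- rexp(1e6)
p <- ppoints(1000)
plot(quantile(x, probs = p), quantile(y, probs = p),
     xlab = "quantiles of x", ylab = "quantiles of y")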

Unfortunately, this seemed to take too long for large x (much longer than a plain 'sort'). I initially thought quantile might be doing something sophisticated (which I don't really need with a large data set), so I decided to write something simple myself. I did, but then noticed that 'quantile' was doing essentially the same thing, except that it was calling 'sort' with a non-NULL 'partial' argument.
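Roughly, the simple version looks like this (a sketch for reference, not my exact code; 'simpleQuantile' is just an illustrative name, and the body mirrors the default type-7 interpolation but uses a plain full sort):

## 'simpleQuantile' is an illustrative name, not a function in R
simpleQuantile <- function(x, probs)
{
     n <- length(x)
     index <- 1 + (n - 1) * probs
     lo <- floor(index)
     hi <- ceiling(index)
     x <- sort(x)                             # full sort, no 'partial'
     x[lo] + (index - lo) * (x[hi] - x[lo])   # linear interpolation
}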

?sort says:

     If 'partial' is not 'NULL', it is taken to contain indices of
     elements of 'x' which are to be placed in their correct positions
     by partial sorting.  After the sort, the values specified in
     'partial' are in their correct position in the sorted array.  Any
     values smaller than these values are guaranteed to have a smaller
     index in the sorted array and any values which are greater are
     guaranteed to have a bigger index in the sorted array.  This is
     included for efficiency, and many of the options are not available
     for partial sorting.
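(The guarantee itself is easy to verify on a small example; the code below is just illustrative.)

set.seed(1)
x <- rnorm(100)
y <- sort(x, partial = 50)
y[50] == sort(x)[50]      # the 50th order statistic is in its final position
all(y[1:49] <= y[50])     # everything before it is no larger
all(y[51:100] >= y[50])   # everything after it is no smaller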

However, rather than being efficient, this seems to slow things down considerably (I haven't checked memory efficiency). Consider the following code, which imitates what 'quantile' does by default:

probs <- ppoints(1000)

for (i in seq(5000, 50000, 5000))
{
     x <- rnorm(i)
     n <- length(x)
     ## the indices that 'quantile' would pass to sort(partial = ...)
     index <- 1 + (n - 1) * probs
     lo <- floor(index)
     hi <- ceiling(index)
     keep <- as.integer(unique(c(lo, hi)))
     ## user time for the partial sort and the full sort
     cat(system.time(y1 <- sort(x, partial = keep))[1])
     cat("\t")
     cat(system.time(y2 <- sort(x))[1])
     cat("\t")
     ## sanity checks: the full vectors may differ elsewhere, but the
     ## 'keep' positions must agree
     cat(round(max(abs( y1 - y2 )), digits = 3))
     cat("\t")
     cat(max(abs( y1[keep] - y2[keep] )))
     cat("\n")
}

The first two columns of the output are timings (user time) for 'sort' with and without 'partial'; the last two columns are a rough check that partial sorting does what it claims. With R 2.2.0 (Patched):

0.78    0       0.031   0
1.64    0.01    0.565   0
2.59    0.01    0.646   0
3.44    0.01    0.487   0
4.4     0.01    0.569   0
5.26    0.01    0.642   0
6.29    0.01    1.071   0
7.18    0.02    0.566   0
8.18    0.02    1.094   0
9.01    0.03    0.89    0

> version
platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status   Patched
major    2
minor    2.0
year     2005
month    10
day      16
svn rev  35911
language R

The same holds with r-devel (on a faster machine):

0.39    0       0.176   0
0.85    0.01    0.62    0
1.32    0       0.193   0
1.8     0       0.949   0
2.29    0       0.73    0
2.77    0       1.185   0
3.25    0.01    0.813   0
3.75    0.01    1.171   0
4.21    0.01    0.827   0
4.74    0.01    0.372   0

> version
platform x86_64-unknown-linux-gnu
arch     x86_64
os       linux-gnu
system   x86_64, linux-gnu
status   Under development (unstable)
major    2
minor    3.0
year     2005
month    11
day      25
svn rev  36468
language R

Speed improves when the number of 'partial' indices is small, but even with only 10 indices a plain 'sort' is faster.
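For example, a quick (machine-dependent) way to compare the two cases with only 10 partial indices:

## 'keep10' is an illustrative name; timings will vary by machine
x <- rnorm(5e4)
keep10 <- as.integer(round(seq(1, length(x), length.out = 10)))
system.time(sort(x, partial = keep10))   # partial sort at 10 positions
system.time(sort(x))                     # full sort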

Am I missing something? I haven't checked whether NAs make a difference or whether there are memory-usage issues, but even so, could 'quantile' at least have an option to disable partial sorting?
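(As a sanity check on the sketch above: assuming the illustrative simpleQuantile() from earlier, its result should agree with quantile()'s default type-7 output, up to floating point, while avoiding the partial sort.)

x <- rnorm(1e5)
p <- ppoints(1000)
all.equal(unname(quantile(x, probs = p)), simpleQuantile(x, p))  # should be TRUE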

Deepayan



R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
