# [Rd] 'partial' in sort() inefficient?

From: Deepayan Sarkar <deepayan.sarkar_at_gmail.com>
Date: Fri 25 Nov 2005 - 20:25:50 GMT

I often need to work with large vectors whose distribution I want to summarize by Q-Q plots. Since the vectors are large, I use a subset of quantiles, e.g.

quantile(x, probs = ppoints(1000))

Unfortunately, this seemed to be taking too long for large x (much longer than 'sort'). I initially thought maybe quantile was doing something sophisticated (which I don't really need with a large data set), so I would write something simple myself. I did, but then I noticed that 'quantile' was doing essentially the same thing, with the exception that it was calling 'sort' with a non-null 'partial' argument.

?sort says:

```     If 'partial' is not 'NULL', it is taken to contain indices of
elements of 'x' which are to be placed in their correct positions
by partial sorting.  After the sort, the values specified in
'partial' are in their correct position in the sorted array.  Any
values smaller than these values are guaranteed to have a smaller
index in the sorted array and any values which are greater are
guaranteed to have a bigger index in the sorted array.  This is
included for efficiency, and many of the options are not available
for partial sorting.

```

However, rather than being efficient, this seems to considerably slow things down (I haven't checked memory efficiency). Consider the following code (imitating what 'quantile' does by default):

probs <- ppoints(1000)

for (i in seq(5000, 50000, 5000))
{

```     x <- rnorm(i)
n <- length(x)
index <- 1 + (n - 1) * probs
lo <- floor(index)
hi <- ceiling(index)
keep <- as.integer(unique(c(lo, hi)))
cat(system.time(y1 <- sort(x, partial = keep))[1])
cat("\t")
cat(system.time(y2 <- sort(x))[1])
cat("\t")
cat(round(max(abs( y1 - y2 )), digits = 3))
cat("\t")
cat(max(abs( y1[keep] - y2[keep] )))
cat("\n")
```

}

The first two columns in the output are timings for 'sort' with and without 'partial', the last two columns are just a rough check that partial sorting is doing what it claims. With R-2.2:

```0.78    0       0.031   0
1.64    0.01    0.565   0
2.59    0.01    0.646   0
3.44    0.01    0.487   0
4.4     0.01    0.569   0
5.26    0.01    0.642   0
6.29    0.01    1.071   0
7.18    0.02    0.566   0
8.18    0.02    1.094   0
9.01    0.03    0.89    0

> version

_
platform i686-pc-linux-gnu
arch     i686
os       linux-gnu
```

system i686, linux-gnu
status Patched
major 2
minor 2.0
year 2005
month 10
day 16
svn rev 35911
language R

This also holds on (a faster machine running) r-devel

```0.39    0       0.176   0
0.85    0.01    0.62    0
1.32    0       0.193   0
1.8     0       0.949   0
2.29    0       0.73    0
2.77    0       1.185   0
3.25    0.01    0.813   0
3.75    0.01    1.171   0
4.21    0.01    0.827   0
4.74    0.01    0.372   0

> version

_
platform x86_64-unknown-linux-gnu
arch     x86_64
os       linux-gnu
```

system x86_64, linux-gnu
status Under development (unstable)
major 2
minor 3.0
year 2005
month 11
day 25
svn rev 36468
language R

Speed improves when the number of 'partial' indices is small, but even for 10 indices plain 'sort' is faster.

Am I missing something? I haven't checked if NA's make a difference or if there might be memory usage issues, but even so, could 'quantile' at least have an option to disable partial sorting?

Deepayan

R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sat Nov 26 07:34:22 2005

This archive was generated by hypermail 2.1.8 : Sat 26 Nov 2005 - 02:21:11 GMT