# [Rd] Suggestions to speed up median() and has.na()

From: Henrik Bengtsson <hb_at_maths.lth.se>
Date: Mon 10 Apr 2006 - 17:37:54 GMT

Hi,

This is what the functions look like today:

> median
function (x, na.rm = FALSE)
{
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")
    if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)
    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        sort(x, partial = half)[half]
    }
    else {
        sum(sort(x, partial = c(half, half + 1))[c(half, half + 1)])/2
    }
}
<environment: namespace:stats>
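As a small illustration of what the partial sort in median() relies on: sort(x, partial = k) only guarantees that position k holds the value a full sort would put there, which is all the median needs. The sketch below uses made-up data and illustrative variable names (p, q, lo), and uses whole-number positions in the even case for clarity.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
x <- rnorm(11)                     # odd length: the median is a single element
half <- (length(x) + 1) / 2        # 6, the middle position
p <- sort(x, partial = half)       # only position 'half' is guaranteed in place
print(p[half] == sort(x)[half])    # same value as a full sort at that position
print(p[half] == median(x))

y <- rnorm(10)                     # even length: average the two middle elements
lo <- length(y) / 2                # 5, lower middle position
q <- sort(y, partial = c(lo, lo + 1))
print(all.equal(sum(q[c(lo, lo + 1)]) / 2, median(y)))
```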

x <- rnorm(10e6)
system.time(median(x))/system.time(median2(x))

where median2() is the function with the above replacements, gives a speed-up of about 20-25%.

A poor man's alternative to (2) is to allow a third value for 'na.rm', say NA, which indicates that we know there are no NAs in 'x'.
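To give a rough idea of what such a flag would skip: any(is.na(x)) first builds a logical vector as long as 'x' and then scans it, even when 'x' contains no NAs. A minimal sketch, with an illustrative vector size and a made-up variable name (na_flags); timings will differ between machines.

```r
x <- rnorm(1e6)               # illustrative size; NA-free by construction
na_flags <- is.na(x)          # full-length logical vector, built before any() runs
print(length(na_flags) == length(x))
print(any(na_flags))          # FALSE, but only after scanning every element

# Rough cost of the check alone, to compare against median(x) on the same data
print(system.time(for (kk in 1:10) any(is.na(x))))
```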

With naive benchmarking, the original median() is approximately 50% slower than a version with the above two improvements when passed a large 'x' with no NAs:

median2 <- function (x, na.rm = FALSE) {
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")

    if (is.na(na.rm)) {
        # na.rm = NA: the caller asserts that 'x' has no NAs; skip the check
    } else if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)

    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        .Internal(psort(x, half))[half]
    }
    else {
        sum(.Internal(psort(x, c(half, half + 1)))[c(half, half + 1)])/2
    }
}

x <- rnorm(10e5)
K <- 10
t0 <- system.time({
    for (kk in 1:K)
        y <- median(x);
})
print(t0) #  1.82 0.14 1.98 NA NA
t1 <- system.time({
    for (kk in 1:K)
        y <- median2(x, na.rm=NA);
})
print(t1) #  1.25 0.06 1.34 NA NA
print(t0/t1) #  1.456000 2.333333 1.477612 NA NA
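The same kind of comparison can be wrapped in a small helper so that it is easy to repeat. This is only a sketch: time_ratio() and median_nocheck() are hypothetical names, and median_nocheck() is a stand-in that skips the NA check using sort(partial = ) rather than the .Internal(psort()) call in median2() above. It assumes a non-empty, NA-free numeric 'x'.

```r
# Hypothetical stand-in: median without the NA check (not the median2() above)
median_nocheck <- function(x) {
  n <- length(x)
  lo <- (n + 1) %/% 2        # lower middle position
  hi <- n %/% 2 + 1          # upper middle position (equals lo when n is odd)
  mean(sort(x, partial = unique(c(lo, hi)))[c(lo, hi)])
}

# Hypothetical helper: ratio of elapsed times of f(x) and g(x) over K repetitions
time_ratio <- function(f, g, x, K = 10) {
  tf <- system.time(for (kk in seq_len(K)) f(x))[["elapsed"]]
  tg <- system.time(for (kk in seq_len(K)) g(x))[["elapsed"]]
  tf / tg
}

x <- rnorm(1e5)                               # illustrative size
print(all.equal(median_nocheck(x), median(x)))
r <- time_ratio(median, median_nocheck, x, K = 5)
print(r)                                      # machine-dependent; no value claimed
```

Very short runs can report an elapsed time of zero, so increase the vector size or K if the ratio comes out as Inf or NaN.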

/Henrik

R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue Apr 11 04:09:03 2006
