[R] Outlier identification according to Hardin & Rocke (1999)

From: Kevin Wright (kwright@eskimo.com)
Date: Thu 27 May 2004 - 05:31:03 EST

```Message-id: <200405261931.MAA11989@eskimo.com>

```

I'm trying to use a paper by Hardin & Rocke: http://handel.cipic.ucdavis.edu/~dmrocke/Robdist5.pdf
as a guide for a function to identify outliers in multivariate data. Attached below is a function that is my attempt to reproduce their method and also a test to see what fraction of the data are identified as outliers. Using this function I am able to reproduce their results regarding the asymptotic chi-square method, but not their new method using an asymptotic F method. In particular, the asymptotic F method (and adjusted F method) seem to give critical distances that are too large and therefore no (or few) points are identified as outliers.

I have tried unsuccessfully to contact the authors of the paper to seek further information and / or numerical examples.

I would be most interested if anybody has used the method of Hardin & Rocke or can take the function I have provided below and modify it to reproduce their results. Note: The mvtnorm package is required.

Best,

Kevin Wright

outliers.id <- function(x, alpha=.05){
# See: Hardin & Rocke (1999), "The Distribution of Robust Distances"
# http://handel.cipic.ucdavis.edu/~dmrocke/Robdist5.pdf

# See: Hardin & Rocke (2002), "Outlier Detection in the Multiple Cluster
# Setting Using the Minimum Covariance Determinant Estimator"
# http://bioinfo.cipic.ucdavis.edu/publications/print_pub?pub_id=736&category=1

# Drop factors first
factors <- names(x)[sapply(x,is.factor)]
if(length(factors)>0)
x <- x[-factors]

# Get the robust location/scale estimates
require(MASS)
covResult <- cov.rob(x)

# Calculate the mahalanobis distance for each datum
distance <- mahalanobis(x,covResult\$center,covResult\$cov)

n <- nrow(x)
p <- ncol(x)
h <- floor((n+p+1)/2)

# Asymptotic chi-square method (page 11)
# Often identifies too many points as outliers
critical.chi <- qchisq(1-alpha, p)
cat("Chi square critical distance:", critical.chi,"\n")

# Now the approximate F method. First estimate c (page 19)
c <- pchisq(qchisq(1-h/n, p),p+2) / (h/n)

# Now to estimate m (page 22)
a <- (n-h)/n
qa <- qchisq(1-a,p)
ca <- (1-a)/pchisq(qa,p+2)
c2 <- -pchisq(qa,p+2)/2
c3 <- -pchisq(qa,p+4)/2
c4 <- 3 * c3
b1 <- ca * (c3 - c4) / (1-a)
b2 <- 0.5 + ca/(1-a) * (c3 - qa/p * (c2 + (1-a)/2))
v1 <- (1-a) * b1^2 * (a * (ca * qa/p -1)^2 -1) -
2 * c3 * ca^2 * (3* (b1 - p * b2)^2 + (p+2) * b2 * (2*b1 - p*b2))
v2 <- n * (b1 * (b1 - p*b2) * (1-a))^2 * ca^2
v <- v1/v2
m <- 2/(ca^2 * v)

# (page 17)
critical.F.asy <- p*m * qf(1-alpha, p, m-p+1) / (c * (m-p+1))
cat("Asymptotic F critical distance:",critical.F.asy,"\n")

# The small-sample (hundreds of points) adjustment to m.
# Hardin & Rocke, 2002, page 631
m <- m * exp(0.725 - 0.00663*p -0.0780 * log(n))
# Finally, the critical point (using adjusted m)
critical.F.adj <- p*m*qf(1-alpha, p, m-p+1) / (c * (m-p+1))

#outliers <- as.integer(distance > critical)
x\$Distances <- distance
#x\$Outliers <- outliers
attr(x,"critical.chi") <- critical.chi
attr(x,"critical.F.asy") <- critical.F.asy
return(x)
}

# Try to reproduce the tables in the paper
if(FALSE){
require(mvtnorm)
# Simulate data
n <- 500;p <- 5
dat <- rmvnorm(n,rep(0,p),diag(p))

# Identify outliers
dat.out <- outliers.id(as.data.frame(dat),alpha=.05)

# Now see what percent are identified as outliers
cat("Chi-square \n")
100*sum(dat.out\$Distances>attr(dat.out,"critical.chi"))/n
cat("Approximate F asymptotic \n")
100*sum(dat.out\$Distances>attr(dat.out,"critical.F.asy"))/n