Re: [R] Odp: Odp: outlying

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Wed, 20 Jun 2007 14:28:34 +0200

[Note: CC'ing to R-SIG-robust, the "Special Interest Group on

                                               using Robust Statistics in R" ]

>>>>> "PP" == Petr PIKAL <petr.pikal_at_precheza.cz> >>>>> on Tue, 19 Jun 2007 12:55:37 +0200 writes:

    PP> r-help-bounces_at_stat.math.ethz.ch napsal dne 19.06.2007     PP> 12:23:58:
>> Hi
>>
>> It often depends on your attitude to limits for outlying
>> observations. Boxplot has some identifying routine for
>> selecting outlying points.
>>
>> Any procedure usually requires somebody to choose which
>> observation is outlying and why. You can use e.g. all
>> values which are beyond some threshold based on sd but
>> that holds only if distribution is normal.

yes, and that's never true for the "alternative", i.e. for the case where there *are* outliers.

>> set.seed(1)
>> x<-rnorm(x)

    PP> Sorry, it shall be

    PP> x <- rnorm(1000)

    PP> ul <- mean(x) +3*sd(x)
    PP> ll <- mean(x) -3*sd(x)
    PP> beyond <- (x>ul)  | ( x <ll)
    PP> 
    PP> > x[beyond]
    PP> [1] 3.810277

>> Regards Petr

No, really, do NOT do the above!
It only works with very few and relatively mild outliers.

There are much more robust alternatives. I show them for the simple example

x <- c(1:10, 100)

  1. As mentioned by Petr, use instead what boxplot() does, just type boxplot.stats

  and ``see what to do''. This gives Median +/- 1.5 * IQR :   i.e.,

 ## Boxplot's default rule
 str(bp.st <- boxplot.stats(x))
 bp.st$stats[ c(1,5) ]
 ## 1 10

2) Use the recommendations of Hampel (1985)

   @ARTICLE{HamF85,

     author = 	"Hampel, F.",
     title = 	"The breakdown points of the mean combined with some
		     rejection rules", 
     journal = 	"Technometrics",
     year = 	1985,
     volume = 	27,
     pages = 	"95--107",

   }    

   i.e. Median +/- 5 * MAD where MAD = is the *NON*-scaled MAD,

                                                                 ~= mad(*, constant=1)    i.e., in R

   M <- median(x)
   (FH.interval <- M + c(-5, 5) * mad(x, center=M, const=1))    ## -9 21

3) or something slightly more efficient (under approximate   normality of the non-outliers),
  e.g., based on MASS::rlm() :

 n <- length(x)
 s.rm <- summary(robm <- MASS::rlm(x ~ 1))  s.rm

 (cc <- coef(s.rm))

 ## "approximate" robust degrees of freedom; this is a hack
 ##   which could well be correct
 ##   asymptotically {where the weights would be 0/1} :
 (df.resid <- sum(robm$w) - robm$rank)
 (Tcrit <- qt(0.995, df = df.resid))

 ## Std.error of mean ~= sqrt(1/n Var(X_i)) = 1/sqrt(n) sqrt(Var(X_i))  cc[,1] + c(-1,1) * sqrt(n) * Tcrit * cc[,"Std. Error"]  ## -6.391201 18.555177

---
Martin Maechler, ETH Zurich

______________________________________________
R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Wed 20 Jun 2007 - 12:38:46 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 20 Jun 2007 - 13:32:08 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.