From: Martin Maechler <maechler_at_stat.math.ethz.ch>

Date: Mon 11 Jul 2005 - 22:36:35 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Mon Jul 11 22:43:24 2005

>>>>> "AdaiR" == Adaikalavan Ramasamy <ramasamy@cancer.org.uk>
>>>>>     on Mon, 11 Jul 2005 03:04:44 +0100 writes:

    AdaiR> Just an addendum on the philosophical aspect of doing
    AdaiR> this. By selecting the 5% and 95% quantiles, you are
    AdaiR> always going to get 10% of the data as "extreme" and
    AdaiR> these points may not necessarily be outliers. So when
    AdaiR> you are comparing information from multiple columns
    AdaiR> (i.e. boxplots), it is harder to say which column
    AdaiR> contains more extreme values compared to others etc.

Yes, indeed!

People {and software implementations} have several times come up with differing definitions of the boxplot whiskers.

I strongly believe that this is very often a very bad idea!!

A boxplot should be a universal means of communication, and so one should be *VERY* reluctant to redefine the outliers.

I have just found that Matlab (in their statistics toolbox)
does *NOT* use such a silly 5% / 95% definition of the whiskers,
at least not according to their documentation.
That's very good (and I wonder where you, Larry, got the idea of
the 5 / 95 %).

Using such a fixed percentage is really a very inferior idea to
John Tukey's definition {the one in use in all implementations
of S (including R) probably for close to 20 years now}.
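
To make the difference concrete, here is a small sketch (not part of the original exchange) that compares Tukey's rule, as computed by boxplot.stats() with its default coef = 1.5, against the fixed 5% / 95% rule. The data are made up: 200 standard normal values plus one gross outlier at 6.

```r
set.seed(42)
x <- c(rnorm(200), 6)   # made-up data: normal sample plus one gross outlier

# Tukey's rule: flag points lying more than 1.5 hinge spreads beyond the hinges
st     <- boxplot.stats(x, coef = 1.5)
hinges <- st$stats[c(2, 4)]
fences <- hinges + c(-1.5, 1.5) * diff(hinges)
tukey_out <- x[x < fences[1] | x > fences[2]]   # same points as st$out

# Fixed-percentage rule: flag everything outside the 5% and 95% quantiles
q       <- quantile(x, c(0.05, 0.95))
pct_out <- x[x < q[1] | x > q[2]]

length(tukey_out)              # depends on how the data actually look
length(pct_out) / length(x)    # roughly 10%, whatever the data look like
```

The point of the comparison is that the percentage rule flags about 10% of the points by construction, so it cannot distinguish a column with genuine outliers from one without.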

I see one flaw in Tukey's definition {which is shared of course by any silly "percentage" based ``outlier'' definition}:

The non-dependency on the sample size.

If you have 1000 (or even many more) points, you'll get more and more `outliers' even for perfectly normal data.
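
A quick simulation sketch of this effect, assuming standard normal data and the default coef = 1.5 (the sample sizes are arbitrary choices for illustration). Under normality the probability of a point falling beyond Tukey's fences is fixed, so the expected number of flagged points grows linearly with n.

```r
set.seed(1)

# Probability that a normal observation falls beyond Tukey's fences,
# using the population quartiles: Q3 + 1.5 * (Q3 - Q1), with Q1 = -Q3
q3          <- qnorm(0.75)
upper_fence <- q3 + 1.5 * (2 * q3)
p_out       <- 2 * pnorm(upper_fence, lower.tail = FALSE)   # about 0.007

# Count the points flagged by boxplot.stats() for growing sample sizes
n_values <- c(100, 1000, 10000, 100000)
flagged  <- sapply(n_values, function(n) length(boxplot.stats(rnorm(n))$out))

print(data.frame(n = n_values, flagged = flagged, expected = n_values * p_out))
```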

But then, I assume John Tukey would have told us to do more sophisticated things {maybe things like "violin plots"} than a boxplot: if you have really very many data points, you may want to see more features. Or he would have agreed to use

boxplot(*, range = monotone_slowly_growing(n) ) for largish sample sizes n.
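
One possible reading of monotone_slowly_growing(n), offered only as a sketch: pick the whisker coefficient so that, for normal data, the expected number of flagged points stays roughly constant instead of growing with n. The function name slowly_growing_range() and the target expected_out = 0.7 (about what coef = 1.5 gives at n = 100) are assumptions made for this illustration, not anything defined in R or proposed above.

```r
# Hypothetical helper: whisker coefficient that keeps the expected number of
# flagged points near 'expected_out' for normal data, never below 1.5
slowly_growing_range <- function(n, expected_out = 0.7) {
  q3 <- qnorm(0.75)
  # fence position leaving expected_out / n probability in the two tails
  z  <- qnorm(expected_out / (2 * n), lower.tail = FALSE)
  pmax(1.5, (z - q3) / (2 * q3))
}

set.seed(1)
x <- rnorm(1e5)
k <- slowly_growing_range(length(x))

n_default <- length(boxplot.stats(x, coef = 1.5)$out)  # several hundred points
n_scaled  <- length(boxplot.stats(x, coef = k)$out)    # only a handful
c(coef = k, default = n_default, scaled = n_scaled)
```

With such a helper the plot itself would be drawn as boxplot(x, range = slowly_growing_range(length(x))). Any rule of this kind gives up the universality argued for above, so it would need to be stated alongside the plot.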

Martin Maechler, ETH Zurich

AdaiR> Regards, Adai

AdaiR> On Sun, 2005-07-10 at 18:10 -0500, Larry Xie wrote:

    >> I am trying to draw a plot like Matlab does:
    >>
    >> The upper extreme whisker represents 95% of the data;
    >> The upper hinge represents 75% of the data;
    >> The median represents 50% of the data;
    >> The lower hinge represents 25% of the data;
    >> The lower extreme whisker represents 5% of the data.
    >>
    >> It looks like:
    >>
    >>      ---        95%
    >>       |
    >>       |
    >>    -------      75%
    >>    |     |
    >>    |-----|      50%
    >>    |     |
    >>    |     |
    >>    -------      25%
    >>       |
    >>      ---         5%
    >>
    >> Anyone can give me some hints as to how to draw a boxplot like that?
    >> What function does it? I tried boxplot() but couldn't figure it out.
    >> If it's boxplot(), what arguments should I pass to the function? Thank
    >> you for your help. I'd appreciate it.


This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:33:28 EST