[R] Boxplot philosophy {was "Boxplot in R"}

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Mon 11 Jul 2005 - 22:36:35 EST

>>>>> "AdaiR" == Adaikalavan Ramasamy <ramasamy@cancer.org.uk>
>>>>>     on Mon, 11 Jul 2005 03:04:44 +0100 writes:

    AdaiR> Just an addendum on the philosophical aspect of doing
    AdaiR> this.  By selecting the 5% and 95% quantiles, you are
    AdaiR> always going to get 10% of the data as "extreme" and
    AdaiR> these points may not necessarily be outliers.  So when
    AdaiR> you are comparing information from multiple columns
    AdaiR> (i.e.  boxplots), it is harder to say which column
    AdaiR> contains more extreme values than the others, etc.

Yes, indeed!
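
A small sketch of that point on simulated data. The three distributions, the sample size and the helper name frac_flagged() are assumptions made only for illustration. Whatever the shape of the data, whiskers at the 5% and 95% quantiles flag about 10% of the points:

```r
# Hypothetical helper: fraction of points outside the 5% / 95% quantiles
frac_flagged <- function(x) {
  q <- quantile(x, c(0.05, 0.95))
  mean(x < q[1] | x > q[2])
}

set.seed(1)
samples <- list(normal  = rnorm(1000),        # no real outliers expected
                uniform = runif(1000),        # bounded, no outliers at all
                heavy   = rt(1000, df = 2))   # heavy tails, many extreme values

res <- sapply(samples, frac_flagged)
print(res)   # roughly 0.10 in every column, so the columns cannot be compared
```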

People {and software implementations} have several times come up with differing definitions of the boxplot whiskers.

I strongly believe that this is very often a very bad idea!!

A boxplot should be a universal means of communication, and so one should be *VERY* reluctant to redefine the outliers.

I just found that Matlab (in their statistics toolbox) does *NOT* use such a silly 5% / 95% definition of the whiskers, at least not according to their documentation. That's very good (and I wonder where you, Larry, got the idea of the 5 / 95 %).
Using such a fixed percentage is really a very inferior idea compared to John Tukey's definition {the one in use in all implementations of S (including R) probably for close to 20 years now}.
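
For reference, a minimal sketch of Tukey's rule as R implements it: the whiskers run to the most extreme data points that lie within 1.5 times the hinge spread of the hinges, and anything beyond is drawn individually. The simulated data and the variable names are assumptions for illustration; the manual computation is compared against boxplot.stats(), whose `coef' argument is what boxplot()'s `range' argument is passed to.

```r
set.seed(2)
x <- c(rnorm(50), 6)            # simulated data plus one gross value

h   <- fivenum(x)               # min, lower hinge, median, upper hinge, max
iqr <- h[4] - h[2]              # hinge spread
lo  <- h[2] - 1.5 * iqr         # lower fence
hi  <- h[4] + 1.5 * iqr         # upper fence

whiskers <- range(x[x >= lo & x <= hi])   # most extreme points inside the fences
outliers <- x[x < lo | x > hi]            # points drawn individually

bs <- boxplot.stats(x, coef = 1.5)        # what boxplot(x) uses by default
print(all.equal(bs$stats[c(1, 5)], whiskers))
print(outliers)
```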

I see one flaw in Tukey's definition {which is shared of course by any silly "percentage" based ``outlier'' definition}:

   The non-dependency on the sample size.

If you have 1000 (or even many more) points, you'll get more and more `outliers' even for perfectly normal data.
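
A quick simulation sketch of this effect (seed and sample sizes are arbitrary choices). For a normal population the fences sit at about +/- 2.7 standard deviations, so a fixed fraction of roughly 0.7% of the points falls outside, and the count of flagged points grows in proportion to n:

```r
# Population fraction of N(0,1) beyond Tukey's fences (quartiles +/- 1.5 * IQR)
q     <- qnorm(0.75)
p_out <- 2 * pnorm(-(q + 1.5 * 2 * q))    # about 0.007

set.seed(3)
n <- c(100, 1000, 10000, 100000)
flagged <- sapply(n, function(k) length(boxplot.stats(rnorm(k))$out))

# Simulated counts next to the counts expected under exact normality
print(cbind(n, flagged, expected = round(n * p_out, 1)))
```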

But then, I assume John Tukey would have told us to do more sophisticated things than a boxplot {maybe things like the "violin plots"}: if you have really very many data points, you may want to see more features. Or he would have agreed to use

   boxplot(*, range = monotone_slowly_growing(n) ) for largish sample sizes n.
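
One way such a function could look, purely as a sketch: slowly_growing_range() and its `target' argument are hypothetical names, and the rule itself is an assumption of this example, not something taken from Tukey or from R. It picks the whisker coefficient so that a normal sample of size n is expected to show about `target' flagged points, and never goes below the usual 1.5.

```r
# Hypothetical rule (an assumption for illustration, not part of R):
# keep the expected number of flagged points in a normal sample near `target`.
slowly_growing_range <- function(n, target = 0.7) {
  q <- qnorm(0.75)                     # upper quartile of N(0,1)
  z <- qnorm(1 - target / (2 * n))     # fence leaving a fraction target/n outside
  max(1.5, (z / q - 1) / 2)            # solve q + coef * 2 * q = z for coef
}

r <- sapply(c(100, 1000, 1e4, 1e6), slowly_growing_range)
print(r)                               # starts near 1.5 and grows slowly with n

set.seed(4)
x <- rnorm(10000)
boxplot(x, range = slowly_growing_range(length(x)))
```

With target = 0.7 the coefficient is essentially 1.5 at n = 100, so small samples are drawn exactly as before and only large samples get longer whiskers.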

Martin Maechler, ETH Zurich

    AdaiR> Regards, Adai

    AdaiR> On Sun, 2005-07-10 at 18:10 -0500, Larry Xie wrote:
>> I am trying to draw a plot like Matlab does:
>>
>> The upper extreme whisker represents 95% of the data;
>> The upper hinge represents 75% of the data;
>> The median represents 50% of the data;
>> The lower hinge represents 25% of the data;
>> The lower extreme whisker represents 5% of the data.
>>
>> It looks like:
>>
>>    ---      95%
>>     |
>>     |
>>  -------    75%
>>  |     |
>>  |-----|    50%
>>  |     |
>>  |     |
>>  -------    25%
>>     |
>>    ---      5%
>>
>> Can anyone give me some hints as to how to draw a boxplot like that?
>> What function does it? I tried boxplot() but couldn't figure it out.
>> If it's boxplot(), what arguments should I pass to the function? Thank
>> you for your help. I'd appreciate it.
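
For completeness, one possible way to draw the plot asked for, as a sketch on simulated data (this is an assumption about what is wanted, not a recommendation): let boxplot() compute its usual summary without plotting, overwrite the five statistics with the requested quantiles, and hand the result to bxp(). Note that quantile() gives quartiles rather than Tukey's hinges, which can differ slightly.

```r
set.seed(5)
x <- rnorm(200)                          # simulated data for illustration

b <- boxplot(x, plot = FALSE)            # standard summary, nothing drawn yet

# Replace whisker ends, box edges and median by the requested quantiles
b$stats[, 1] <- quantile(x, c(0.05, 0.25, 0.50, 0.75, 0.95))

# Points beyond the new whiskers are the ones drawn individually
b$out   <- x[x < b$stats[1, 1] | x > b$stats[5, 1]]
b$group <- rep(1, length(b$out))

bxp(b)                                   # draw from the modified summary
```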



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Mon Jul 11 22:43:24 2005
