Re: [R] Boxplot philosophy {was "Boxplot in R"}

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Tue 12 Jul 2005 - 07:51:36 EST


On 11-Jul-05 Martin Maechler wrote:

>>>>>> "AdaiR" == Adaikalavan Ramasamy <ramasamy@cancer.org.uk>
>>>>>>     on Mon, 11 Jul 2005 03:04:44 +0100 writes:

>
> AdaiR> Just an addendum on the philosophical aspect of doing
> AdaiR> this. By selecting the 5% and 95% quantiles, you are
> AdaiR> always going to get 10% of the data as "extreme" and
> AdaiR> these points may not necessarily be outliers. So when
> AdaiR> you are comparing information from multiple columns
> AdaiR> (i.e. boxplots), it is harder to say which column
> AdaiR> contains more extreme values compared to others etc.
>
> Yes, indeed!
>
> People {and software implementations} have several times provided
> differing definitions of how the boxplot whiskers should be defined.
>
> I strongly believe that this is very often a very bad idea!!
>
> A boxplot should be a universal means of communication and so one
> should be *VERY* reluctant to redefine the outliers.
>
> I just find that Matlab (in their statistics toolbox)
> does *NOT* use such a silly 5% / 95% definition of the whiskers,
> at least not according to their documentation.
> That's very good (and I wonder where you, Larry, got the idea of
> the 5 / 95 %).
> Using such a fixed percentage is really a very inferior idea to
> John Tukey's definition {the one in use in all implementations
> of S (including R) probably for close to 20 years now}.
>
> I see one flaw in Tukey's definition {which is shared of course
> by any silly "percentage" based ``outlier'' definition}:
>
> The non-dependency on the sample size.
>
> If you have 1000 (or even many more) points,
> you'll get more and more `outliers' even for perfectly normal data.
>
> But then, I assume John Tukey would have told us to do more
> sophisticated things {maybe things like the "violin plots"} than
> boxplot if you have really very many data points, since you may then
> want to see more features -- or he would have agreed to use
> boxplot(*, range = monotone_slowly_growing(n) )
> for largish sample sizes n.
>
> Martin Maechler, ETH Zurich

I happily agree with Martin's essay on Boxplot philosophy.

It would certainly confuse boxplot watchers if the interpretation of what they saw had to vary from case to case. Careful (and necessarily detailed) explanations in the text of what was different this time would not help much, and would defeat the primary objective of the boxplot, which is to present a summary of features of the data in a form which can be grasped visually very quickly indeed.
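To put rough numbers on the two whisker conventions discussed above, here is a minimal sketch (in Python, standard library only, purely for illustration). It assumes exactly normal data and uses population quartiles rather than sample hinges, so the figures are approximate. The function name `tukey_outlier_fraction` is my own, and the 1.5 multiplier is the usual default of the `range` argument in R's boxplot().

```python
from statistics import NormalDist


def tukey_outlier_fraction(coef=1.5):
    """Fraction of a normal population lying beyond Tukey's fences.

    Assumption: population quartiles are used, so this is only an
    approximation to what a finite sample would show.
    """
    z = NormalDist()                      # standard normal
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    upper_fence = q3 + coef * iqr
    # By symmetry the lower tail beyond q1 - coef * iqr is the same size.
    return 2 * (1 - z.cdf(upper_fence))


tukey_frac = tukey_outlier_fraction()     # a bit under 1% for coef = 1.5
percentile_frac = 0.10                    # the 5% / 95% rule flags 10% by construction

for n in (100, 1000, 10000):
    print(n, round(n * tukey_frac, 1), round(n * percentile_frac, 1))
```

Under these assumptions the percentile rule flags a tenth of the points whatever the data look like, while Tukey's rule flags well under one percent of normal data; in both cases the expected count still grows in proportion to n, which is the sample-size issue Martin raises.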

I'm sure many of us have at times felt some frustration at the rigidly precise numerical interpretations which Tukey imposed on the elements of his many EDA techniques; but this did ensure that the viewer really knew, at a glance, what he was looking at.

EDA brilliantly combined several aspects of "looking at data": selection of features of the data; highly efficient encoding of these, and of their inter-relationships, into a medium directly adapted to visual perception; robustness (so that the perceptions were not unstable with respect to wondering just what the underlying distribution might be); accessibility (in the sense of being truly understood) to non-theoreticians; and capacity to be implemented on primitive information technology.

Indeed, one might say that the "core team" of EDA consists of the techniques for which you need only pencil and paper.

Nevertheless, Tukey was no rigid dogmatist. His objective was always to give a good representation of the data, and he would happily shift his ground, or adapt a technique (albeit probably giving it a different name), or devise a new one, if that would be useful for the case in hand.
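As one illustration of such an adaptation, Martin's suggestion of a range that grows slowly with n could be sketched as follows. This is only one possible choice and not anything Tukey or Martin specified: I assume normal data and choose the multiplier so that the expected number of flagged points stays roughly constant (about 0.7, which is what the usual 1.5 gives at around n = 100), never going below 1.5. The name `growing_range` and the parameter `expected_flagged` are hypothetical.

```python
from statistics import NormalDist


def growing_range(n, expected_flagged=0.7):
    """Hypothetical sample-size-dependent whisker multiplier.

    Assumption: data are normal, and we want roughly `expected_flagged`
    points beyond the fences regardless of n. Never returns less than
    Tukey's usual 1.5.
    """
    if n < 1 or expected_flagged <= 0 or expected_flagged >= 2 * n:
        raise ValueError("need n >= 1 and 0 < expected_flagged < 2 * n")
    z = NormalDist()
    q3 = z.inv_cdf(0.75)
    iqr = 2 * q3                          # symmetric: q1 = -q3
    # Upper fence position that leaves expected_flagged / (2 n) in each tail.
    fence = z.inv_cdf(1 - expected_flagged / (2 * n))
    return max(1.5, (fence - q3) / iqr)


for n in (100, 1000, 10000, 100000):
    print(n, round(growing_range(n), 2))
```

The multiplier grows only very slowly with n (roughly like the square root of log n), so small samples would still be drawn with the familiar 1.5 and the viewer's usual reading of the plot is preserved there. Any such variant would of course need to be stated clearly, and probably given a different name.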

Best wishes to all,
Ted.



E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 11-Jul-05                                       Time: 22:19:47
------------------------------ XFMail ------------------------------
