Re: [R] sample function

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Fri 11 Mar 2005 - 21:59:55 EST


On 11-Mar-05 Martin C. Martin wrote:
> "hist" is lumping things together.
>
> Try:
> sum(temp == 0)
>
> compare to the height of the left most bar.
>
> Is this a bug in hist?
>
> - Martin

Well, not a bug strictly speaking since "it works as documented", but I do think it's not necessarily a happy choice.

The unsuspecting (like Martin) will step into holes even after reading "?hist", since the truths are rather deeply (and I think somewhat obliquely) hidden. "?hist" leads you to look up "?nclass.Sturges", which in turn only mentions "Sturges' formula" and invites you to read V&R's MASS book and other references in the hope of further clarification -- all a bit much when you just want to draw a histogram, which ought to be kid's stuff! Not to mention the parameters "include.lowest" and "right", whose combined effect is not too obvious.
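To make the effect of "right" concrete, here is a small sketch. The vector x and the breaks b are made up purely for illustration (they are not Martin's data), and plot=FALSE is used only so that the counts can be inspected without drawing anything.

```r
# Made-up integer data and integer breaks, for illustration only
x <- c(0, 0, 1, 1, 1, 2, 3, 3)
b <- 0:3

# Defaults: right = TRUE, include.lowest = TRUE
# Cells are [0,1], (1,2], (2,3], so the 0s and the 1s share the first cell
right_closed <- hist(x, breaks = b, plot = FALSE)$counts
right_closed   # 5 1 2

# right = FALSE (include.lowest still TRUE)
# Cells are [0,1), [1,2), [2,3], so now the 2s and the 3s share the last cell
left_closed <- hist(x, breaks = b, right = FALSE, plot = FALSE)$counts
left_closed    # 2 3 3
```

Either way, one end cell swallows two distinct integer values whenever the breaks sit exactly on the data values.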

I'd like to repeat the sort of hint I occasionally give:

In using R, if there's any doubt it is best to spell out exactly what you want rather than expecting the functions to agree with what you want. R functions are often more complex and subtle than you might suspect.

In this particular case,

  hist(temp,breaks= -0.5+(0:14) )

will produce the sort of thing which is wanted. One could interpret the results which Martin reported as due to a sort of "confusion" (but on whose part -- R or Martin?) over the fact that "hist" is designed to deal with "continuous" values, while his sample consists of integers.
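As a rough check of the above, the following sketch compares the default breaks with the half-integer ones. The data here are an assumption: I do not have Martin's "temp", so a sample from 0:12 with an arbitrary seed stands in for it.

```r
# Stand-in for Martin's data (assumed: integers sampled from 0:12)
set.seed(1)
temp <- sample(0:12, 200, replace = TRUE)

# Default breaks versus breaks placed midway between the integers
h_default <- hist(temp, plot = FALSE)
h_centred <- hist(temp, breaks = -0.5 + (0:14), plot = FALSE)

h_default$breaks       # the breaks that hist chose by itself
h_default$counts[1]    # everything from the lowest break up to the second one
h_centred$counts[1]    # the zeros only
sum(temp == 0)         # Martin's check
```

With the half-integer breaks each cell contains exactly one integer, so the height of the leftmost bar agrees with sum(temp == 0).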

For that particular case, one could also use "table" or "barchart", as has been suggested by David Scott, which would produce a plot of similar appearance; but this is not in the "histogram family" despite appearances, since it is not primarily a "quantitative" plot (i.e. respecting the numerical values and their numerical comparisons), but more a "category count". In particular, natural variants of the above "hist" command such as

  hist(temp,breaks= -0.5+2*(0:7) )

(which corresponds to binning by different intervals) do not lie so easily in the "table" or "barchart" domain.
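A sketch of the difference, again with assumed stand-in data (a sample from 0:12, arbitrary seed): "table" gives one count per distinct value, while the width-2 "hist" cells simply add up neighbouring values, which takes a little extra work to reproduce from the table.

```r
# Stand-in for Martin's data (assumed: integers sampled from 0:12)
set.seed(1)
temp <- sample(0:12, 200, replace = TRUE)

# One count per integer value 0..13 (13 never occurs; it only pads to an even length)
per_value <- table(factor(temp, levels = 0:13))

# Width-2 cells: {0,1}, {2,3}, ..., {12,13}
h_wide <- hist(temp, breaks = -0.5 + 2 * (0:7), plot = FALSE)

# The same counts rebuilt by hand from the table, two values at a time
paired <- colSums(matrix(as.vector(per_value), nrow = 2))

h_wide$counts
paired
```

Dropping plot=FALSE draws the histogram; barplot(per_value) would draw the category-count version for comparison.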

And I don't agree with David's comment that "No, hist is the wrong thing to use to display this data."

In so far as these data are considered to be numerical values of which one wants a view of their distribution, then "hist" is entirely appropriate, as for any other numerical variable. The only question is how to get this to happen appropriately.

Would David make the same comment about data sampled from (0:5000) instead of (0:12)?

Best wishes to all,
Ted.



E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 11-Mar-05                                       Time: 10:59:55
------------------------------ XFMail ------------------------------

Received on Mon Mar 14 10:34:03 2005