Re: R-beta: Bug or feature? [+ better histograms]

Mon, 18 Aug 1997 08:43:16 +0930

```Date: Mon, 18 Aug 1997 08:43:16 +0930
To: Ross Ihaka <ihaka@stat.auckland.ac.nz>
Subject: Re: R-beta: Bug or feature?  [+ better histograms]

[On the differences between substitute() in R and S]

I would not suggest changing it, but somehow it has to be pretty
clearly signposted.

Below is the code that caused me to discover this difference, by
the way.  It also uncovered differences between polygon() and
cut() on R and S but these seem less important.

First a bit of an explanation.  Brian Ripley and I are firmly of
the opinion that, to be any use in teaching at all, histograms
should be nonparametric estimators of the probability density
function.  That is, the vertical scale should be a relative
frequency density scale.  It seems impossible to get this in R, so
here is a function (+ a few extras) that does give you a "true"
histogram, (unless you are peverse enough to request otherwise.)

truehist <- function(data, nbins = nclass.scott(data), h, x0 =
-h/1000, breaks, prob = T, xlim = range(breaks), ymax =
max(est), col = NULL, border = "black", ylab = if(prob)
"Density" else "Frequency", xlab =
deparse(substitute(data)), bty = "n", ...)
{
xlab <- xlab		# must come first!
data <- data[!is.na(data)]
if(missing(breaks)) {
if(missing(h)) h <- diff(pretty(data, nbins))[1]
first <- floor((min(data) - x0)/h)
last <- ceiling((max(data) - x0)/h)
breaks <- x0 + h * c(first:last)
}
if(any(diff(breaks) <= 0)) stop("breaks must be strictly increasing")
if(min(data) < min(breaks) || max(data) > max(breaks))
stop("breaks do not cover the data")
db <- diff(breaks)
if(!prob && sqrt(var(db)) > mean(db)/1000)
warning("Uneven breaks with prob = F will give a misleading plot")
o <- data == breaks[1]		# cut has no 'include.lowest'
if(any(o)) data[o] <- data[o] + .Machine\$single.eps
bin <- cut(data, breaks)
est <- tabulate(bin, length(levels(bin)))
if(prob) est <- est/(diff(breaks) * length(data))
n <- length(breaks)
plot(xlim, c(0, ymax), type = "n", xlab = xlab, ylab = ylab, bty = bty)
rect(breaks[-n], 0, breaks[-1], est, col = col, border = border)
invisible()
}

nclass.scott <- function(x) {
h <- 3.5 * sqrt(var(x)) * length(x)^{-1/3}
ceiling(diff(range(x))/h)
}

nclass.freq <- function(x) {
h <- 2.15 * sqrt(var(x)) * length(x)^{-1/5}
ceiling(diff(range(x))/h)
}

--
Bill Venables, Head, Dept of Statistics,    Tel.: +61 8 8303 5418
University of Adelaide,                     Fax.: +61 8 8303 3696