Re: R-beta: Bug or feature? [+ better histograms]

Bill Venables (wvenable@attunga.stats.adelaide.edu.au)
Mon, 18 Aug 1997 08:43:16 +0930


Date: Mon, 18 Aug 1997 08:43:16 +0930
Message-Id: <9708172313.AA15474@attunga.stats.adelaide.edu.au>
From: Bill Venables <wvenable@attunga.stats.adelaide.edu.au>
To: Ross Ihaka <ihaka@stat.auckland.ac.nz>
Subject: Re: R-beta: Bug or feature?  [+ better histograms]
In-Reply-To: <199708170921.VAA03388@stat1.stat.auckland.ac.nz>

[On the differences between substitute() in R and S]

I would not suggest changing it, but somehow it has to be pretty
clearly signposted.

Below is the code that caused me to discover this difference, by
the way.  It also uncovered differences between polygon() and
cut() on R and S but these seem less important.

First a bit of an explanation.  Brian Ripley and I are firmly of
the opinion that, to be any use in teaching at all, histograms
should be nonparametric estimators of the probability density
function.  That is, the vertical scale should be a relative
frequency density scale.  It seems impossible to get this in R, so
here is a function (+ a few extras) that does give you a "true"
histogram, (unless you are peverse enough to request otherwise.)

truehist <- function(data, nbins = nclass.scott(data), h, x0 =
	 -h/1000, breaks, prob = T, xlim = range(breaks), ymax =
	 max(est), col = NULL, border = "black", ylab = if(prob)
	 "Density" else "Frequency", xlab =
	 deparse(substitute(data)), bty = "n", ...)
{
  xlab <- xlab		# must come first!
  data <- data[!is.na(data)]
  if(missing(breaks)) {
    if(missing(h)) h <- diff(pretty(data, nbins))[1]
    first <- floor((min(data) - x0)/h)
    last <- ceiling((max(data) - x0)/h)
    breaks <- x0 + h * c(first:last)
  }
  if(any(diff(breaks) <= 0)) stop("breaks must be strictly increasing")
  if(min(data) < min(breaks) || max(data) > max(breaks))
     stop("breaks do not cover the data")
  db <- diff(breaks)
  if(!prob && sqrt(var(db)) > mean(db)/1000)
    warning("Uneven breaks with prob = F will give a misleading plot")
  o <- data == breaks[1]		# cut has no 'include.lowest'
  if(any(o)) data[o] <- data[o] + .Machine$single.eps
  bin <- cut(data, breaks)
  est <- tabulate(bin, length(levels(bin)))
  if(prob) est <- est/(diff(breaks) * length(data))
  n <- length(breaks)
  plot(xlim, c(0, ymax), type = "n", xlab = xlab, ylab = ylab, bty = bty)
  rect(breaks[-n], 0, breaks[-1], est, col = col, border = border)
  invisible()
}

nclass.scott <- function(x) {
    h <- 3.5 * sqrt(var(x)) * length(x)^{-1/3}
    ceiling(diff(range(x))/h)
}

nclass.freq <- function(x) {
    h <- 2.15 * sqrt(var(x)) * length(x)^{-1/5}
    ceiling(diff(range(x))/h)
}

-- 
Bill Venables, Head, Dept of Statistics,    Tel.: +61 8 8303 5418
University of Adelaide,                     Fax.: +61 8 8303 3696
South AUSTRALIA.     5005.   Email: Bill.Venables@adelaide.edu.au
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request@stat.math.ethz.ch
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=