Re: [R] what does cut(data, breaks=n) actually do?

From: Tony Plate <tplate_at_acm.org>
Date: Fri, 14 Dec 2007 22:45:46 -0700


Peter Dalgaard wrote:
> melissa cline wrote:

>> Hello,
>>
>> I'm trying to bin a quantity into 2-3 bins for calculating entropy and
>> mutual information.  One of the approaches I'm exploring is the cut()
>> function, which is what the mutualInfo function in binDist uses.  When it's
>> called in the format cut(data, breaks=n), it somehow splits the data into n
>> distinct bins.  Can anyone tell me how cut() decides where to cut?
>>
>>   

> This is one case where reading the actual R code is easier than
> explaining what it does. From cut.default:
>
> if (length(breaks) == 1) {
>     if (is.na(breaks) | breaks < 2)
>         stop("invalid number of intervals")
>     nb <- as.integer(breaks + 1)
>     dx <- diff(rx <- range(x, na.rm = TRUE))
>     if (dx == 0)
>         dx <- rx[1]
>     breaks <- seq.int(rx[1] - dx/1000, rx[2] + dx/1000, length.out = nb)
> }
>
> so basically it takes the range, extends it a bit, and splits it into
> <breaks> equally long segments.
>
> (For the sometimes more attractive option of splitting into groups of
> roughly equal size, there is cut2 in the Hmisc package, or use quantile())
>
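
To make the quoted logic concrete, here is a minimal sketch that rebuilds the breaks by hand and checks that they bin the data the same way as cut(x, breaks=n). The data vector and the names manual_breaks, by_count and by_breaks are made up for illustration. The exact breakpoints may differ slightly between R versions, so the sketch compares bin codes rather than the breakpoints themselves.

```r
# Illustrative data (an assumption, not from the original question)
x <- c(0, 1, 2.5, 4, 10)
n <- 3

# Rebuild the breaks as in the quoted cut.default code:
# take the range, extend it by 0.1% at each end, split into n equal segments
rx <- range(x, na.rm = TRUE)
dx <- diff(rx)
if (dx == 0) dx <- rx[1]
manual_breaks <- seq(rx[1] - dx/1000, rx[2] + dx/1000, length.out = n + 1)

# Integer bin codes from cut() itself and from the hand-built breaks
by_count  <- cut(x, breaks = n, labels = FALSE)
by_breaks <- cut(x, breaks = manual_breaks, labels = FALSE)

print(manual_breaks)
print(identical(by_count, by_breaks))
```

Note that the bins have equal width, not equal counts: here three of the five values land in the first bin.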

It can be a bit dangerous to use quantile() to provide breaks for cut(), because quantiles can be non-unique, which cut() doesn't like:
> x1 <- c(1,1,1,1,1,1,1,1,1,2)
> cut(x1, breaks=quantile(x1, (0:2)/2))
Error in cut.default(x1, breaks = quantile(x1, (0:2)/2)) :
  'breaks' are not unique
>

However, cut2() in Hmisc handles this situation gracefully:
> library(Hmisc)

Attaching package: 'Hmisc'

        The following object(s) are masked from package:base :
          format.pval,
          round.POSIXt,
          trunc.POSIXt,
          units

> cut2(x1, g=2)

  [1] 1 1 1 1 1 1 1 1 1 2
Levels: 1 2
>
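
If Hmisc is not available, one possible workaround (my own sketch, not something cut() does for you) is to drop the duplicated quantiles before passing them to cut(). The names qs, brks and binned are illustrative. Be aware that this can leave you with fewer bins than you asked for, and the result differs from what cut2() returns above.

```r
x1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2)

# Quantile breaks for two groups; the minimum and the median coincide here
qs <- quantile(x1, (0:2)/2)

# Drop duplicated breaks so cut() accepts them (assumption: fewer bins is acceptable)
brks <- unique(qs)

# include.lowest = TRUE keeps the minimum value inside the first interval
binned <- cut(x1, breaks = brks, include.lowest = TRUE)

print(table(binned))
```

With this data only one interval survives, so checking nlevels(binned) before computing entropy on the result is probably a good idea.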

(Additionally, a potentially dangerous peculiarity of quantile() for this kind of purpose is that its return values can be out of order (i.e., diff(quantile(...))<0, at rounding error level), but this doesn't actually upset cut() in R because cut() sorts the breaks prior to using them.)
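
A quick way to convince yourself of the sorting behaviour is to pass the same breaks in and out of order and compare the results. The vectors below are made up for illustration.

```r
x <- c(0.5, 1.5, 2.5, 3.5)

# The same three breaks, deliberately out of order
unsorted <- c(4, 0, 2)

from_unsorted <- cut(x, breaks = unsorted)
from_sorted   <- cut(x, breaks = sort(unsorted))

# TRUE if cut() sorted the breaks before using them
print(identical(from_unsorted, from_sorted))
```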


R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat 15 Dec 2007 - 05:50:04 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 15 Dec 2007 - 08:30:19 GMT.
