Re: [Rd] Binning of integers with hist() function odd results (PR#14048)

From: <gug_at_fnal.gov>
Date: Sat, 07 Nov 2009 16:05:09 +0100 (CET)


Hi,

    Thank you for responding quickly and explaining the behavior. Adding "include.lowest=TRUE, right=FALSE" and specifying the breaks manually resolved the simple test case. I then updated my more complex data set, which already had manually defined breaks, and that resolved my issues there too. I have now updated all my functions that use hist(), so I hopefully won't forget this in the future.
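
A minimal sketch of that change on the simple test case. The breaks 1:6 are an assumption chosen for illustration (they are not the breaks of the larger data set), and plot=FALSE is only there to skip drawing:

```r
x <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)

# right = FALSE gives left-closed bins [1,2), [2,3), [3,4), [4,5), [5,6];
# include.lowest = TRUE then also closes the last bin on the right: [5,6].
h <- hist(x, breaks = 1:6, include.lowest = TRUE, right = FALSE, plot = FALSE)

h$counts   # each integer now lands in its own bin
```

With these settings an integer sitting on a break is counted in the bin that starts at that break, which is the behavior I had expected in the first place.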

On Nov 7, 2009, at 7:57 AM, Ted Harding wrote:

> On 06-Nov-09 23:30:12, gug_at_fnal.gov wrote:
>> Full_Name: Gerald Guglielmo
>> Version: 2.8.1 (2008-12-22)
>> OS: OSX Leopard
>> Submission from: (NULL) (131.225.103.35)
>>
>> When I attempt to use the hist() function to bin integers the behavior
>> seems very odd as the bin boundary seems inconsistent across the various
>> bins. For some bins the upper boundary includes the next integer value,
>> while in others it does not. If I add 0.1 to every value, then the hist()
>> binning behavior is what I would normally expect.
>>
>>> h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
>>> h1$mids
>> [1] 1.5 2.5 3.5 4.5
>>> h1$counts
>> [1] 3 3 4 5
>>> h2<-hist(c(1.1,2.1,2.1,3.1,3.1,3.1,4.1,4.1,4.1,4.1,5.1,5.1,5.1,5.1,5.1))
>>> h2$mids
>> [1] 1.5 2.5 3.5 4.5 5.5
>>> h2$counts
>> [1] 1 2 3 4 5
>>
>> Naively I would have expected the same distribution of counts in the
>> two cases, but clearly that is not happening. This is a simple example
>> to illustrate the behavior, originally I noticed this while binning a
>> large data sample where I had set the breaks=c(0,24,1).
>
> This is the correct, intended behaviour. By default, a value that lies
> exactly on the boundary between two bins is counted in the bin just
> below that boundary. The exception is the bottom-most break: values
> equal to it are counted in the bin just above it.
>
> Hence 1,2,2 all go into the [1,2] bin; 3,3,3 into (2,3];
> 4,4,4,4 into (3,4]; and 5,5,5,5,5 into (4,5]. Hence the counts
> 3,3,4,5.
>
> Since you did not set breaks in
> h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)),
> they were set using the default method, and you can see what they are
> with
>
> h1$breaks
> [1] 1 2 3 4 5
>
> When you add 0.1 to each value, you push the values on the boundaries
> up into the next bin. Now each value is inside its bin, and not on
> any boundary. Hence 1.1 is in (1,2]; 2.1,2.1 in (2,3];
> 3.1,3.1,3.1 in (3,4]; 4.1,4.1,4.1,4.1 in (4,5]; and
> 5.1,5.1,5.1,5.1,5.1 in (5,6], giving counts 1,2,3,4,5 as you observe.
>
> The default behaviour described above is defined by the default
> options
>
> include.lowest = TRUE, right = TRUE
>
> where:
>
> include.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'
> value will be included in the first (or last, for 'right =
> FALSE') bar. This will be ignored (with a warning) unless
> 'breaks' is a vector.
>
> right: logical; if 'TRUE', the histogram cells are right-closed
> (left open) intervals.
>
> See '?hist'. You can change this behaviour by changing the options.
>
> Hoping this helps,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding_at_manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 07-Nov-09 Time: 13:57:07
> ------------------------------ XFMail ------------------------------
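
The bin assignment described above can be checked directly. A small sketch, with the breaks passed explicitly so it does not depend on the default break algorithm: 1:5 is the h1$breaks value quoted above, while 1:6 is an assumption inferred from the h2$mids output.

```r
x <- c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)

# Defaults (include.lowest = TRUE, right = TRUE) give the bins
# [1,2], (2,3], (3,4], (4,5], so 1 and both 2s share the first bin.
h1 <- hist(x, breaks = 1:5, plot = FALSE)
h1$counts

# Adding 0.1 moves every value off a break and into the bin above it.
h2 <- hist(x + 0.1, breaks = 1:6, plot = FALSE)
h2$counts
```

The first call should reproduce the counts 3 3 4 5 and the second 1 2 3 4 5, as in the original report.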

-- 
-Jerry->
gug_at_fnal.gov
Pepe's Theory of everything: "Under the right circumstances, things  
happen."



______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sat 07 Nov 2009 - 15:18:51 GMT
