# Re: [R] Bucketing/Grouping Probabilities

From: Gabor Grothendieck <ggrothendieck_at_gmail.com>
Date: Wed, 19 Nov 2008 11:10:04 -0500

Try this:

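# win probabilities for the 14 entrants (entrant numbers are not part of x)
x <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066, 0.038, 0.2040, 0.0221, 0.234, 0.0443, 0.0684, 0.035)
cl <- kmeans(x, 5)
cl
# pair each original probability with the center of its cluster
newold <- with(cl, data.frame(old = x, new = centers[cluster]))
newold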
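
Because kmeans() starts from randomly chosen centers, the buckets can differ from run to run. A minimal sketch of one way to steady that, assuming a fixed seed and several random restarts (the seed value, `nstart = 25`, and the names `p`, `bucketed` and `buckets` are illustrative choices, not part of the suggestion above). It also lists which entrants land in each bucket:

```r
# Win probabilities for the 14 entrants, as given in the question
p <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.204, 0.022, 0.234, 0.044, 0.068, 0.035)

set.seed(1)                                 # arbitrary seed, for reproducibility only
cl <- kmeans(p, centers = 5, nstart = 25)   # 25 random starts, best one is kept

# Representative probability of each entrant = mean of its cluster
bucketed <- data.frame(entrant = seq_along(p),
                       old     = p,
                       new     = as.vector(cl$centers)[cl$cluster])

# Entrant numbers grouped by (rounded) representative probability
buckets <- split(bucketed$entrant, round(bucketed$new, 3))
buckets

# Each center is a cluster mean, so the bucketed probabilities
# keep the same total as the originals
sum(bucketed$new)
```

Since every representative value is the mean of its bucket, the bucketed probabilities still sum to the original total, which may matter for the later computations.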

On Wed, Nov 19, 2008 at 10:43 AM, Random Walker <kinch1967_at_gmail.com> wrote:
>
> I have a list of entrants (1-14 in this example) in a competitive event and
> corresponding win probabilities for each entrant.
>
> [(1, 0.049), (2, 0.129), (3, 0.043), (4, 0.013), (5, 0.015), (6,
> 0.040), (7, 0.066), (8, 0.038), (9, 0.204), (10, 0.022), (11, 0.234),
> (12, 0.044), (13, 0.068), (14, 0.035)]
>
> So, of course Sum(ps) = 1.
>
> In order to make some subsequent computations more tractable, I wish to
> cluster entrant win probabilities like so:
>
> [(1, 0.049), (2, 0.121), (3, 0.049), (4, 0.024), (5, 0.024), (6,
> 0.049), (7, 0.072), (8, 0.049), (9, 0.185), (10, 0.024), (11, 0.185),
> (12, 0.049), (13, 0.072), (14, 0.049)]
>
> viz. in this case I have 'bucketed' the entrant numbers against 5
> representative probabilities and in subsequent computations will deem (for
> example) the win probability of 3 to be 0.049, so another way of visualising
> the result is:
>
> [((4, 5, 10), 0.024),
> ((1, 3, 6, 8, 12, 14), 0.049),
> ((7, 13), 0.072),
> ((2), 0.121),
> ((9, 11), 0.185)]
>
> and (3 * 0.024) + (6 * 0.049) + (2 * 0.072) + (1 * 0.121) + (2 * 0.185) ~=
> 1.
>
> My question is: What is the most 'correct' way to cluster these
> probabilities? In my case the problem is not totally unconstrained. I would
> like to specify the number of buckets (probably will always wish to use
> either 5 or 6), so I do not need an algorithm which determines the most
> appropriate number of buckets given some cost function. I just need to know
> for a given number of buckets, which entrants go in which buckets and what
> is the representative probability for each bucket.
>
> The first thing which occurs to me is to sort probabilities into ascending
> order, generate all partitions of the list into (say) 5 buckets, and pick
> the partition which minimises the sum of squared differences from the mean
> of each bucket summed over all buckets. If buckets were not associated with
> probabilities I would do this without a second thought... but I wonder if
> this is the right thing to do here? I'm too statistically naive to know one
> way or the other.
>
> I would appreciate any suggestions re correct approach and also (obviously)
>
> Many thanks!
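
The exhaustive search described in the question is cheap at this size. A minimal sketch, assuming the squared-error criterion from the question (all variable names here are illustrative). It relies on the fact that, in one dimension, a least-squares grouping can always be taken as contiguous runs of the sorted values, so for 14 entrants and 5 buckets only choose(13, 4) = 715 sets of cut points need to be checked:

```r
p <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.204, 0.022, 0.234, 0.044, 0.068, 0.035)
k <- 5                       # number of buckets

ord <- order(p)              # entrant numbers in ascending order of probability
ps  <- p[ord]                # sorted probabilities
n   <- length(ps)

# Sum of squared deviations from the bucket mean
sse <- function(v) sum((v - mean(v))^2)

# Bucket label (1..k) for each sorted position, given break positions b;
# a break at i means a bucket ends after the i-th sorted value
label <- function(b) findInterval(seq_len(n), b + 1) + 1

# Every way of placing k - 1 breaks among the n - 1 gaps, one per column
cuts <- combn(n - 1, k - 1)

# Total within-bucket squared error for each candidate partition
cost <- apply(cuts, 2, function(b) sum(tapply(ps, label(b), sse)))

best    <- cuts[, which.min(cost)]
grp     <- label(best)
centers <- tapply(ps, grp, mean)          # representative probability per bucket

result <- data.frame(entrant = ord,
                     old     = ps,
                     new     = as.vector(centers[grp]))
result <- result[order(result$entrant), ] # back to entrant order
result
```

The minimum of `cost` is the quantity that kmeans() reports as `tot.withinss`, so comparing the two is a quick check on whether a given kmeans() run found the best partition. For many more entrants or buckets the number of candidate partitions grows quickly and this brute-force enumeration stops being practical.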