[R] Bucketing/Grouping Probabilities

From: Random Walker <kinch1967_at_gmail.com>
Date: Wed, 19 Nov 2008 07:43:57 -0800 (PST)

I have a list of entrants (1-14 in this example) in a competitive event and corresponding win probabilities for each entrant.

[(1, 0.049), (2, 0.129), (3, 0.043), (4, 0.013), (5, 0.015), (6, 0.040), (7, 0.066), (8, 0.038), (9, 0.204), (10, 0.022), (11, 0.234), (12, 0.044), (13, 0.068), (14, 0.035)]

So, of course Sum(ps) = 1.

In order to make some subsequent computations more tractable, I wish to cluster entrant win probabilities like so:

[(1, 0.049), (2, 0.121), (3, 0.049), (4, 0.024), (5, 0.024), (6, 0.049), (7, 0.072), (8, 0.049), (9, 0.185), (10, 0.024), (11, 0.185), (12, 0.049), (13, 0.072), (14, 0.049)]

That is, I have 'bucketed' the entrants against 5 representative probabilities, and in subsequent computations I will deem (for example) the win probability of entrant 3 to be 0.049. Another way of visualising the result is:

[((4, 5, 10), 0.024),
((1, 3, 6, 8, 12, 14), 0.049),
((7, 13), 0.072),
((2), 0.121),
((9, 11), 0.185)]

and (3 * 0.024) + (6 * 0.049) + (2 * 0.072) + (1 * 0.121) + (2 * 0.185) = 1.001 ~= 1.
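As a check on the arithmetic, the grouped view can be rebuilt in R directly from the bucketed list given earlier. This is only a small sketch; the variable names (`bucketed`, `buckets`, `total`) are made up for illustration.

```r
# Bucketed probabilities for entrants 1-14, copied from the bucketed list above.
bucketed <- c(0.049, 0.121, 0.049, 0.024, 0.024, 0.049, 0.072,
              0.049, 0.185, 0.024, 0.185, 0.049, 0.072, 0.049)
entrant <- seq_along(bucketed)

# Group entrant numbers by their representative probability.
buckets <- split(entrant, bucketed)
print(buckets)

# Bucket size times representative probability, summed over buckets,
# should come out close to 1.
sizes <- lengths(buckets)
reps  <- as.numeric(names(buckets))
total <- sum(sizes * reps)
print(total)
```

The total differs slightly from 1 because the representative probabilities are rounded to three decimals.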

My question is: What is the most 'correct' way to cluster these probabilities? In my case the problem is not totally unconstrained. I would like to specify the number of buckets (probably will always wish to use either 5 or 6), so I do not need an algorithm which determines the most appropriate number of buckets given some cost function. I just need to know for a given number of buckets, which entrants go in which buckets and what is the representative probability for each bucket.

The first thing which occurs to me is to sort probabilities into ascending order, generate all partitions of the list into (say) 5 buckets, and pick the partition which minimises the sum of squared differences from the mean of each bucket summed over all buckets. If buckets were not associated with probabilities I would do this without a second thought... but I wonder if this is the right thing to do here? I'm too statistically naive to know one way or the other.
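The brute-force idea above can be sketched in R roughly as follows. This is a minimal illustration under two assumptions: buckets are contiguous runs of the sorted probabilities, and the representative probability of a bucket is its mean. The helper names (`labels_for`, `within_ss`) are hypothetical, not from any package.

```r
# Win probabilities for entrants 1-14, as listed at the top of this post.
p <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.204, 0.022, 0.234, 0.044, 0.068, 0.035)
k <- 5                      # desired number of buckets

ord <- order(p)             # entrant numbers in ascending order of probability
ps  <- p[ord]
n   <- length(ps)

# Bucket labels 1..k for the sorted values, given k - 1 increasing cut points
# (a cut at i means a bucket ends after the i-th sorted value).
labels_for <- function(cuts) rep(seq_len(k), times = diff(c(0, cuts, n)))

# Sum over buckets of squared deviations from the bucket mean.
within_ss <- function(x, g) sum((x - ave(x, g))^2)

cuts <- combn(n - 1, k - 1)   # every way to place the cuts, one per column
cost <- apply(cuts, 2, function(cc) within_ss(ps, labels_for(cc)))
best <- labels_for(cuts[, which.min(cost)])

bucket <- integer(n)
bucket[ord] <- best           # back to original entrant order
rep_p <- ave(p, bucket)       # representative probability = bucket mean

print(split(seq_len(n), bucket))
print(round(tapply(p, bucket, mean), 4))
```

With 14 entrants and 5 buckets there are only choose(13, 4) = 715 candidate partitions, so exhaustive search is cheap. Using the bucket mean as the representative value also keeps the bucketed probabilities summing to exactly the original total, since replacing each value by its group mean does not change the sum.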

I would appreciate any suggestions re correct approach and also (obviously) any tips on how one might go about this in R using canned functions.
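For reference, one canned candidate is `kmeans()` from the base `stats` package, which targets the same within-bucket sum of squares. The sketch below assumes that criterion is the one wanted; `kmeans()` is a heuristic started from random centres, so it is not guaranteed to find the best partition, and the seed and `nstart` value here are arbitrary choices.

```r
# Win probabilities for entrants 1-14, as listed at the top of this post.
p <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.204, 0.022, 0.234, 0.044, 0.068, 0.035)
k <- 5                                     # desired number of buckets

set.seed(1)                                # kmeans() uses random starting centres
km <- kmeans(p, centers = k, nstart = 50)  # keep the best of 50 random starts

bucket <- km$cluster                       # bucket label per entrant (arbitrary numbering)
rep_p  <- km$centers[bucket]               # representative probability per entrant

print(split(seq_along(p), bucket))
print(round(sort(km$centers[, 1]), 4))
print(km$tot.withinss)                     # the criterion being minimised
```

The cluster numbers returned by `kmeans()` carry no ordering, so they need sorting by centre if the buckets are to run from least to most likely.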

Many thanks!

Received on Wed 19 Nov 2008 - 15:54:23 GMT
