From: Ted Harding <Ted.Harding_at_manchester.ac.uk>

Date: Wed, 23 Dec 2009 18:14:05 +0000 (GMT)

>> -----Original Message-----
>> From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of Bernardo Rangel Tura
>> Sent: Tuesday, December 22, 2009 1:16 AM
>> To: ivorytower_at_emails.bjut.edu.cn
>> Cc: R-help_at_r-project.org
>> Subject: Re: [R] Problem with "Cannot compute correct p-values with ties"
>>
>> On Wed, 2009-12-02 at 16:52 +0800, Zhijiang Wang wrote:
>> > Dear All,
>> > 1. why did the problem happen?
>> > 2. How to solve it?
>> >
>> > --
>> > Best wishes,
>> > Zhijiang Wang
>>
>> Well... The algorithm for the Mann-Whitney test has problems with ties.
>>
>> To solve this you can use jitter:
>>
>> a <- 1:10
>> b <- 1:10
>> wilcox.test(a, b)
>>
>>         Wilcoxon rank sum test with continuity correction
>>
>> data: a and b
>> W = 50, p-value = 1
>> alternative hypothesis: true location shift is not equal to 0
>>
>> Warning message:
>> In wilcox.test.default(a, b) : cannot compute exact p-value with ties
>>
>> wilcox.test(a, jitter(b))
>>
>>         Wilcoxon rank sum test
>>
>> data: a and jitter(b)
>> W = 49, p-value = 0.9705
>> alternative hypothesis: true location shift is not equal to 0
>>
>> See ?jitter for more information.
>>
>> --
>> Bernardo Rangel Tura, M.D, MPH, Ph.D
>> National Institute of Cardiology
>> Brazil

The issue of ties in the Mann-Whitney test needs some thought.
The distribution function of the Mann-Whitney test statistic is
derived on the assumption that (in effect) the data are continuous
variables, so that (theoretically) there should be no ties.
When ties occur, this assumption has failed.

1. If the data represent a continuous underlying variable which has been recorded to a relatively coarse precision ("binned"), so that some ties are likely, then random "tie-breaking" is a plausible solution. In effect, a tie-cluster of size k would then represent k unequal observations which could have been in any one of k! orders (indistinguishable from the data in hand).

On the assumption that the "bin" width is so small that the possible distinct unobserved values cannot differ enough to materially affect the probabilities of different orderings (i.e. they can be considered as if they were uniformly distributed over the "bin"), these k! orders can be considered equally likely. Then "jittering" (adding small independent noise values to each of the equal data) will yield one of these k! orderings with the same probability for each. When all the tie-clusters have been broken in this way, the P-value for the Mann-Whitney will be exactly correct for that particular breaking of the ties.
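To illustrate with made-up binned data (the vectors below are hypothetical, not from the original post), each call to jitter() realises one of these equally likely orderings:

```r
## Made-up binned data with tie-clusters within and across both groups
x <- c(1, 2, 2, 3, 3, 4)
y <- c(2, 3, 3, 3, 4, 5)

## With ties present, wilcox.test() falls back to the normal
## approximation and warns that the exact p-value cannot be computed
wilcox.test(x, y)

## Adding tiny independent noise breaks every tie-cluster into one
## of its k! orderings, chosen with equal probability
set.seed(1)
wilcox.test(jitter(x, amount = 1e-6), jitter(y, amount = 1e-6))
```

The `amount` argument keeps the noise far smaller than the bin width, so the jittering can only reorder tied values, never reorder distinct ones.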

However, this is just one of the k! possible orderings; similarly for any other tie-clusters in the data.

A different "jittering" would yield a different ordering, and a different P-value. So what to choose? Well, you have to recognise that they are all possible as far as the data tell, and all equally likely. So an appropriate approach is to simulate a lot of random tie-breaks, getting a P-value for each, and ending up with an adequately large sample of random P-values.

What you do with this depends on what you need, and on how they are distributed. If, for instance, all of 10000 P-values were less than, say, 0.001, and 0.001 was an adequate P-value for your purposes, then you can be very confident that you have a "significant result" -- in other words, if you had known the exact underlying values (with no ties), then it is almost certain that you would still have got a P < 0.001 test result.

Similarly, maybe 2 out of 10000 are greater than 0.01, the rest less. Then you can be fairly confident that the "true" P-value is less than 0.01.

Or you could estimate the "true" P-value as the mean of the simulated ones (preferably with a standard error too). Indeed, you could simply compute a confidence interval for the P-value (but you would have to choose the confidence level).
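A sketch of this simulation approach (the data and the choice of 10000 replications are illustrative, not from the original post):

```r
## Random tie-breaking repeated many times: each replication jitters
## the data afresh, realising one of the equally likely orderings
x <- c(1, 2, 2, 3, 3, 4)
y <- c(2, 3, 3, 3, 4, 5)

set.seed(123)
pvals <- replicate(10000,
  wilcox.test(jitter(x, amount = 1e-6),
              jitter(y, amount = 1e-6))$p.value)

mean(pvals)                       # point estimate of the "true" P-value
sd(pvals) / sqrt(length(pvals))   # its standard error
quantile(pvals, c(0.025, 0.975))  # spread of the simulated P-values
```

After jittering there are no ties, so each replication yields an exact P-value for that particular tie-break; the distribution of `pvals` is then what the preceding paragraphs describe.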

If there are only a few ties in the data, then a complete enumeration of all possible tie-breaks is feasible. You then have everything you could possibly need to know, given the data, relative to this approach.
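For example, with a single tie-cluster of size 3 the 3! = 6 tie-breaks can be listed exhaustively. The recursive `perms()` helper and the data below are hypothetical, purely to sketch the enumeration in base R:

```r
## All ways of breaking one tie-cluster: three observations tied at
## the value 2, one in x and two in y.  Each permutation of distinct
## tiny offsets realises one of the 3! orderings of the tied values.
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
  out
}

x <- c(1, 2, 4)
y <- c(2, 2, 5)

pvals <- sapply(perms(c(1, 2, 3) * 1e-9), function(e) {
  xb <- x; yb <- y
  xb[2]   <- xb[2] + e[1]      # break the tie in x
  yb[1:2] <- yb[1:2] + e[2:3]  # break the ties in y
  wilcox.test(xb, yb)$p.value
})
sort(unique(round(pvals, 4)))  # the full set of attainable P-values
```

With several tie-clusters you would take the cartesian product of the per-cluster permutations; the count grows as the product of the k!'s, which is why enumeration is only feasible for a few ties.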

2. If the data represent essentially discrete values (e.g. they are count data, or ordered categorical), where ties are intrinsically possible, then strictly speaking the Mann-Whitney test is not appropriate, since its distribution function depends on the assumption of continuity which is not true here.

However, nothing prevents you adopting the Mann-Whitney statistic as your test statistic of choice. The only problem is that you may not refer its value to the Mann-Whitney distribution.

If there are k ordered categories C1 < C2 < ... < Ck, then the Null Hypothesis is that Prob(X in Cj) is the same for each of the two groups of data. It is then possible to devise a "permutation test", whose evaluation for the data in hand could again be achieved by random simulation. But you're also getting into contingency table territory here, which is a somewhat different kind of universe!
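One way to sketch such a permutation test is to compute the Mann-Whitney W statistic directly from midranks and refer it to its own permutation distribution rather than to the Mann-Whitney tables (the category codes and group sizes below are made up):

```r
## Mann-Whitney W computed from midranks, so ties need no
## continuity assumption at all
W <- function(x, y) {
  r <- rank(c(x, y))   # rank() assigns midranks to tied values
  sum(r[seq_along(x)]) - length(x) * (length(x) + 1) / 2
}

## Ordered categories coded 1..4; ties are intrinsic here
x <- c(1, 1, 2, 2, 3, 4)
y <- c(2, 3, 3, 4, 4, 4)
w_obs <- W(x, y)

## Permutation distribution: under H0 the group labels are
## exchangeable, so shuffle them and recompute W
set.seed(42)
pooled <- c(x, y); nx <- length(x)
w_perm <- replicate(10000, {
  s <- sample(length(pooled), nx)
  W(pooled[s], pooled[-s])
})

## Two-sided permutation P-value, measured from the null mean nx*ny/2
mu <- nx * length(y) / 2
mean(abs(w_perm - mu) >= abs(w_obs - mu))
```

Because the reference distribution is generated from the data's own tie pattern, the resulting P-value is valid for discrete data, which the tabulated Mann-Whitney distribution is not.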

Hoping this adds something useful!

Ted.

On 23-Dec-09 17:32:50, Greg Snow wrote:

> Adding random noise to data in order to avoid a warning is like
> removing the batteries from a smoke detector to silence it rather than
> investigating what is causing the alarm to go off.
>
> If the function is giving a warning it is best to investigate why; it
> is possible that you can ignore the warning (the burnt toast of smoke
> alarm analogies), but it is best to convince yourself that it is OK. It
> is also possible in this case that another tool may be more
> appropriate, and investigating the warning could help you find that
> tool.
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow_at_imail.org
> 801.408.8111


E-Mail: (Ted Harding) <Ted.Harding_at_manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861

Date: 23-Dec-09  Time: 18:14:03
------------------------------ XFMail ------------------------------

______________________________________________
R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Wed 23 Dec 2009 - 18:16:22 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Wed 23 Dec 2009 - 20:30:32 GMT.
