From: Martin Maechler <maechler_at_stat.math.ethz.ch>

Date: Thu, 17 Jul 2008 18:32:12 +0200

>>> shapiro.test(c(-0.644, 0.0566, 0.0566))

>>> shapiro.test(c(1,1,2))$p.value

>>> shapiro.test(c(-1,-1,2))$p.value

MC> [1] -1.03e-06

MC> [26] gtools_2.4.0

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Thu 17 Jul 2008 - 17:11:27 GMT

Date: Thu, 17 Jul 2008 18:32:12 +0200

>>>>> "MM" == Martin Maechler <maechler_at_stat.math.ethz.ch> >>>>> on Wed, 16 Jul 2008 18:02:47 +0200 writes:

>>>>> "MC" == Mark Cowley <m.cowley_at_garvan.org.au> >>>>> on Wed, 16 Jul 2008 15:32:30 +1000 writes:

MC> Dear list,

MC> I am analysing a set of quantitative proteomics data MC> from 16 patients which has a large numbers of missing MC> data, thus some proteins are only detected once, upto a MC> maximum of 16. I want to test each protein for MC> normality by the Shapiro Wilk test (function MC> shapiro.test in package stats), which can only be MC> applied to data with at least 3 measurements, which is MC> fine. In the case where I have only 3 observations, and MC> two of those observations are identical, then the MC> shapiro.test produces negative P-values, which should MC> never happen. This occurs for all of the situations MC> that I have tried for 3 values, where 2 are the same.

MM> Yes. Since all such tests are location- and scale-invariant, you MM> can reproduce it with

MM> shapiro.test(c(0,0,1))

MM> The irony is that the original papers by Roydon and the R help MM> page all assert that the P-value for n = 3 is exact !

MM> OTOH, the paper [Roydon (1982), Appl.Stat 31, p115-124] MM> clearly states that MM> X(1) < X(2) < X(3) ... < X(n)

MM> i.e., does not allow "ties" (two equal values).

MM> If the exact formula in the paper were evaluated exactly MM> (instead with a rounded value of about 6 digits), MM> the "exact P-value" would be exactly 0.

I have now slightly increased the precision in some of the calculations involved, and also make sure that P-value >= 0 for n == 3.

This is now in both R-patched and R-devel.

Thank you, Marc, for the report!

But really, back to your data analysis question :

I cannot imagine that the P-value of a Shapiro-Wilks test with n=3 (non-NA) observations is a good tool to help you draw valid conclusions about your data ....

We have had several threads (on R-help) about the (non)sense of normality testing...

MM> Now that would count as a bug in the paper I think.

The bug is not in Roydon's math-stat paper, but arguably in the Fortran code that was published in the accompanying paper...

MM> More about this tomorrow or so.

Martin Maechler, ETH Zurich

MC> Reproducible code below: MC> # these are the data points that raised the problem

>>> shapiro.test(c(-0.644, 0.0566, 0.0566))

MC> Shapiro-Wilk normality test

MC> data: c(-0.644, 0.0566, 0.0566) MC> W = 0.75, p-value < 2.2e-16

>>> shapiro.test(c(-0.644, 0.0566, 0.0566))$p.value

MC> [1] -7.69e-07 MC> # note the verbose output shows a small, but positive P-value, but MC> when you extract that P using $p.value, it becomes negative MC> # various other tests

>>> shapiro.test(c(1,1,2))$p.value

MC> [1] -8.35e-07

>>> shapiro.test(c(-1,-1,2))$p.value

MC> [1] -1.03e-06

MC> cheers,

MC> Mark

>>> sessionInfo()

MC> R version 2.6.1 (2007-11-26)

MC> i386-apple-darwin8.10.1

MC> locale:

MC> en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

MC> attached base packages: MC> [1] tcltk graphics grDevices datasets utils stats MC> methods base MC> other attached packages: MC> [1] qvalue_1.12.0 Cairo_1.3-5 RSvgDevice_0.6.3 MC> SparseM_0.74 pwbc_0.1 MC> [6] mjcdev_0.1 tigrmev_0.1 slfa_0.1 MC> sage_0.1 qtlreaper_0.1 MC> [11] pajek_0.1 mjcstats_0.1 mjcspot_0.1 MC> mjcgraphics_0.1 mjcaffy_0.1 MC> [16] haselst_0.1 geomi_0.1 geo_0.1 MC> genomics_0.1 cor_0.1 MC> [21] bootstrap_0.1 blat_0.1 bitops_1.0-4 MC> mjcbase_0.1 gdata_2.3.1

MC> [26] gtools_2.4.0

MC> -----------------------------------------------------MC> Mark Cowley, BSc (Bioinformatics)(Hons)

MC> Peter Wills Bioinformatics Centre MC> Garvan Institute of Medical Research, Sydney, Australia

MC> ______________________________________________ MC> R-help_at_r-project.org mailing list MC> https://stat.ethz.ch/mailman/listinfo/r-helpMC> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html MC> and provide commented, minimal, self-contained, reproducible code.

MM> ______________________________________________ MM> R-help_at_r-project.org mailing list MM> https://stat.ethz.ch/mailman/listinfo/r-helpMM> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html MM> and provide commented, minimal, self-contained, reproducible code.

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Thu 17 Jul 2008 - 17:11:27 GMT

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.2.0, at Thu 17 Jul 2008 - 17:31:49 GMT.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*