Re: [R] negative P-values with shapiro.test

From: Martin Maechler <maechler_at_stat.math.ethz.ch>
Date: Thu, 17 Jul 2008 18:32:12 +0200

>>>>> "MM" == Martin Maechler <maechler_at_stat.math.ethz.ch>
>>>>>     on Wed, 16 Jul 2008 18:02:47 +0200 writes:

>>>>> "MC" == Mark Cowley <m.cowley_at_garvan.org.au>
>>>>>     on Wed, 16 Jul 2008 15:32:30 +1000 writes:

    MC> Dear list,

    MC> I am analysing a set of quantitative proteomics data
    MC> from 16 patients which has a large amount of missing
    MC> data, thus some proteins are only detected once, up to a
    MC> maximum of 16.  I want to test each protein for
    MC> normality by the Shapiro Wilk test (function
    MC> shapiro.test in package stats), which can only be
    MC> applied to data with at least 3 measurements, which is
    MC> fine. In the case where I have only 3 observations, and
    MC> two of those observations are identical, then the
    MC> shapiro.test produces negative P-values, which should
    MC> never happen.  This occurs for all of the situations
    MC> that I have tried for 3 values, where 2 are the same.


    MM> Yes. Since all such tests are location- and scale-invariant, you
    MM> can reproduce it with

    MM> shapiro.test(c(0,0,1))

    MM> The irony is that the original papers by Royston and the R help
    MM> page all assert that the P-value for n = 3 is exact !

    MM> OTOH, the paper [Royston (1982), Appl.Stat 31, p115-124]
    MM> clearly states that 
    MM> X(1) < X(2) < X(3) ... < X(n)

    MM> i.e., does not allow "ties" (two equal values).

    MM> If the exact formula in the paper were evaluated exactly
    MM> (instead of with a value rounded to about 6 digits),
    MM> the "exact P-value" would be exactly 0.
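
As a rough illustration of the point above (a sketch, not the code used inside R): for n = 3 the exact P-value can be written as (6/pi) * (asin(sqrt(W)) - asin(sqrt(3/4))), and any sample of three values with one tie gives W = 0.75, so the exact result is 0. The helper names w3 and p3, and the six-digit constants passed to p3, are assumptions made for this example.

```r
# Hypothetical helper: Shapiro-Wilk W for n = 3, computed by hand.
# For n = 3 the weights are (-1/sqrt(2), 0, 1/sqrt(2)), so
# W = (x(3) - x(1))^2 / (2 * sum((x - mean(x))^2)).
w3 <- function(x) {
  x <- sort(x)
  (x[3] - x[1])^2 / (2 * sum((x - mean(x))^2))
}

# Hypothetical helper: exact n = 3 P-value. The two constants are
# arguments so that the effect of rounding them can be shown.
p3 <- function(w, six_over_pi = 6 / pi, asin_sqrt_34 = pi / 3) {
  six_over_pi * (asin(sqrt(w)) - asin_sqrt_34)
}

w <- w3(c(0, 0, 1))       # 0.75 up to floating-point rounding

p_full    <- p3(w)        # full-precision constants: essentially 0
p_rounded <- p3(w, six_over_pi = 1.909859, asin_sqrt_34 = 1.047198)

print(c(W = w, p_full = p_full, p_rounded = p_rounded))
```

With the constants rounded to about six digits, p_rounded comes out slightly below zero (on the order of -1e-6), which is the same kind of value as in the original report.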

I have now slightly increased the precision in some of the calculations involved, and also made sure that P-value >= 0 for n == 3.

This is now in both R-patched and R-devel.
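
A quick way to check for the problem, assuming an R build that already includes this change: the tied samples from the report should all give a P-value >= 0. The pmax() line is only a user-side guard for older builds. It is not the change made in R itself.

```r
# Samples of size 3 with two tied values, taken from this thread.
samples <- list(c(0, 0, 1),
                c(1, 1, 2),
                c(-1, -1, 2),
                c(-0.644, 0.0566, 0.0566))

# Extract the P-value of each test.
pvals <- vapply(samples, function(x) shapiro.test(x)$p.value, numeric(1))
print(pvals)

# User-side guard for older R versions: never report a negative P-value.
pvals_guarded <- pmax(pvals, 0)
print(pvals_guarded)
```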

Thank you, Mark, for the report!
But really, back to your data analysis question:

I cannot imagine that the P-value of a Shapiro-Wilk test with n=3 (non-NA) observations is a good tool to help you draw valid conclusions about your data ....

We have had several threads (on R-help) about the (non)sense of normality testing...

    MM> Now that would count as a bug in the paper I think.

The bug is not in Royston's math-stat paper, but arguably in the Fortran code that was published in the accompanying paper...

    MM> More about this tomorrow or so.

Martin Maechler, ETH Zurich

    MC> Reproducible code below:
    MC> # these are the data points that raised the problem

    MC> > shapiro.test(c(-0.644, 0.0566, 0.0566))

    MC> Shapiro-Wilk normality test

    MC> data: c(-0.644, 0.0566, 0.0566)
    MC> W = 0.75, p-value < 2.2e-16

    MC> > shapiro.test(c(-0.644, 0.0566, 0.0566))$p.value

    MC> [1] -7.69e-07
    MC> # note the verbose output shows a small, but positive P-value, but
    MC> # when you extract that P using $p.value, it becomes negative
    MC> # various other tests

    MC> > shapiro.test(c(1,1,2))$p.value
    MC> [1] -8.35e-07

    MC> > shapiro.test(c(-1,-1,2))$p.value
    MC> [1] -1.03e-06

    MC> cheers,

    MC> Mark

    MC> > sessionInfo()

    MC> R version 2.6.1 (2007-11-26)
    MC> i386-apple-darwin8.10.1

    MC> locale:
    MC> en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

    MC> attached base packages:
    MC> [1] tcltk     graphics  grDevices datasets  utils     stats     methods   base

    MC> other attached packages:
    MC> [1] qvalue_1.12.0    Cairo_1.3-5      RSvgDevice_0.6.3 SparseM_0.74     pwbc_0.1
    MC> [6] mjcdev_0.1       tigrmev_0.1      slfa_0.1         sage_0.1         qtlreaper_0.1
    MC> [11] pajek_0.1        mjcstats_0.1     mjcspot_0.1      mjcgraphics_0.1  mjcaffy_0.1
    MC> [16] haselst_0.1      geomi_0.1        geo_0.1          genomics_0.1     cor_0.1
    MC> [21] bootstrap_0.1    blat_0.1         bitops_1.0-4     mjcbase_0.1      gdata_2.3.1
    MC> [26] gtools_2.4.0
    MC> -----------------------------------------------------
    MC> Mark Cowley, BSc (Bioinformatics)(Hons)

    MC> Peter Wills Bioinformatics Centre
    MC> Garvan Institute of Medical Research, Sydney, Australia

    MC> ______________________________________________
    MC> R-help_at_r-project.org mailing list
    MC> https://stat.ethz.ch/mailman/listinfo/r-help
    MC> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
    MC> and provide commented, minimal, self-contained, reproducible code.

Received on Thu 17 Jul 2008 - 17:11:27 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 17 Jul 2008 - 17:31:49 GMT.
