Re: [Rd] p.adjust(<NA>s), was "Re: [BioC] limma and p-values"

From: Gordon Smyth <smyth_at_wehi.edu.au>
Date: Sun 16 Jan 2005 - 19:44:26 EST

The new committed version of p.adjust() contains some problems:

 > p.adjust(c(0.05,0.5),method="hommel") [1] 0.05 0.50

No adjustment!

I can't see how the new treatment of NAs can be justified. One needs to distinguish between NAs which represent missing p-values and NAs which represent unknown p-values. In virtually all applications giving rise to NAs, the NAs represent missing p-values which could not be computed because of missing data. In such cases, the observed p-values should definitely be adjusted as if the NAs weren't there, because NAs represent p-values which genuinely don't exist.

I can only think of one situation in which the NAs might represent unknown but existing p-values. This would be when a large experiment has been conducted leading to many p-values. Instead of inputing all the p-values to the p.adjust() function, you decide to enter only the smallest p-values and represent the others using NAs. The trouble with this approach is that it can only be used with method="bonferroni". All the other adjustment methods are step-up or step-down methods or involve closure like Hommel's method. For these methods, you simply have to know all the p-values before you can adjust any of them.

For example, you're returning

 > p.adjust(c(0.05,NA,NA,NA,NA,NA,NA,NA,NA,NA),method="fdr")   [1] 0.5 NA NA NA NA NA NA NA NA NA

But the unknown p-values might have been like this:  >
p.adjust(c(0.05,0.051,0.051,0.051,0.051,0.051,0.051,0.051,0.051,0.051),method="fdr")   [1] 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051

in which case the first adjusted p-value would have been 0.051 not 0.5.

How can you justify returning 0.5 for the first adjusted p-value, when the correct value could actually be a factor of 10 lower, even when the first p-value is in fact the smallest of the p-values?

At 07:29 AM 9/01/2005, Martin Maechler wrote:
>I've thought more and made experiements with R code versions
>and just now committed a new version of p.adjust() to R-devel
>--> https://svn.r-project.org/R/trunk/src/library/stats/R/p.adjust.R
>which does sensible NA handling by default and
>*additionally* has an "na.rm" argument (set to FALSE by
>default). The extended 'Examples' secion on the help page
> https://svn.r-project.org/R/trunk/src/library/stats/man/p.adjust.Rd
>shows how the new NA handling is typically much more sensible
>than using "na.rm = TRUE".
>
>Martin
>
>
> >>>>> "MM" == Martin Maechler <maechler@stat.math.ethz.ch>

> >>>>> on Sat, 8 Jan 2005 17:19:23 +0100 writes:
>
> >>>>> "GS" == Gordon K Smyth <smyth@wehi.edu.au>
> >>>>> on Sat, 8 Jan 2005 01:11:30 +1100 (EST) writes:
>
> MM> <.............>
>
> GS> p.adjust() unfortunately gives incorrect results when
> GS> 'p' includes NAs. The results from topTable are
> GS> correct. topTable() takes care to remove NAs before
> GS> passing the values to p.adjust().
>
> MM> There's at least one bug in p.adjust(): The "hommel"
> MM> method currently does not work at all with NAs (and I
> MM> have an uncommitted fix ready for this bug). OTOH, the
> MM> current version of p.adjust() ``works'' with NA's, apart
> MM> from Hommel's method, but by using "n = length(p)" in
> MM> the correction formulae, i.e. *including* the NAs for
> MM> determining sample size `n' {my fix to "hommel" would do
> MM> this as well}.
>
> MM> My question is what p.adjust() should do when there are
> MM> NA's more generally, or more specifically which `n' to
> MM> use in the correction formula. Your proposal amounts to
> MM> ``drop NA's and forget about them till the very end''
> MM> (where they are wanted in the result), i.e., your sample
> MM> size `n' would be sum(!is.na(p)) instead of length(p).

My approach to NAs (in the topTable function is the limma package) is correct in the microarray context where the NAs represent missing (non-existant) p-values which could not be computed.

> MM> To me it doesn't seem obvious that this setting "n =
> MM> #{non-NA observations}" is desirable for all P-value
> MM> adjustment methods. One argument for keeping ``n = #{all
> MM> observations}'' at least for some correction methods is
> MM> the following "continuity" one:
>
> MM> If only a few ``irrelevant'' (let's say > 0.5) P-values
> MM> are replaced by NA, the adjusted relevant small P-values
> MM> shouldn't change much, ideally not at all. I'm really
> MM> no scholar on this topic, but e.g. for "holm" I think I
> MM> would want to keep ``full n'' because of the above
> MM> continuity argument.

I don't see how the treatment of NAs follows from this continuity argument. The argument seems to be rather informal and doesn't obviously relate to the concepts of power and control of FWER and FDR, which is what the adjustment method theory is based on.

The NAs should surely be treated in a consistent way for all the adjustment methods.

> BTW, for "fdr", I don't see a
> MM> straightforward way to achieve the desired continuity.
> MM> 5D Of course, p.adjust() could adopt the possibility of
> MM> chosing how NA's should be treated e.g. by another
> MM> argument ``use.na = TRUE/FALSE'' and hence allow both
> MM> versions.
>
> MM> Feedback very welcome, particularly from ``P-value
> MM> experts'' ;-)
>
> MM> Martin Maechler, ETH Zurich

Gordon



R-devel@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sun Jan 16 18:52:37 2005

This archive was generated by hypermail 2.1.8 : Fri 18 Mar 2005 - 09:02:36 EST