From: Robert A LaBudde <ral_at_lcfltd.com>

Date: Sat, 09 Jun 2007 16:26:09 -0400

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: ral_at_lcfltd.com

R-help_at_stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sat 09 Jun 2007 - 20:34:08 GMT

At 12:57 PM 6/9/2007, Marco wrote:

><snip>
>2. I found various version of P-P plot where instead of using the
>"ecdf" function use ((1:n)-0.5)/n
> After investigation I found there're different definition of ECDF
>(note "i" is the rank):
> * Kaplan-Meier: i/n
> * modified Kaplan-Meier: (i-0.5)/n
> * Median Rank: (i-0.3)/(n+0.4)
> * Herd Johnson: i/(n+1)
> * ...
> Furthermore, similar expressions are used by "ppoints".
> So,
> 2.1 For P-P plot, what shall I use?
> 2.2 In general why should I prefer one kind of CDF over another one?
><snip>

This is an age-old debate in statistics. There are many different formulas, some of which are optimal for particular distributions.

Using i/n (which I would call the Kolmogorov method), (i-1)/n or i/(n+1) is to be discouraged for general ECDF modeling. In quality, these correspond to the rectangular rule for integrating over the bins, and they assume only that the underlying density function is piecewise constant. There is no disadvantage to using these methods, however, if the pdf has multiple discontinuities.

I tend to use (i-0.5)/n, which corresponds to integrating with the "midpoint rule". This is a 1-point Gaussian quadrature, and it is exact when the density behaves linearly over each bin (with a continuous derivative). It's simple, it's accurate, and it is near optimal for a wide range of continuous alternatives.
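For anyone who wants to see what a P-P plot built on (i-0.5)/n looks like in numbers, here is a minimal sketch in Python using only the standard library. The function names (midpoint_positions, pp_points), the choice of a fitted normal as the reference distribution, and the sample values are all my own assumptions for illustration, not anything prescribed above.

```python
from statistics import NormalDist, mean, stdev

def midpoint_positions(n):
    """Plotting positions (i - 0.5)/n for ranks i = 1..n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]

def pp_points(sample):
    """Pairs (empirical, theoretical) for a normal P-P plot.

    Assumption: the reference distribution is a normal fitted by the
    sample mean and sample standard deviation.
    """
    xs = sorted(sample)
    fitted = NormalDist(mean(xs), stdev(xs))
    empirical = midpoint_positions(len(xs))
    theoretical = [fitted.cdf(x) for x in xs]
    return list(zip(empirical, theoretical))

# Made-up data, for illustration only.
sample = [4.1, 5.3, 2.8, 6.0, 4.7, 5.1, 3.9, 4.4]
for p_emp, p_theo in pp_points(sample):
    print(f"{p_emp:.4f}  {p_theo:.4f}")
```

Plotting the second column against the first gives the P-P plot; points close to the 45-degree line suggest the reference distribution is reasonable.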

The formula (i-3/8)/(n+1/4) is optimal for the normal distribution. However, it differs from (i-0.5)/n by less than 1/(8n), with the largest difference at the extreme ranks and none at the median rank, so there is little real benefit to using it. Similarly, there is a formula (i-0.44)/(n+0.12) for a Gumbel distribution. If you do know for sure (don't need to test) the form of the distribution, you're better off fitting that distribution function directly and not worrying about the edf.
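The formulas mentioned here all fit the one-parameter family (i-a)/(n+1-2a): a = 1/2 gives (i-0.5)/n, a = 3/8 gives (i-3/8)/(n+1/4), and a = 0.44 gives (i-0.44)/(n+0.12). As far as I know, this is also the family behind R's "ppoints". The sketch below (Python, standard library only; the names plotting_positions and max_gap are mine) checks numerically how far apart the a = 3/8 and a = 1/2 positions are. The largest gap occurs at the extreme ranks and stays below 1/(8n).

```python
def plotting_positions(n, a):
    """Positions (i - a)/(n + 1 - 2a) for ranks i = 1..n."""
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]

def max_gap(n, a, b):
    """Largest absolute difference between two members of the family."""
    pa = plotting_positions(n, a)
    pb = plotting_positions(n, b)
    return max(abs(x - y) for x, y in zip(pa, pb))

# Compare a = 3/8 with a = 1/2 (the midpoint rule) for a few sample sizes.
for n in (10, 100, 1000):
    gap = max_gap(n, 3 / 8, 1 / 2)
    print(f"n={n:5d}  max gap={gap:.6f}  1/(8n)={1 / (8 * n):.6f}")
```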

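Also remember that edfs are not very accurate: their sampling error is of order 1/sqrt(n), while the formulae above differ from one another only by order 1/n. The differences between these formulae are therefore difficult to justify in practice.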
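To put rough numbers on this, one option is the Dvoretzky-Kiefer-Wolfowitz inequality, which bounds the maximum error of an edf. This is my choice of yardstick, not something the argument above depends on. It gives a uniform band of half-width sqrt(ln(2/alpha)/(2n)) with coverage at least 1-alpha. The Python sketch below (standard library only; dkw_half_width is a name I made up) compares that half-width with 0.5/n, the gap between i/n and (i-0.5)/n.

```python
import math

def dkw_half_width(n, alpha=0.05):
    """Half-width of the DKW uniform confidence band for an edf."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

# Band half-width versus the gap between i/n and (i - 0.5)/n.
for n in (10, 100, 1000):
    print(f"n={n:5d}  DKW half-width={dkw_half_width(n):.4f}  "
          f"formula gap={0.5 / n:.4f}")
```

In this comparison the band is many times wider than the gap between the two formulas at each of these sample sizes, which is consistent with the point above.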

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: ral_at_lcfltd.com
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Sun 10 Jun 2007 - 02:31:41 GMT.
