Re: [Rd] ecdf with lots of ties is inefficient (PR#7292)

From: Prof Brian Ripley
Date: Sun 17 Oct 2004 - 17:24:04 EST

This seems a _very_ unusual use of ecdf -- what are you using it for that a sample of size 10,000 would not do equally well?

If you have a need for a more efficient version of ecdf, please develop one and submit a patch. I don't think it would be hard as ecdf does

    x <- sort(x)
    n <- length(x)
    rval <- approxfun(x, (1:n)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

_but_ it might be hard to recognize the situation you are in without much computation. Something along the lines of

    n <- length(x)
    vals <- sort(unique(x))
    y <- tabulate(match(x, vals))
    rval <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

should work better for you and may be only a little slower if there are no ties, but will use more memory.
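
As a rough, untested-at-scale sketch of that idea wrapped up as a function: the name ecdf_ties and the NA handling are assumptions made for illustration, not part of R, and the check against ecdf() is only on a small sample.

```r
# Hypothetical helper (not part of R): an ECDF with one knot per distinct value.
ecdf_ties <- function(x) {
    x <- x[!is.na(x)]                 # drop NAs, as sort() would
    n <- length(x)
    if (n < 1) stop("'x' must have 1 or more non-missing values")
    vals <- sort(unique(x))           # distinct values, increasing
    # number of observations equal to each distinct value
    y <- tabulate(match(x, vals), nbins = length(vals))
    approxfun(vals, cumsum(y) / n, method = "constant",
              yleft = 0, yright = 1, f = 0, ties = "ordered")
}

set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)
Fn <- ecdf_ties(x)
q <- c(0, 1, 57.5, 200, 250)
# should agree with ecdf() at these points
print(all.equal(Fn(q), ecdf(x)(q)))
```

The interpolation table here has length(vals) entries rather than length(x), which is the point of the exercise; the match() step still touches every element of x.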

A quick play suggests that the real problem is not with ecdf (at least not for me with x <- sample(1:200, 2e7, replace=TRUE)), but with plotting the result. Please investigate what might be a reasonable compromise.
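
One possible direction for the plotting side, offered only as a sketch: build a step function with one knot per distinct value, so that the plot has at most that many steps to draw. The sample size, the object names and the temporary output file below are all illustrative assumptions.

```r
set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)   # small stand-in for the 2e7 case
n <- length(x)
vals <- sort(unique(x))
y <- tabulate(match(x, vals), nbins = length(vals))

# stepfun() wants one more height than knots: 0 to the left of the first knot
sf <- stepfun(vals, c(0, cumsum(y) / n), f = 0)

# one knot per distinct value, however long x is
print(length(knots(sf)))

out <- tempfile(fileext = ".ps")          # illustrative output file
postscript(out)
plot(sf, do.points = FALSE, verticals = TRUE, main = "ecdf(x)")
dev.off()
```

Whether this is a reasonable compromise for plot methods in general is a separate question; it only shows that the amount drawn can be tied to the number of distinct values.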

On Sun, 17 Oct 2004 wrote:

> Full_Name: Martin Frith
> Version: R-2.0.0
> OS: linux-gnu
> Submission from: (NULL) (
> I have large vectors containing 100,000 to 20,000,000 numbers. However,
> they only contain a few hundred *distinct* numbers (e.g. positive
> integers < 200). When I do ecdf(v), it either runs out of memory, or it
> succeeds, but when I plot the ecdf with postscript, the output is
> unnecessarily bloated because the same lines get redrawn many times. The
> complexity of ecdf should depend on how many distinct numbers there are,
> not how many total numbers.
> This is my first bug report, so forgive me if I've done something stupid!

Brian D. Ripley,
Professor of Applied Statistics,
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Sun Oct 17 17:29:40 2004