Re: [Rd] ecdf with lots of ties is inefficient (PR#7292)

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Sun 17 Oct 2004 - 17:24:04 EST


This seems a _very_ unusual use of ecdf -- what are you using it for that a sample of size 10,000 would not do equally well?

If you have a need for a more efficient version of ecdf, please develop one and submit a patch. I don't think it would be hard as ecdf does

    x <- sort(x)
    rval <- approxfun(x, (1:n)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

_but_ it might be hard to recognize the situation you are in without much computation. Something along the lines of

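    n <- length(x)                 # total sample size, as in ecdf
    vals <- sort(unique(x))        # one knot per distinct value
    y <- tabulate(match(x, vals))  # multiplicity of each distinct value
    rval <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")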

should work better for you and may be a little slower if there are no ties, but will use more memory.
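
For anyone wanting to try this, here is a minimal sketch that wraps the two constructions above in functions and checks that they give the same step function on data with heavy ties. The names ecdf_sorted and ecdf_ties are hypothetical helpers invented for this illustration, not functions in R, and the sample is kept far smaller than the vectors in the report so that it runs quickly.

```r
# Hypothetical helper: the construction ecdf is described as using above.
ecdf_sorted <- function(x) {
    x <- sort(x)
    n <- length(x)
    approxfun(x, (1:n)/n, method = "constant", yleft = 0,
              yright = 1, f = 0, ties = "ordered")
}

# Hypothetical helper: the tabulated variant, one knot per distinct value.
ecdf_ties <- function(x) {
    n <- length(x)
    vals <- sort(unique(x))
    y <- tabulate(match(x, vals))
    approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
              yright = 1, f = 0, ties = "ordered")
}

set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)   # many values, few distinct ones

f_sorted <- ecdf_sorted(x)
f_ties   <- ecdf_ties(x)

# Evaluate at, between and beyond the observed values.
q <- c(-5, 0.5, 1, 17, 17.5, 100, 200, 250)

# Direct definition of the empirical CDF, used as a reference.
ref <- sapply(q, function(t) sum(x <= t) / length(x))

agree_each_other <- isTRUE(all.equal(f_sorted(q), f_ties(q)))
agree_reference  <- isTRUE(all.equal(f_ties(q), ref))
```

If both flags come out TRUE, the tabulated version is a drop-in replacement on this input; whether it is actually faster or lighter on memory for a given vector would still need timing.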

A quick play suggests that the real problem is not with ecdf (at least not for me with x <- sample(1:200, 2e7, replace=TRUE)), but with plotting the result. Please investigate what might be a reasonable compromise.
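
One rough way to look into the plotting side is to count how many knots each step function carries, on the assumption that the plotting cost grows with the number of knots. The sketch below relies on the function returned by approxfun keeping its x values in its enclosing environment, which is how knots() retrieves them for step functions; the variable names are made up for this illustration and the sample is much smaller than the one mentioned above.

```r
set.seed(1)
x <- sample(1:200, 1e5, replace = TRUE)
n <- length(x)

# Construction on the full sorted sample: one knot per observation.
xs <- sort(x)
f_sorted <- approxfun(xs, (1:n)/n, method = "constant", yleft = 0,
                      yright = 1, f = 0, ties = "ordered")

# Construction on the distinct values only: one knot per distinct value.
vals <- sort(unique(x))
y <- tabulate(match(x, vals))
f_ties <- approxfun(vals, cumsum(y)/n, method = "constant", yleft = 0,
                    yright = 1, f = 0, ties = "ordered")

# The knots live in the closure environment of each function.
knots_sorted <- length(environment(f_sorted)$x)
knots_ties   <- length(environment(f_ties)$x)
```

Here knots_sorted equals the sample size while knots_ties is at most 200, so something like plot(vals, cumsum(y)/n, type = "s") should draw only as many steps as there are distinct values.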

On Sun, 17 Oct 2004 martin@gsc.riken.jp wrote:

> Full_Name: Martin Frith
> Version: R-2.0.0
> OS: linux-gnu
> Submission from: (NULL) (134.160.83.73)
>
>
> I have large vectors containing 100,000 to 20,000,000 numbers. However,
> they only contain a few hundred *distinct* numbers (e.g. positive
> integers < 200). When I do ecdf(v), it either runs out of memory, or it
> succeeds, but when I plot the ecdf with postscript, the output is
> unnecessarily bloated because the same lines get redrawn many times. The
> complexity of ecdf should depend on how many distinct numbers there are,
> not how many total numbers.
>
> This is my first bug report, so forgive me if I've done something stupid!

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sun Oct 17 17:29:40 2004
