Re: [R] Bug in lowess

From: Berton Gunter <gunter.berton_at_gene.com>
Date: Fri 13 Oct 2006 - 17:04:35 GMT


Folks:

This interesting dicussion brings up an issue of what I have referred to for some time as "safe statistics," by which I mean:

Usually, but not necessarily automated)Statistical procedures that are guranteed to give either

(a) a "reasonable" answer; or
(b) Do not give an answer and when possible emit "useful" error messages.

All standard least squares procedures taught in basic statistics courses are examples (from many different perspectives) of unsafe statistics. Robustness/resistance clearly takes us some way along this path, but as is clear from the discussion, not the whole way. The reason I think that this is important is

  1. Based on my own profound ignorance/limitations, I think it's impossible to expect those who need to use many different kinds of sophisticated statistical analyses to understand enough of the technical details to be able to actively and effectively guide their appropriate when this requires such guidance (e.g., least aquares with extensive diagnostics; overfitting in nonlinear regression);
  2. The explosion of large complex data in all disciplines that **require** some sort of automated analyses to be used (e.g. microarray data?).

Having said this, it is unclear to me even **if** "safe statistics" is a meaningful concept: can it ever be -- at all? But I believe one thing is clear: A lot of people devote a lot of labor to "optimal" procedures that are far too sensitive to the manifold peculiarities of real data to give reliable, trustworthy results in practice considerable expert coaxing. We at least need a greater variety of less optimal but safer data analysis procedures. R -- or rather it's many contributors-- seems to me to be the exception in recognizing and doing something about this.

And as a humble example of what I mean: I like simple running medians of generally small span for "smoothing" sequential data (please don't waste time giving me counterexamples of why this is bad or how it can go wrong -- I know there are many).

I would appreciate anyone else's thoughts on this, pro or con, perhaps privately rather than on the list if you view this as too far off-topic.

(NOTE: TO be clear: My personal views, not those of my company or colleagues)

My best regards to all,

Bert

Bert Gunter
Genentech Nonclinical Statistics
South San Francisco, CA 94404

-----Original Message-----
From: r-help-bounces@stat.math.ethz.ch
[mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Frank E Harrell Jr Sent: Friday, October 13, 2006 5:51 AM
To: Prof Brian Ripley
Cc: R-help@r-project.org
Subject: Re: [R] Bug in lowess

Prof Brian Ripley wrote:
> Frank Harrell wrote:
>
> [...]
>

>> Thank you Brian.  It seems that no matter what is the right answer, the
>> answer currently returned on my system is clearly wrong.  lowess()$y
>> should be constrained to be within range(y).

>
> Really? That assertion is offered without proof and I am pretty sure is
> incorrect. Consider
>
>> x <- c(1:10, 20)
>> y <- c(1:10, 5) + 0.01*rnorm(11)
>> lowess(x,y)

> $x
> [1] 1 2 3 4 5 6 7 8 9 10 20
>
> $y
> [1] 0.9983192 1.9969599 2.9960805 3.9948224 4.9944158 5.9959855
> [7] 6.9964400 7.9981434 8.9990607 10.0002567 19.9946117
>
> Remember that lowess is a local *linear* fitting method, and may give
> zero weight to some data points, so (as here) it can extrapolate.

Brian - thanks - that's a good example though not typical of the kind I see from patients.

>
> After reading what src/appl/lowess.doc says should happen with zero
> weights, I think the answer given on Frank's system probably is the
> correct one. Rounding error is determining which of the last two points
> is given zero robustness weight: on a i686 system both of the last two
> are, and on mine only the last is. As far as I can tell in
> infinite-precision arithmetic both would be zero, and hence the value at
> x=120 would be found by extrapolation from those (far) to the left of it.
>
> I am inclined to think that the best course of action is to quit with a
> warning when the MAD of the residuals is effectively zero. However, we
> need to be careful not to call things 'bugs' that we do not understand
> well enough. This might be a design error in lowess, but it is not
> AFAICS a bug in the implementation.

Yes it appears to be a weakness in the underlying algorithm.

Thanks
Frank



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat Oct 14 15:40:34 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sat 14 Oct 2006 - 06:30:10 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.