From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>

Date: Tue 07 Feb 2006 - 08:52:43 GMT

For the record, some of these claims are untrue:

> df(0, 2, 2)

[1] 1

> df(0, 1.3, 2)

[1] Inf

> x <- 1e-170

> pbeta(x, x, x)

[1] 0.5

qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is an error, and so is qnbinom(1E-300,0.000002,10000000000)

Here we can guess at what you meant, maybe correctly. There were comments in the source code about needing a better search, and I have recently implemented one. So

> qnbinom(1e-10, 1e3, 1e-7) # instant

[1] 8117986721

> qnbinom(0.5, 10000000000, 0.000000002)

[1] 5e+18

> qnbinom(1e-300, 10000000000, 0.000002)

[1] 4.998138e+15

seem to be solved.
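The stalled search described in the report quoted below (a unit step that "does not alter the value of y") is ordinary double-precision absorption. The following is only a sketch of that effect, using the magnitudes of the results above; the relative step at the end is an assumption made for illustration and is not taken from the R sources.

```r
# A double has a 53-bit significand, so above 2^53 (about 9.007e15) not every
# integer is representable and adding 1 can leave the value unchanged.
y     <- 4.998138e+15   # magnitude of the qnbinom(1e-300, ...) result above
y_big <- 5e+18          # magnitude of the qnbinom(0.5, ...) result above

(y + 1) != y            # TRUE: below 2^53 a unit step still moves y
(y_big + 1) == y_big    # TRUE: the unit step is absorbed, so y = y + 1 stalls

# Hypothetical remedy (an assumption, not R's actual code): step by an amount
# proportional to y, and shrink the step as the search closes in.
step <- max(1, floor(y_big * 1e-3))
(y_big + step) > y_big  # TRUE: a relative step always makes progress
```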

There were two problems with dbeta, one easily overcome (f underflows), the other fundamental to the way dbinom_raw is computed (n*p can underflow). What I cannot see is why a formula which worked correctly in this region was replaced by one that did not. It is precisely in order not to generate such errors that I used TOMS 708 only in the area where the existing algorithm is problematic. (It may be better elsewhere, but I did not have the time to do the requisite analysis. It seems neither do some other people: I would prefer not to spend the time to clear up after such unneeded changes.)
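To illustrate the kind of underflow meant by "n*p can underflow", here is a minimal sketch in plain R arithmetic. It is not the dbinom_raw code; the helper ratio_via_logs is a hypothetical name introduced only for this example.

```r
x <- 1e-170

# The product of two tiny doubles underflows to exactly zero ...
x * x == 0                  # TRUE: 1e-340 is below the smallest subnormal (~4.9e-324)

# ... although the same quantity is perfectly representable on the log scale.
log_prod <- log(x) + log(x)
is.finite(log_prod)         # TRUE

# Hypothetical helper (name and form are assumptions for illustration):
# evaluate a * b / c for tiny positive operands as exp(sum/difference of logs).
ratio_via_logs <- function(a, b, c) exp(log(a) + log(b) - log(c))

naive  <- (x * x) / x       # 0: the intermediate product has already underflowed
stable <- ratio_via_logs(x, x, x)   # close to 1e-170
```

The point of the design is only that intermediate quantities stay on the log scale until the final result is known to be representable.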

On pt, thank you for the report. pt(x, df=1) is not interesting for |x| > 1e150, but it is for smaller values of df and those were underflowing. It is easy to use an asymptotic formula to regain the accuracy.
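As a sketch of what an asymptotic formula for the far tail can look like, the leading-order term of the Student t lower tail can be evaluated entirely on the log scale. This is the textbook leading term and is not claimed to be the formula now used in R; log_pt_tail is a hypothetical name.

```r
# Leading-order left tail of Student's t, computed on the log scale:
#   P(T <= -x) ~ c(df) * df^((df - 1)/2) * x^(-df)   as x -> Inf,
# where c(df) = gamma((df + 1)/2) / (sqrt(df * pi) * gamma(df/2)).
# log_pt_tail is a hypothetical name; this is not the formula R uses.
log_pt_tail <- function(x, df) {
  stopifnot(x > 0, df > 0)
  lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi) +
    ((df - 1) / 2) * log(df) - df * log(x)
}

# For df = 1 the t distribution is the Cauchy, so pcauchy gives a check.
approx1 <- log_pt_tail(1e10, 1)
exact1  <- pcauchy(-1e10, log.p = TRUE)

# For df = 2 the leading term is 1/(2 x^2); at x = 1e200 the probability
# itself (about 5e-401) underflows, but its logarithm is finite.
approx2 <- log_pt_tail(1e200, 2)
```

Because only logarithms are formed, the result stays finite in regions where x^(-df) would underflow to zero.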

R was potentially generating reports of lack of convergence and loss of accuracy in quite a number of its algorithms, but for reasons unknown to me these were being buried (ML_ERROR did nothing, and has not for a very long time). It's a matter of debate whether in some of these it would be better to return NaN as well, but warnings should have been generated (and now are).

As for 'panic' (your word: why is it 'panic' to submit a correct bug report?), a major platform returning +Inf for a log probability is very bad news, as is another failing a regression test by getting NaN for a probability which is 0.5.

On Sat, 28 Jan 2006 IandJMSmith@aol.com wrote:

> On 23/02/05 I suggested that given R had included TOMS 708 to correct for the
> poor performance of pbeta, TOMS 654 should be included to fix all the pgamma
> problems. I was slightly surprised to find Morten's code had been included
> instead 2 days later. I noticed but did not worry that the reference to me had
> been removed.
>
> The derivation of the asymptotic expansion for the gamma distribution used by
> Morten can be found at http://members.aol.com/iandjmsmith/PoissonApprox.htm
> It is fairly easy to understand and find error bounds for and hence include
> sensibly in an algorithm to calculate pgamma.
>
> The basis and accuracy of some of the algorithms I use are discussed in
> http://members.aol.com/iandjmsmith/Accuracy.htm In this case, the absolute
> error in the log of the probability gives a good indication of the accuracy
> of your answer. In the least extreme example you consider
> (pgamma(0.9*1e25,1e25,log=T)) the absolute error would be about 5360515 and
> if you exponentiated the result the relative error would be about 10 to the
> power 2328042. So the answer you wish to calculate is K times 10 to the power
> -2.32804237034994E+22, where K is somewhere between 10 to the power plus or
> minus 2328042. In other words when you make the changes to correct this
> problem, your calculation will still return values with no real meaning but
> at least users might be aware of this which would be no bad thing! For me
> this answer is possibly so meaningless that NaN would be preferable.
>
> I did mention to Morten that I had updated my code but I believed that for
> Gnumeric he was quite satisfied with what he had. If you look at the VBA code
> at http://members.aol.com/iandjmsmith/Examples.txt you can see the changes I
> made to stop the overflow problems you seem to be worried about. My code for
> the pdf of the gamma distribution still fails for shape parameters > 2e307
> due to multiplication of the shape parameter by 2pi. The code for dgamma will
> have the same problem unless it is hidden by use of an 80 or more bit
> floating point processor. The code for the asymptotic expansion for the gamma
> distribution seems to be fine for any number, excluding silly ones like NaN
> and Inf. Indeed it takes the difference from the mean as a parameter and if
> you supply an accurate value you get a sensible answer as mentioned in
> http://members.aol.com/iandjmsmith/Accuracy.htm
>
> I do not share your apparent sense of panic on this matter. I have no
> problems with error signals like NaNs because it is obvious to the user that
> things have gone wrong. Inaccurate answers when the user has no reason to
> expect them are usually far more difficult to spot and in many cases the
> results are just believed. That for me is a serious problem. I think you will
> find that the pgamma code of 2.0.0 did not work for small shape parameters
> (similar to the problems exhibited by pbeta still for small parameters see
> PR#2894), was inaccurate for large shape parameters (> 1e5) when it resorted
> to the normal approximation and was pretty slow in between. Indeed, the
> normal approximation was the cause of PR#7307.
>
>
> I don't understand your comments about
> "pt_ = -x * (log(1 + (lambda-x)/x) - (lambda-x)/x) = -x * log((lambda-x)/x) -
> (lambda-x)
> and naively assumes that this is small enough to use a power series expansion
> in 1/x with coefficients as powers of pt_. To make matters worse, consider …"
> In the example you go on to discuss, |(lambda-x)/x| is 0.1 and I don't think
> it can be bigger than 0.2. Calculating log(1+x)-x is done several ways. If
> |x| < .01 it is evaluated by a power series, if x < -0.8 or x > 1 it uses
> log1p(1+x)-x and for other values it uses a continued fraction which
> essentially evaluates more of the same series used when |x| < .01.
>
> Your comments about replacing logspace_add with logspace_sub with simpler
> code which works seem at first sight to be a very sensible improvement.
> However, I would be a bit nervous that lnd-lnp could be very large and the
> exp of it could return infinity. I'm sure this can be accounted for in the
> code and lnp + log1p(f*exp(lnd-lnp)) evaluated as lnp or log(f)+lnd
> accordingly.
>
> I am not responsible for the code for calculating the logs of probabilities
> but I seem to remember that the extremely poor performance of the algorithms
> in R 2.0.0 with logged probabilities was one of the reasons Morten became
> interested in changing the pgamma code (see PR#7307). I have had a quick look
> and once the corrections mentioned above are made it should be giving
> nonsense answers with no difficulty.
>
>
> Unfortunately there are still a few examples of sloppy coding and accuracy
> errors remaining in R.
>
> The non-central distribution functions have horrible 1- cancellation errors
> associated with them (see PR#7099) and separate code is required for the two
> tails of the distributions to get round the problem.
>
> The fix for PR#8251 is a kludge and just moves the inaccuracies to examples
> with higher non-centrality parameters.
>
> pt(x,1) will overflow or return 0 for values < -2e154 for 64-bit
> implementations. pcauchy works but I believe the pt function is also supposed
> to work for non integral degrees of freedom so making it work for one degree
> of freedom via pcauchy is hardly much use.
>
> qnbinom(1e-10,1e3,1e-7,TRUE,FALSE) is slow and by varying the parameters,
> qnbinom can be made very slow indeed. I do not think there is anything wrong
> with the Cornish-Fisher expansion. It just seems that it is not always very
> good for the Negative Binomial distribution. In the example above, the
> initial approximation is out by 2e6.
>
> A slightly different problem which can cause qnbinom and qbinom to go into
> infinite loops is when the q-value is too big. For example
> qnbinom(1E-300,0.000002,10000000000) should return 4.99813787561159E+15
> approx but the code works with values where one of the statements
> y := y +1 or y = y - 1; is executed but does not alter the value of y.
>
> df(0,2,2,FALSE) should be 1 not 0
> df(0,df,2,FALSE) should be infinity for df < 2 not 0
> dbeta(1e-162,1e-162,1e-162,FALSE) should be 0.5 not 0
>
> Presumably R also has similar problems with the pbeta function. As I recall
> the TOMS 708 code was pretty much included without edits and therefore didn't
> calculate logs of probabilities except by calculating the probability and
> then logging it. I assumed this was why it was not used for small shape
> parameters where the current code does not work, although it did not seem
> logical to me. Of course, my memory is not what it was but if that is the
> case and there are problems with modifying the TOMS code, you could try an
> asymptotic expansion based on
> http://members.aol.com/iandjmsmith/BinomialApprox.htm
>
> This response has been very rushed. I do not write well when I have plenty of
> time and I felt I had so many different things to say so I apologise if it is
> all a bit of a jumble.
>
> Ian Smith
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Tue Feb 07 19:59:31 2006

This archive was generated by hypermail 2.1.8 : Mon 20 Feb 2006 - 03:21:40 GMT