Re: [Rd] max.col oddity

From: Duncan Murdoch <murdoch_at_stats.uwo.ca>
Date: Sat 16 Dec 2006 - 17:45:40 GMT

On 12/16/2006 12:25 PM, John Zedlewski wrote:
> Duncan--
> Ah, good point, clearly setting the tolerance to 0 is bad in that
> case. Also, my code has another problem when the max is negative -- it
> will set a negative tolerance! One good fix for both problems is the
> following: set the initial value of "large" to the first value in the
> row instead of 0.0, then keep the "fmax2(a,large)" approach from my
> patch, but at the end, take the absolute value of large. That will
> always follow the current docs -- using the largest value, not the
> largest absolute value in the row, for comparison.
>
> Your proposal of changing the docs and just fixing the infinite
> problem sounds like a reasonable approach too, especially if people
> are already depending on this behavior (??), although I still think
> it's a little weird that comparing "(-1e10, 2, 3)" will say that 2
> or 3 could be the max.

I think the explanation for this is in the man page: "In this case, the determination of "tie" assumes that the entries are probabilities..." If you transform that vector linearly to force it into the [0,1] range of probabilities, the behaviour is clear:

 > x <- c(-1e10, 2, 3)
 > (x - min(x))/diff(range(x))
[1] 0 1 1
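
The same effect can be seen by computing the tolerance directly. What follows is only a minimal sketch in C, assuming the tie rule is "within 1e-5 times the largest absolute value of the row maximum"; the helper name count_tied_for_max is hypothetical and is not taken from maxcol.c, which scans the row sequentially and picks among ties at random rather than counting them.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not R's code): count the entries of one row that
 * would be treated as tied for the maximum when the tolerance is
 * reltol times the largest absolute value in the row. */
static int count_tied_for_max(const double *row, size_t n, double reltol)
{
    assert(row != NULL && n > 0);

    double large = 0.0;   /* largest |entry| seen so far */
    double max = row[0];  /* largest entry seen so far */
    for (size_t i = 0; i < n; i++) {
        double a = fabs(row[i]);
        if (a > large) large = a;
        if (row[i] > max) max = row[i];
    }

    /* Tolerance scales with the largest magnitude, not the largest value. */
    double tol = reltol * large;

    int ties = 0;
    for (size_t i = 0; i < n; i++)
        if (row[i] >= max - tol) ties++;
    return ties;
}
```

For the row (-1e10, 2, 3) with reltol = 1e-5 the tolerance is 1e5, so both 2 and 3 fall within it and the function returns 2.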

Duncan Murdoch

> Thanks,
> --JRZ
>


>
>
> On 12/16/06, Duncan Murdoch <murdoch@stats.uwo.ca> wrote:
>> On 12/15/2006 7:09 PM, John Zedlewski wrote:
>>> I've noticed that the max.col function with the default "random"
>>> option often gives unexpected results. For instance, in this test, it
>>> seems clear what the answer should be:
>>>
>>>> # second col should always be max
>>>> x1 = cbind(1:10, 2:11, -Inf)
>>>>
>>>> # this works fine
>>>> max.col(x1, "first")
>>>  [1] 2 2 2 2 2 2 2 2 2 2
>>>> # this gives random answers
>>>> max.col(x1)
>>>  [1] 3 1 1 2 3 3 1 3 1 1
>>> Ouch! max.col is randomizing across all values.
>>> Even without infinite values, something similar can happen:
>>>
>>>> # test 2 --- tolerance problems
>>>>
>>>> # clearly column 3 is the max
>>>> x1 = cbind(-1e9 * 1:10, 1:10, 2:11)
>>>>
>>>> # again, first method works:
>>>> max.col(x1, "first")
>>>  [1] 3 3 3 3 3 3 3 3 3 3
>>>> # but random doesn't
>>>> max.col(x1)
>>>  [1] 2 3 2 3 3 2 2 2 3 2
>>>
>>> The max.col docs say " there is a relative tolerance of 1e-5, relative
>>> to the largest entry in the row", but it's really using the maximum
>>> absolute value entry in the row (appl/maxcol.c, line 35 in R 2.4.0).
>>> Is this necessary for some sort of S-plus compatibility? If so, I
>>> think it would be good to make this absolute value property very clear
>>> in the docs, since it can cause subtle bugs (and did for me).
>>>
>>> Personally, I think the behavior is much nicer with the following patch:
>>>
>>> --- rplain/R-2.4.0/src/appl/maxcol.c    2006-04-09 18:19:58.000000000 -0400
>>> +++ R-2.4.0/src/appl/maxcol.c   2006-12-14 15:30:56.000000000 -0500
>>> @@ -26,13 +26,14 @@
>>>        do_rand = *ties_meth == 1;
>>>
>>>     for (r = 0; r < n_r; r++) {
>>> -       /* first check row for any NAs and find the largest abs(entry) */
>>> +       /* first check row for any NAs and find the largest entry */
>>>        large = 0.0;
>>>        isna = FALSE;
>>>        for (c = 0; c < *nc; c++) {
>>>            a = matrix[r + c * n_r];
>>>            if (ISNAN(a)) { isna = TRUE; break; }
>>> -           if (do_rand) large = fmax2(large, fabs(a));
>>> +           if (!R_FINITE(a)) continue;
>>> +           if (do_rand) large = fmax2(large, a);
>>>        }
>>>        if (isna) { maxes[r] = NA_INTEGER; continue; }
>>>
>>> ---------------- END   ----------------------
>>>
>>> This gives the expected behavior in the two examples above.
>> I think I'd prefer to be clear about what "largest" means in the docs
>> rather than dropping the absolute value, because if all entries are
>> negative, your version sets the tolerance to 0.
>>
>> But ignoring infinite values when calculating the largest absolute value
>> might be a good idea:  I can't see why someone would want an infinite
>> tolerance.
>>
>> So I'd propose to add your R_FINITE check, and make this change to the docs:
>>
>> " there is a relative tolerance of 1e-5, relative
>>    to the largest finite absolute value in the row"
>>
>> The copyright notice in the file says this function is from MASS/MASS.c
>> by W. N. Venables and B. D. Ripley, so I'd like to hear from at least
>> one of them before making a change.
>>
>> Duncan Murdoch
>>
>>> (Sorry to crosspost to both this list and R-help, but it was suggested
>>> that R-devel would be a more appropriate forum for this.)
>>>
>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>

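For comparison, the three choices of "largest" discussed in this thread can be put side by side. This is a hedged sketch, not the code in maxcol.c: the function names ref_abs, ref_patch and ref_finite_abs are hypothetical, and isfinite() from math.h stands in for R's R_FINITE. Each function returns the reference value that the 1e-5 relative tolerance would be multiplied by; NA handling is left out.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* (a) Behaviour described for R 2.4.0: largest absolute value,
 *     infinities included. */
static double ref_abs(const double *row, size_t n)
{
    assert(row != NULL);
    double large = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = fabs(row[i]);
        if (a > large) large = a;
    }
    return large;
}

/* (b) The posted patch: largest finite value, starting from 0.0,
 *     so an all-negative row yields 0. */
static double ref_patch(const double *row, size_t n)
{
    assert(row != NULL);
    double large = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (!isfinite(row[i])) continue;
        if (row[i] > large) large = row[i];
    }
    return large;
}

/* (c) The proposed wording: largest finite absolute value. */
static double ref_finite_abs(const double *row, size_t n)
{
    assert(row != NULL);
    double large = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (!isfinite(row[i])) continue;
        double a = fabs(row[i]);
        if (a > large) large = a;
    }
    return large;
}
```

On the row (1, 2, -Inf), (a) gives an infinite reference value while (b) and (c) give 2. On the all-negative row (-3, -2, -1), (b) gives 0, which is the zero-tolerance case raised above, while (a) and (c) give 3.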


Received on Sun Dec 17 07:01:13 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Sun 17 Dec 2006 - 00:31:03 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.