Re: [Rd] A suggestion for an amendment to tapply

From: Peter Dalgaard <p.dalgaard_at_biostat.ku.dk>
Date: Wed, 07 Nov 2007 08:15:17 +0100

Andrew Robinson wrote:
> These are important concerns. It seems to me that adding an argument
> as suggested by Bill will allow the user to side-step the problem
> identified by Brian.
>
> Bill, under what kinds of circumstances would you anticipate a
> significant time penalty? I would be happy to check those out with
> some simulations.
>
> If the timing seems acceptable, I can write a patch for tapply.R and
> tapply.Rd if anyone in the core is willing to consider them. Please
> contact me on or off list if so.
>
>

There's another concern: tapply (et al.) has the ... args passed on to FUN which means that you have to be really careful with argument names.
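
A minimal sketch of the kind of clash meant here. The function f and its simplify argument are hypothetical, chosen only because simplify is also a formal argument of tapply():

```r
# Hypothetical FUN whose argument name collides with tapply()'s own 'simplify'.
f <- function(x, simplify = FALSE) if (simplify) round(sum(x)) else sum(x)

x <- c(1.4, 2.4, 3.4, 4.4)
g <- factor(c("a", "a", "b", "b"))

# 'simplify = TRUE' is matched to tapply()'s own argument and never reaches f(),
# so f() runs with its default and returns unrounded sums.
r1 <- tapply(x, g, f, simplify = TRUE)

# Wrapping the call in an anonymous function keeps the argument out of tapply()'s way.
r2 <- tapply(x, g, function(v) f(v, simplify = TRUE))
```

Any new argument added to tapply would shadow a like-named argument of FUN in the same way, which is why a long, specific name is the safer choice.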

Could I just interject that we already have

 > airquality$Month <- factor(airquality$Month,levels=4:9) # April not there
 > unlist(lapply(
 + split(airquality$Ozone, airquality$Month, drop=F),sum, na.rm=T))

    4    5    6    7    8    9
    0  614  265 1537 1559  912

(splitting on multiple factors gets a bit involved, though)
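
One way the multi-factor case might be handled, as a sketch only. The Hot factor (temperature above 80) and the names aq, groups, sums and tab are made up for illustration; the approach relies on split() accepting a list of factors and on the first factor varying fastest in the resulting interaction:

```r
aq <- airquality
aq$Month <- factor(aq$Month, levels = 4:9)              # April has no rows
aq$Hot   <- factor(aq$Temp > 80, levels = c(FALSE, TRUE))

# Split on both factors at once; drop = FALSE keeps the empty combinations.
groups <- split(aq$Ozone, list(aq$Month, aq$Hot), drop = FALSE)
sums   <- sapply(groups, sum, na.rm = TRUE)

# The first factor varies fastest, so the sums fill the matrix column by column.
tab <- matrix(sums, nrow = nlevels(aq$Month),
              dimnames = list(Month = levels(aq$Month), Hot = levels(aq$Hot)))
```

The April row of tab then holds 0 rather than NA in both columns, at the cost of rebuilding the dimnames by hand.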

> Best wishes to all,
>
> Andrew
>
>
>
>
> On Tue, Nov 06, 2007 at 07:23:56AM +0000, Prof Brian Ripley wrote:
>
>> On Tue, 6 Nov 2007, Bill.Venables_at_csiro.au wrote:
>>
>>
>>> Unfortunately I think it would break too much existing code. tapply()
>>> is an old function and many people have gotten used to the way it works
>>> now.
>>>
>> It is also not necessarily desirable: FUN(numeric(0)) might be an error.
>> For example:
>>
>>
>> > Z <- data.frame(x=rnorm(10), f=rep(c("a", "b"), each=5))[1:5, ]
>> > tapply(Z$x, Z$f, sd)
>>
>> but sd(numeric(0)) is an error. (Similar things involving var are 'in the
>> wild' and so would be broken.)
>>
>>
>>> This is not to suggest there could not be another argument added at the
>>> end to indicate that you want the new behaviour, though. e.g.
>>>
>>> tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE,
>>> handle.empty.levels = FALSE)
>>>
>>> but this raises the question of what sort of time penalty the
>>> modification might entail. Probably not much for most situations, I
>>> suppose. (I know this argument name looks long, but you do need a
>>> fairly specific argument name, or it will start to impinge on the ...
>>> argument.)
>>>
>>> Just some thoughts.
>>>
>>> Bill Venables.
>>>
>>> Bill Venables
>>> CSIRO Laboratories
>>> PO Box 120, Cleveland, 4163
>>> AUSTRALIA
>>> Office Phone (email preferred): +61 7 3826 7251
>>> Fax (if absolutely necessary): +61 7 3826 7304
>>> Mobile: +61 4 8819 4402
>>> Home Phone: +61 7 3286 7700
>>> mailto:Bill.Venables_at_csiro.au
>>> http://www.cmis.csiro.au/bill.venables/
>>>
>>> -----Original Message-----
>>> From: r-devel-bounces_at_r-project.org
>>> [mailto:r-devel-bounces_at_r-project.org] On Behalf Of Andrew Robinson
>>> Sent: Tuesday, 6 November 2007 3:10 PM
>>> To: R-Devel
>>> Subject: [Rd] A suggestion for an amendment to tapply
>>>
>>> Dear R-developers,
>>>
>>> when tapply() is invoked on factors that have empty levels, it returns
>>> NA. This behaviour is in accord with the tapply documentation, and is
>>> reasonable in many cases. However, when FUN is sum, it would also
>>> seem reasonable to return 0 instead of NA, because "the sum of an
>>> empty set is zero, by definition."
>>>
>>> I'd like to raise a discussion of the possibility of an amendment to
>>> tapply.
>>>
>>> The attached patch changes the function so that it checks if there are
>>> any empty levels, and if there are, replaces the corresponding NA
>>> values with the result of applying FUN to the empty set. Eg in the
>>> case of sum, it replaces the NA with 0, whereas with mean, it replaces
>>> the NA with NA, and issues a warning.
>>>
>>> This change has the following advantage: tapply and sum work better
>>> together. Arguably, tapply and any other function that has a non-NA
>>> response to the empty set will also work better together.
>>> Furthermore, tapply shows a warning if FUN would normally show a
>>> warning upon being evaluated on an empty set. That deviates from
>>> current behaviour, which might be bad, but also provides information
>>> that might be useful to the user, so that would be good.
>>>
>>> The attached script provides the new function in full, and
>>> demonstrates its application in some simple test cases.
>>>
>>> Best wishes,
>>>
>>> Andrew
>>>
>>>
>> --
>> Brian D. Ripley, ripley_at_stats.ox.ac.uk
>> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford, Tel: +44 1865 272861 (self)
>> 1 South Parks Road, +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK Fax: +44 1865 272595
>>
>
>

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard_at_biostat.ku.dk)                  FAX: (+45) 35327907

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 07 Nov 2007 - 07:23:05 GMT
