Re: [R] by inconsistently strips class - with fix

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu, 17 Apr 2008 07:03:33 +0100 (BST)

Unfortunately, your proposed change alters the type of the output: simplification is intended in many applications of by().

Before:

> str(by(mytimes$date[1], mytimes$set[1], function(x)x))
  by [, 1] 1.21e+09

After:

> str(by(mytimes$date[1], mytimes$set[1], function(x)x))
List of 1
  $ 1: POSIXct[1:1], format: "2008-04-17 06:53:31"

c() does not do the same thing as unlist() in general, and it is untrue that 'c does not strip class'. What happens in your example is that there is a c() method for your class (and not many others).
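
A minimal sketch of that distinction, using fixed timestamps and a made-up class name ('myclass' is hypothetical, chosen only because it has no c() method). It assumes a stock R installation, where unlist() drops the class attribute and c() on date-times dispatches to the c() method for POSIXct:

```r
# Two POSIXct values with a fixed time zone, so results are reproducible.
x   <- as.POSIXct("2008-04-17 06:53:31", tz = "UTC")
lst <- list(x, x + 1)

u <- unlist(lst)       # attributes dropped: a bare numeric vector
d <- do.call(c, lst)   # dispatches to the c() method for POSIXct

class(u)   # "numeric"
inherits(d, "POSIXct")   # TRUE: the method rebuilds the class

# A class with no c() method (hypothetical): default c() drops the class.
y  <- structure(1:2, class = "myclass")
cy <- c(y, y)
class(cy)  # "integer"
```

So whether c() keeps the class depends entirely on whether a method exists for it.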

What we could do is add a 'simplify' argument to by() so that you can control the simplification.
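
As a rough illustration of what such control looks like, the sketch below assumes a version of R in which by() accepts simplify and passes it on to tapply() (this is an assumption about the R version in use, not something available at the time of this message). The helper first_date is hypothetical and only serves to return one value per group:

```r
# Fixed timestamps so the example does not depend on Sys.time().
mytimes <- data.frame(
  date = as.POSIXct("2008-04-15 11:41:38", tz = "UTC") + 0:2,
  set  = c(1, 1, 2)
)

# Hypothetical helper: returns a single date-time per group.
first_date <- function(d) d$date[1]

# Default behaviour: length-one results are simplified via unlist().
simplified <- by(mytimes, mytimes$set, first_date)

# Simplification switched off: each group's result is kept as returned.
unsimplified <- by(mytimes, mytimes$set, first_date, simplify = FALSE)

inherits(simplified[[1]], "POSIXct")     # class is lost after simplification
inherits(unsimplified[[1]], "POSIXct")   # class is kept
```

The default stays simplifying, so existing code that relies on an atomic result is unaffected.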

On Tue, 15 Apr 2008, Alex Brown wrote:

> summary:
>
> The function 'by' inconsistently strips class from the data to which
> it is applied.
>
> quick reason:
>
> tapply strips class when simplify is set to TRUE (the default) due to
> the class stripping behaviour of unlist.
>
> quick answer:
>
> This can be fixed by invoking tapply with simplify=FALSE, or by changing
> tapply to use do.call(c, ...) instead of unlist()
>
> executable example:
>
> mytimes=data.frame(date = 1:3 + Sys.time(), set = c(1,1,2))
>
> by(mytimes$date, mytimes$set, function(x)x)
>
> INDICES: 1
> [1] "2008-04-15 11:41:38 BST" "2008-04-15 11:41:39 BST"
> ----------------------------------------------------------------------------------------
> INDICES: 2
> [1] "2008-04-15 11:41:40 BST"
>
> by(mytimes[1,]$date, mytimes[1,]$set, function(x)x)
>
> INDICES: 1
> [1] 1208256099
>
> why this is a problem:
>
> This is a problem when you are feeding the output of this by into a
> function which expects the class to be maintained. I see this problem
> when constructing
>
> reason:
>
> tapply strips class when simplify is set to TRUE (the default) due to
> the behaviour of unlist:
>
> "Where possible the list elements are coerced to a common mode during
> the unlisting, and so the result often ends up as a character vector.
> Vectors will be coerced to the highest type of the components in the
> hierarchy NULL < raw < logical < integer < real < complex < character
> < list < expression: pairlists are treated as lists."
>
> solution:
>
> This problem can be fixed in the function by.data.frame by modifying
> the call to tapply in the function "by":
>
> by.data.frame = function (data, INDICES, FUN, ...)
> {
>     if (!is.list(INDICES)) {
>         IND <- vector("list", 1)
>         IND[[1]] <- INDICES
>         names(IND) <- deparse(substitute(INDICES))[1]
>     }
>     else IND <- INDICES
>     FUNx <- function(x) FUN(data[x, ], ...)
>     nd <- nrow(data)
> <<<<
>     ans <- eval(substitute(tapply(1:nd, IND, FUNx)), data)
> ====
>     ans <- eval(substitute(tapply(1:nd, IND, FUNx, simplify=FALSE)),
>         data)
> >>>>
>     attr(ans, "call") <- match.call()
>     class(ans) <- "by"
>     ans
> }
>
> alternative solution:
>
> the call in tapply to unlist(ans, recursive=F) can be replaced by
> do.call(c,ans, recursive=F) to fix this issue, since c does not strip
> class.
>
> However, I haven't taken the time to work out if this will work in all
> cases.
>
> for example:
>
> function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
> {
>     FUN <- if (!is.null(FUN))
>         match.fun(FUN)
>     if (!is.list(INDEX))
>         INDEX <- list(INDEX)
>     nI <- length(INDEX)
>     namelist <- vector("list", nI)
>     names(namelist) <- names(INDEX)
>     extent <- integer(nI)
>     nx <- length(X)
>     one <- 1L
>     group <- rep.int(one, nx)
>     ngroup <- one
>     for (i in seq.int(INDEX)) {
>         index <- as.factor(INDEX[[i]])
>         if (length(index) != nx)
>             stop("arguments must have same length")
>         namelist[[i]] <- levels(index)
>         extent[i] <- nlevels(index)
>         group <- group + ngroup * (as.integer(index) - one)
>         ngroup <- ngroup * nlevels(index)
>     }
>     if (is.null(FUN))
>         return(group)
>     ans <- lapply(split(X, group), FUN, ...)
>     index <- as.integer(names(ans))
>     if (simplify && all(unlist(lapply(ans, length)) == 1)) {
>         ansmat <- array(dim = extent, dimnames = namelist)
> <<<<
>         ans <- unlist(ans, recursive = FALSE)
> ====
>         ans <- do.call(c, ans, recursive = FALSE)
> >>>>
>     }
>     else {
>         ansmat <- array(vector("list", prod(extent)), dim = extent,
>             dimnames = namelist)
>     }
>     if (length(index)) {
>         names(ans) <- NULL
>         ansmat[index] <- ans
>     }
>     ansmat
> }
>
> Alexander Brown
> Principal Engineer
> Transitive
> Maybrook House, 40 Blackfriars Street, Manchester M3 2EG
> Phone: +44 (0)161 836 2321 Fax: +44 (0)161 836 2399 Mobile: +44
> (0)7980 708 221
> www.transitive.com
> * The leader in cross-platform virtualization
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Thu 17 Apr 2008 - 07:21:41 GMT