Re: [R] Subset by Factor by date

From: Charilaos Skiadas <cskiadas_at_gmail.com>
Date: Sat, 14 Jun 2008 08:09:10 -0400

On Jun 14, 2008, at 2:59 AM, T.D.Rudolph wrote:

>
> I can't speak to the intricacies of the formula but when I run the
> ByDataFrame() function provided on a subsample of my data (n=50) it
> returned
> only the very first id value in the output; the rest came out as
> <NA>....
> This is not to say it has not properly selected the rows with min(x
> $diff),
> but I have no way of verifying without the id membership in the
> output.

And equally we can't help you with that without a reproducible example. Doesn't it do the right thing in the little sample I posted? It moves the id and day columns to the end. Without that, the only thing I can think of that might cause trouble is that you have a matrix instead of a data.frame, or otherwise the columns have some class I have not anticipated. Perhaps you can send me a part of your data off-list, if you can't post it here?

Haris Skiadas
Department of Mathematics and Computer Science Hanover College

> Charilaos Skiadas-3 wrote:
>>
>>
>> On Jun 14, 2008, at 1:25 AM, T.D.Rudolph wrote:
>>
>>>
>>> aggregate() is indeed a useful function in this case, but it only
>>> returns the
>>> columns by which it was grouped. Is there a way I can use this
>>> while
>>> simultaneously retaining all the other column values in the
>>> dataframe?
>>>
>>> e.g. add superfluous (yet pertinent for later) column containing any
>>> information at all and retain it in the final output
>>
>> I had exactly this kind of need many times, and I have finally
>> created a function for it, which I hope to include soon in an
>> upcoming package. Here is a run of it (I added an extra "A" column
>> containing just the numbers 1:8):
>>
>>> DF
>> id day diff A
>> 1 1 01-01-09 0.5 1
>> 2 1 01-01-09 0.7 2
>> 3 2 01-01-09 0.2 3
>> 4 2 01-01-09 0.4 4
>> 5 1 01-02-09 0.1 5
>> 6 1 01-02-09 0.3 6
>> 7 2 01-02-09 0.3 7
>> 8 2 01-02-09 0.4 8
>>> byDataFrame(DF, list(id, day), function(x) x[which.min(x$diff),])
>> diff A id day
>> 1 0.5 1 1 01-01-09
>> 2 0.2 3 2 01-01-09
>> 3 0.1 5 1 01-02-09
>> 4 0.3 7 2 01-02-09
>>
>> Would that do what you want?
>>
>> I've appended the function byDataFrame, and its prerequisite, a
>> function parseIndexList. I'm not quite set on the names yet, but
>> anyway. Hope this helps. I haven't really tested it on large sets, it
>> might perform poorly. Any suggestions on speeding the code /
>> corrections are welcome.
>>
>> Haris Skiadas
>> Department of Mathematics and Computer Science
>> Hanover College
>>
>>
>>
>> parseIndexList <- function(indexList) {
>> # browser()
>> if (!is.list(indexList))
>> indexList <- as.list(indexList)
>> nI <- length(indexList)
>> namelist <- vector("list", nI)
>> names(namelist) <- names(indexList)
>> extent <- integer(nI)
>> nx <- length(indexList[[1]])
>> one <- as.integer(1)
>> group <- rep.int(one, nx)
>> ngroup <- one
>> for (i in seq.int(indexList)) {
>> index <- as.factor(indexList[[i]])
>> if (length(index) != nx)
>> stop("arguments must have same length")
>> namelist[[i]] <- sort(unique(indexList[[i]]))
>> extent[i] <- length(namelist[[i]])
>> group <- group + ngroup * (as.integer(index) - one)
>> ngroup <- ngroup * nlevels(index)
>> }
>> nms <- do.call(expand.grid, namelist)
>> ind <- unique(sort(group))
>> res <- data.frame(index=ind, nms[ind, , drop=FALSE])
>> return(list(cases=group, groups=res))
>> }
>>
>> byDataFrame <- function (data, INDEX, FUN, newnames,
>> omit.index.cols=TRUE, ...) {
>> # # Part of the code shamelessly stolen from tapply
>> IND <- eval(substitute(INDEX), data)
>> nms <- as.character(as.list(substitute(INDEX)))
>> if (!is.list(IND)) {
>> IND <- list(IND)
>> names(IND) <- nms
>> } else {
>> names(IND) <- nms[-1]
>> }
>> funname <- paste(as.character(substitute(FUN)), collapse=".")
>> indexInfo <- parseIndexList(IND)
>> FUNx <- if (omit.index.cols) {
>> omit.cols <- match(names(indexInfo$groups)[-1], names(data))
>> function(x, ...) FUN(data[x, -omit.cols], ...)
>> } else {
>> function(x, ...) FUN(data[x, ], ...)
>> }
>> ans <- lapply(split(1:nrow(data), indexInfo$cases), FUNx, ...)
>> index <- as.numeric(names(ans))
>> if (!is.data.frame(ans[[1]])) {
>> ans <- lapply(ans, function(x) {
>> dframe <- as.data.frame(t(x))
>> if (is.null(names(x)))
>> names(dframe) <- funname
>> dframe
>> })
>> }
>> lengths <- sapply(ans, nrow)
>> ans <- do.call(rbind, ans)

>> if (!missing(newnames))
>> names(ans) <- newnames
>> nms <- indexInfo$groups[rep(index, lengths),-1, drop=FALSE]
>> res <- cbind(ans, nms)
>> res
>> }



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat 14 Jun 2008 - 12:13:51 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 14 Jun 2008 - 15:00:44 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive