From: Charilaos Skiadas <cskiadas_at_gmail.com>

Date: Sat, 14 Jun 2008 08:09:10 -0400

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Sat 14 Jun 2008 - 12:13:51 GMT

On Jun 14, 2008, at 2:59 AM, T.D.Rudolph wrote:

> I can't speak to the intricacies of the formula but when I run the
> ByDataFrame() function provided on a subsample of my data (n=50) it
> returned only the very first id value in the output; the rest came out
> as <NA>....
> This is not to say it has not properly selected the rows with
> min(x$diff), but I have no way of verifying without the id membership
> in the output.

And equally, we can't help you with that without a reproducible example. Doesn't it do the right thing on the little sample I posted? Note that it moves the id and day columns to the end of the output. Apart from that, the only things I can think of that might cause trouble are that you have a matrix instead of a data.frame, or that the columns have some class I have not anticipated. Perhaps you can send me a part of your data off-list, if you can't post it here?
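A minimal sketch of the checks suggested above, using base R only. The data frame `dat` and its subsample `sub` are hypothetical stand-ins for the real data (only the column names follow the example DF from the earlier message); leftover factor levels are included as one more thing worth ruling out after subsetting, which is an assumption and not a confirmed diagnosis.

```r
# 'dat' is a hypothetical stand-in for the real data; the column names
# follow the example DF from the earlier message.
dat <- data.frame(
  id   = factor(c("a", "a", "b", "c")),
  day  = factor(c("01-01-09", "01-02-09", "01-01-09", "01-02-09")),
  diff = c(0.5, 0.1, 0.2, 0.4)
)
sub <- dat[dat$id != "b", ]   # a subsample, as in the report

is.data.frame(sub)            # FALSE would point to a matrix or other object
sapply(sub, class)            # class of every column
nlevels(sub$id)               # levels recorded in the factor
length(unique(sub$id))        # values that actually occur in the subsample

sub <- droplevels(sub)        # discard factor levels that no longer occur
dput(sub)                     # paste this output into a post to the list
```

The `dput()` output can be pasted straight into a message, which gives the list a self-contained example to run.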

Haris Skiadas

Department of Mathematics and Computer Science
Hanover College

> Charilaos Skiadas-3 wrote:

>>
>> On Jun 14, 2008, at 1:25 AM, T.D.Rudolph wrote:
>>
>>> aggregate() is indeed a useful function in this case, but it only
>>> returns the columns by which it was grouped. Is there a way I can use
>>> this while simultaneously retaining all the other column values in
>>> the dataframe?
>>>
>>> e.g. add superfluous (yet pertinent for later) column containing any
>>> information at all and retain it in the final output
>>
>> I had exactly this kind of need many times, and I have finally
>> created a function for it, which I hope to include soon in an
>> upcoming package. Here is a run of it (I added an extra "A" column
>> containing just the numbers 1:8):
>>
>> > DF
>>   id      day diff A
>> 1  1 01-01-09  0.5 1
>> 2  1 01-01-09  0.7 2
>> 3  2 01-01-09  0.2 3
>> 4  2 01-01-09  0.4 4
>> 5  1 01-02-09  0.1 5
>> 6  1 01-02-09  0.3 6
>> 7  2 01-02-09  0.3 7
>> 8  2 01-02-09  0.4 8
>> > byDataFrame(DF, list(id, day), function(x) x[which.min(x$diff),])
>>   diff A id      day
>> 1  0.5 1  1 01-01-09
>> 2  0.2 3  2 01-01-09
>> 3  0.1 5  1 01-02-09
>> 4  0.3 7  2 01-02-09
>>
>> Would that do what you want?
>>
>> I've appended the function byDataFrame, and its prerequisite, a
>> function parseIndexList. I'm not quite set on the names yet, but
>> anyway. Hope this helps. I haven't really tested it on large sets, it
>> might perform poorly. Any suggestions on speeding the code /
>> corrections are welcome.
>>
>> Haris Skiadas
>> Department of Mathematics and Computer Science
>> Hanover College
>>
>> parseIndexList <- function(indexList) {
>>   # Encode every combination of index values as one integer group code.
>>   # A bare vector is wrapped as a single index, as tapply does.
>>   if (!is.list(indexList))
>>     indexList <- list(indexList)
>>   nI <- length(indexList)
>>   namelist <- vector("list", nI)
>>   names(namelist) <- names(indexList)
>>   extent <- integer(nI)
>>   nx <- length(indexList[[1]])
>>   one <- as.integer(1)
>>   group <- rep.int(one, nx)
>>   ngroup <- one
>>   for (i in seq_along(indexList)) {
>>     # factor() drops unused levels, so the codes agree with
>>     # sort(unique(...)) below even for a subsetted factor.
>>     index <- factor(indexList[[i]])
>>     if (length(index) != nx)
>>       stop("arguments must have same length")
>>     namelist[[i]] <- sort(unique(indexList[[i]]))
>>     extent[i] <- length(namelist[[i]])
>>     group <- group + ngroup * (as.integer(index) - one)
>>     ngroup <- ngroup * nlevels(index)
>>   }
>>   # Row k of the grid holds the index values that belong to group code k.
>>   nms <- do.call(expand.grid, namelist)
>>   ind <- unique(sort(group))
>>   res <- data.frame(index=ind, nms[ind, , drop=FALSE])
>>   return(list(cases=group, groups=res))
>> }
>>
>> byDataFrame <- function (data, INDEX, FUN, newnames,
>>                          omit.index.cols=TRUE, ...) {
>>   # Part of the code shamelessly stolen from tapply
>>   IND <- eval(substitute(INDEX), data)
>>   nms <- as.character(as.list(substitute(INDEX)))
>>   if (!is.list(IND)) {
>>     IND <- list(IND)
>>     names(IND) <- nms
>>   } else {
>>     names(IND) <- nms[-1]   # drop the leading "list"
>>   }
>>   funname <- paste(as.character(substitute(FUN)), collapse=".")
>>   indexInfo <- parseIndexList(IND)
>>   FUNx <- if (omit.index.cols) {
>>     omit.cols <- match(names(indexInfo$groups)[-1], names(data))
>>     function(x, ...) FUN(data[x, -omit.cols], ...)
>>   } else {
>>     function(x, ...) FUN(data[x, ], ...)
>>   }
>>   ans <- lapply(split(seq_len(nrow(data)), indexInfo$cases), FUNx, ...)
>>   index <- as.numeric(names(ans))
>>   if (!is.data.frame(ans[[1]])) {
>>     ans <- lapply(ans, function(x) {
>>       dframe <- as.data.frame(t(x))
>>       if (is.null(names(x)))
>>         names(dframe) <- funname
>>       dframe
>>     })
>>   }
>>   lengths <- sapply(ans, nrow)
>>   ans <- do.call(rbind, ans)
>>   if (!missing(newnames))
>>     names(ans) <- newnames
>>   # Look the group codes up by value. The codes are not contiguous when
>>   # some combinations of index values are absent, so using them directly
>>   # as row positions would return wrong or NA index values.
>>   rows <- match(index, indexInfo$groups$index)
>>   nms <- indexInfo$groups[rep(rows, lengths), -1, drop=FALSE]
>>   res <- cbind(ans, nms)
>>   res
>> }
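For the small example in the quoted message, the same selection can also be sketched with base R alone. This is an alternative illustration, not the byDataFrame implementation: it rebuilds the printed DF by hand, splits the row numbers by every id/day combination that actually occurs, and keeps the complete row with the smallest diff in each group. Unlike byDataFrame, it leaves the columns in their original order.

```r
# Example data rebuilt by hand from the DF printed in the quoted message.
DF <- data.frame(
  id   = c(1, 1, 2, 2, 1, 1, 2, 2),
  day  = rep(c("01-01-09", "01-02-09"), each = 4),
  diff = c(0.5, 0.7, 0.2, 0.4, 0.1, 0.3, 0.3, 0.4),
  A    = 1:8
)

# Split the row numbers by each id/day combination.
# drop = TRUE discards combinations that have no rows at all.
groups <- split(seq_len(nrow(DF)), list(DF$id, DF$day), drop = TRUE)

# Within each group, keep the whole row that has the smallest diff.
picked <- lapply(groups, function(rows) {
  g <- DF[rows, , drop = FALSE]
  g[which.min(g$diff), , drop = FALSE]
})

res <- do.call(rbind, picked)
rownames(res) <- NULL
print(res)
```

Because every column travels with the selected row, the id and day values stay attached to each result and can be checked directly.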

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Sat 14 Jun 2008 - 15:00:44 GMT.
