Re: [R] Looking for a sort of tapply() to data frames

From: January Weiner <>
Date: Thu 15 Dec 2005 - 22:55:40 EST

Hello again,

On 12/14/05, Thomas Lumley <> wrote:
> You want
> by(df[,-1], df$Day, function.that.means.each.column)

OK, slowly :-) I don't understand it.

(by the way, why does typeof(df) show "list"? I thought that read.table() returns a data frame?)
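
A quick check on the typeof() question, as a minimal sketch (the toy data frame below is made up for illustration): typeof() reports how an object is stored internally, and a data frame is stored as a list of columns, while class() and is.data.frame() report what kind of object it is.

```r
# Toy data frame, invented for this illustration only.
df <- data.frame(Day = c("Tue", "Wed"), val1 = c(1, 2))

typeof(df)         # "list": the internal storage mode
class(df)          # "data.frame"
is.data.frame(df)  # TRUE
```

So read.table() does return a data frame; typeof() is just answering a lower-level question.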

> so all you need to do is write function.that.means.each.column()
> In this case there is a built-in function, colMeans, so you don't even
> have to write it.

Hmmmmm, I tried it and it did not work. That is, it works - but not as intended :-).

Fake example:

> df <- data.frame(Day=c("Tue","Tue","Tue", "Wed", "Wed"), val1=seq(1,5), val2=3*seq(1,5))
> df
  Day val1 val2
1 Tue    1    3
2 Tue    2    6
3 Tue    3    9
4 Wed    4   12
5 Wed    5   15
> ddf <- by(df[,-1], df$Day, colMeans)
> ddf
df$Day: Tue
val1 val2
   2    6

df$Day: Wed
val1 val2
 4.5 13.5
> ddf$Day
NULL
> ddf$val1
NULL

In real data, instead of "days", I have around 6000 items, so I need them to be in one column called "Days" (or whatever).

OK. So correct me if I misunderstand what is happening here:

by() divides df into data frame subsets and applies a function (colMeans) to each of them. As for the result of colMeans, the manual says that it returns the following:

     A numeric or complex array of suitable size, or a vector if the
     result is one-dimensional.  The 'dimnames' (or 'names' for a
     vector result) are taken from the original array.

...which doesn't tell me much. typeof(colMeans(...)) tells me "double" but I think it lies. OK, let's assume it is a vector (it should be: I assume the result is one-dimensional, as I can hardly imagine a multidimensional result).
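
For what it is worth, that assumption can be checked directly. A minimal sketch, using a made-up two-column data frame: typeof() only reports the storage mode of the values, while is.vector() and names() show that the result is a named numeric vector.

```r
# Toy input; the column names are invented for illustration.
x <- data.frame(val1 = c(1, 2, 3), val2 = c(3, 6, 9))
m <- colMeans(x)

typeof(m)     # "double": storage mode of the values, not the shape
is.vector(m)  # TRUE
names(m)      # "val1" "val2"
m[["val2"]]   # 6
```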

So in the end I have a list with as many elements as I have days, and each element is a vector with N named entries, where N is the number of variables in the original data frame bar one. But what I would like to have is a data frame with exactly the same column names, and rows being just a summary. And no clue how to convert one into the other :-)
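
One possible way to do that conversion, sketched on the fake example above (the names m and res are just illustrative): bind the per-day vectors into a matrix with do.call(rbind, ...), then put the group labels, taken from the dimnames of the by() result, into a Day column.

```r
df <- data.frame(Day = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = seq(1, 5), val2 = 3 * seq(1, 5))
ddf <- by(df[, -1], df$Day, colMeans)

# Stack the per-day named vectors into a matrix, one row per day.
m <- do.call(rbind, ddf)

# Put the day labels in an ordinary column next to the means.
res <- data.frame(Day = dimnames(ddf)[[1]], m)
rownames(res) <- NULL
res
```

This assumes every group returns a vector of the same length with the same names, which holds for colMeans on a fixed set of columns.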

> More generally (eg the approach would work for medians as well)
> by(df[,1], df$Day, function(today) apply(today, 2, mean))

Huh? Why is it df[,1] now? I think I'm completely lost.
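
A small sketch of the difference between the two indexing forms, on the fake example. Presumably df[,-1] was intended in the quoted line; that is an assumption, not something the quoted message confirms. df[,1] selects only the first column (Day) and drops it to a plain vector, while df[,-1] is everything except the first column. Using median here is only to illustrate the "would work for medians as well" remark.

```r
df <- data.frame(Day = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = seq(1, 5), val2 = 3 * seq(1, 5))

is.data.frame(df[, 1])  # FALSE: a single column is dropped to a vector
names(df[, -1])         # "val1" "val2": all columns except the first

# Per-day column medians, assuming df[, -1] is what was meant.
med <- by(df[, -1], df$Day, function(today) apply(today, 2, median))
med[[1]]  # medians for the first group (Tue)
```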

> Finally, you could just use aggregate().

Probably, yes. As soon as I figure out how to use it, that is :-) (An hour later: OK, I got it! Yippee!) However, what I really needed was something like this:

ddf <- by(df[,-1], df$Day, function(z) cor(z$val1, z$val2))

(but I still don't know how to convert it to a friendly data frame...)
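
A minimal sketch of one way to get there, on the fake example (the column name cor and the object name res are arbitrary choices): when the function returns a single number per group, the by() result is a one-dimensional array, so the values and the group labels can be pulled out separately and combined.

```r
df <- data.frame(Day = c("Tue", "Tue", "Tue", "Wed", "Wed"),
                 val1 = seq(1, 5), val2 = 3 * seq(1, 5))
ddf <- by(df[, -1], df$Day, function(z) cor(z$val1, z$val2))

# One row per day: the label from the dimnames, the value from the array.
res <- data.frame(Day = dimnames(ddf)[[1]], cor = as.vector(ddf))
res
```

In the fake data val2 is exactly 3 * val1, so both correlations come out as 1; real data would of course give something less boring.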

Thanks for the answers!


------------ January Weiner 3  ---------------------+---------------
Division of Bioinformatics, University of Muenster  |  Schloßplatz 4
(+49)(251)8321634                                   |  D48149 Münster
                                                    |  Germany

Received on Thu Dec 15 23:16:57 2005
