[R] Coercing by/tapply to data.frame for more than two indices?

From: Adam D. I. Kramer <adik_at_ilovebacon.org>
Date: Fri, 02 May 2008 15:43:00 -0700 (PDT)

Dear Colleagues,

         Apologies for a long email to ask what I feel may be a very simple question; I figure it's better to overspecify my situation.

         I was recently asked by a colleague in my department about pre-aggregating variables, i.e., computing the mean of defined subsets of a data frame. Naturally, I thought of the 'by' and 'tapply' functions, as they have always been the solution for me. However, my colleague has three indices, and so needs to keep track of the index values in the output. That is to say, tapply's "create an array" behavior doesn't quite work, because an array is not quite what we want.

         Consider this data set:

df <- data.frame(var1= factor(rep(rep(1:5,25*5),10)),
                  var2= factor(rep(rep(1:5,each=25*5),10)),
                 trial= rep(rep(1:25,each=5),5*10),  # each=5 so that trial is crossed with var1
                    id= factor(rep(1:10,each=5*5*25)),
                 score= rnorm(n=5*5*25*10) )

...this is to say, each of 10 ids has scores for 5 different levels of var1 and 5 different levels of var2, across 25 trials. Basically, a three-way crossed repeated-measures design. tapply does what I want for a two-way design, but it does not quite suit my purposes for a three-way design, or an n-way design for n > 2.

The goal is to predict score from var1 and var2. The straightforward guess of what to do would be to simply have the aov function aggregate across trials:

aov(score ~ var1*var2 + Error(id/(var1*var2)), data=df)

(or lm with defined contrasts)

...however, there are missing data on some trials for some people, which makes this design unbalanced (i.e., it introduces a correlation between var1 and var2). Because my colleague knows (from a theoretical standpoint) that he wants to analyze the mean, his ANOVA on the aggregated trial means WOULD be balanced; that is to say, the analysis he wants to run would produce different output from the above.
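
A minimal sketch of the imbalance, on a smaller made-up data set rather than the real one (the object names d, n_obs, and cell_means, the 2 x 2 x 3 layout, and the rows set to NA are all assumptions chosen for illustration). Missing trials leave unequal numbers of observations per condition, while the per-id cell means still give exactly one value per cell:

```r
# Small illustrative design: 2 x 2 conditions, 3 ids, 4 trials per cell.
set.seed(1)
d <- expand.grid(trial = 1:4, var1 = factor(1:2), var2 = factor(1:2),
                 id = factor(1:3))
d$score <- rnorm(nrow(d))

# Knock out a few scores to mimic missing trials (arbitrary rows).
d$score[c(1, 2, 7, 20, 21, 22)] <- NA

# Non-missing observations per var1 x var2 condition: no longer equal.
n_obs <- tapply(!is.na(d$score), list(d$var1, d$var2), sum)
print(n_obs)

# Means per var1 x var2 x id cell: still one value for every cell.
cell_means <- tapply(d$score, list(d$var1, d$var2, d$id), mean, na.rm = TRUE)
print(dim(cell_means))
```

This only holds as long as every cell keeps at least one non-missing trial; a cell with no usable scores would come out as NaN.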

So, what he needs is a data frame with four variables instead of five: var1, var2, id, and mscore (mean score), which has been averaged across trials.

Clearly (to me, it seems), the way to do this is with tapply:

x <- tapply(df$score, list(df$var1,df$var2,df$id), mean, na.rm=TRUE)

...which returns a three-dimensional array (a var1*var2 matrix for each id), when what I want is an observation-per-row data frame.
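
To show the shape of that output, here is a short sketch on a reduced toy data set (d and its 2 x 3 x 2 layout are assumptions for illustration, not the data above). Naming the elements of the index list is optional, but the names are then carried into the dimnames of the result:

```r
set.seed(2)
d <- expand.grid(trial = 1:4, var1 = factor(1:2), var2 = factor(1:3),
                 id = factor(1:2))
d$score <- rnorm(nrow(d))

# Naming the list elements carries the names into dimnames(x).
x <- tapply(d$score, list(var1 = d$var1, var2 = d$var2, id = d$id),
            mean, na.rm = TRUE)

dim(x)               # one extent per index: var1 levels, var2 levels, ids
names(dimnames(x))   # the names given in the index list
x[, , "1"]           # the var1-by-var2 matrix of means for id "1"
```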

So, my question: How do I end up with what I'm looking for?

My current process involves setting df2 <- data.frame(mscore=c(x), ...), where ... is a bunch of factor(rep) columns that specify the var1, var2, and id levels. My problem with this approach is that it seems like a hack: it is not a general solution, because I must use knowledge of the process by which x was generated in order to "get it right," and there's a decent amount of room for unnoticed error on my part.
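
For concreteness, the manual process looks roughly like the sketch below, again on a toy data set (d, lv, n1, n2, n3, and the spot-check row k are hypothetical names introduced here). The fragile part is that c(x) unrolls the array with the first index varying fastest, so every rep() call has to reproduce exactly that ordering:

```r
set.seed(3)
d <- expand.grid(trial = 1:4, var1 = factor(1:2), var2 = factor(1:3),
                 id = factor(1:2))
d$score <- rnorm(nrow(d))

x <- tapply(d$score, list(d$var1, d$var2, d$id), mean, na.rm = TRUE)

# c(x) unrolls the array with the FIRST index varying fastest, so the
# index columns must be built with exactly that ordering.
lv <- dimnames(x)                 # the levels tapply actually used
n1 <- length(lv[[1]]); n2 <- length(lv[[2]]); n3 <- length(lv[[3]])

df2 <- data.frame(
  var1   = factor(rep(lv[[1]], times = n2 * n3)),
  var2   = factor(rep(rep(lv[[2]], each = n1), times = n3)),
  id     = factor(rep(lv[[3]], each = n1 * n2)),
  mscore = c(x)
)

# Spot check one row against a direct computation from the raw data.
k <- 5
sel <- d$var1 == as.character(df2$var1[k]) &
       d$var2 == as.character(df2$var2[k]) &
       d$id   == as.character(df2$id[k])
all.equal(df2$mscore[k], mean(d$score[sel]))
```

Swapping "times" and "each" in any one column would still produce a data frame of the right size, which is exactly the kind of unnoticed error I am worried about.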

I suppose what I'm looking for is either a way to take by or tapply and have it return a set of index variable columns based on the list of indices I provide to it...or a way to collapse an n-way table into a single data frame with index variables. Any suggestions?
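
Two base R routes that look like candidates, sketched on a toy data set (d, long, and agg are names made up for this illustration, and I have not checked either route against the real data): as.data.frame() applied to the array after as.table(), and aggregate() applied directly to the data frame.

```r
set.seed(4)
d <- expand.grid(trial = 1:4, var1 = factor(1:2), var2 = factor(1:3),
                 id = factor(1:2))
d$score <- rnorm(nrow(d))
d$score[c(3, 10)] <- NA   # a couple of missing trials (arbitrary rows)

# Option 1: name the indices in tapply, then unroll the array via as.table().
x <- tapply(d$score, list(var1 = d$var1, var2 = d$var2, id = d$id),
            mean, na.rm = TRUE)
long <- as.data.frame(as.table(x), responseName = "mscore")

# Option 2: skip the array and aggregate the data frame directly.
# The formula interface drops rows with a missing score by default.
agg <- aggregate(score ~ var1 + var2 + id, data = d, FUN = mean)
names(agg)[names(agg) == "score"] <- "mscore"

head(long)
head(agg)
```

One difference to keep in mind: if some var1/var2/id cell has no non-missing scores at all, option 1 keeps that row with mscore NaN, while option 2 omits the row entirely.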


Adam D. I. Kramer
Ph.D. Candidate, Social Psychology
University of Oregon

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Sat 03 May 2008 - 04:58:10 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 03 May 2008 - 05:30:34 GMT.
