From: Adam D. I. Kramer <adik-rhelp_at_ilovebacon.org>

Date: Sat, 03 May 2008 13:46:33 -0700 (PDT)

R-help_at_r-project.org mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Sat 03 May 2008 - 20:50:38 GMT


Thanks very much...it is exactly what I needed, and I'm a bit embarrassed that I couldn't find it on my own.

One might consider adding "aggregate" to the "See also:" lines of by and tapply. That would have prevented me from needing to email the list (which I may have accidentally done twice; I apologize for that).

--Adam
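
For the archive, here is a minimal sketch of how the aggregate call from the reply quoted below can return the four-column data frame described in the original question. Naming the elements of the grouping list yields columns var1, var2 and id rather than Group.1 to Group.3. The fixed seed, the object name agg, and the renaming of the x column to mscore are assumptions added here for illustration; they are not part of the original reply.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Same layout as the data set in the original question
df <- data.frame(
  var1  = factor(rep(rep(1:5, 25 * 5), 10)),
  var2  = factor(rep(rep(1:5, each = 25 * 5), 10)),
  trial = rep(rep(1:25, 25), 10),
  id    = factor(rep(1:10, each = 5 * 5 * 25)),
  score = rnorm(n = 5 * 5 * 25 * 10)
)

# Named list elements become the grouping column names in the result
agg <- aggregate(df$score,
                 by  = list(var1 = df$var1, var2 = df$var2, id = df$id),
                 FUN = mean, na.rm = TRUE)

# aggregate() calls the summarised column "x"; rename it to mscore
names(agg)[names(agg) == "x"] <- "mscore"
head(agg)
```

The result has one row per var1/var2/id combination, so it could be passed on to something like aov(mscore ~ var1*var2 + Error(id/(var1*var2)), data=agg).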

On Sat, 3 May 2008, jim holtman wrote:

> ?aggregate

>
> > aggregate(df$score, list(df$var1, df$var2, df$id), mean, na.rm=TRUE)
>    Group.1 Group.2 Group.3             x
> 1        1       1       1  0.1053576980
> 2        2       1       1  0.1514888520
> 3        3       1       1  0.1270477403
> 4        4       1       1 -0.0193129404
> 5        5       1       1  0.2574346931
> 6        1       2       1  0.0185013523
> 7        2       2       1 -0.0886420632
> 8        3       2       1 -0.1304342272
> 9        4       2       1 -0.0972963702
> 10       5       2       1 -0.1463502593
>
> On Fri, May 2, 2008 at 6:43 PM, Adam D. I. Kramer <adik_at_ilovebacon.org> wrote:
>> Dear Colleagues,
>>
>> Apologies for a long email to ask what I feel may be a very simple
>> question; I figure it's better to overspecify my situation.
>>
>> I was asked a question, recently, by a colleague in my department
>> about pre-aggregating variables, i.e., computing the mean of defined subsets
>> of a data frame. Naturally, I thought of the 'by' and 'tapply' functions, as
>> they have always been the solution for me. However, my colleague had three
>> indices, and as such needs to pay attention to the indices of the
>> output...this is to say, the "create an array" function of tapply doesn't
>> quite work because an array is not quite what we want.
>>
>> Consider this data set:
>>
>> df <- data.frame(var1=  factor(rep(rep(1:5,25*5),10)),
>>                  var2=  factor(rep(rep(1:5,each=25*5),10)),
>>                  trial= rep(rep(1:25,25),10),
>>                  id=    factor(rep(1:10,each=5*5*25)),
>>                  score= rnorm(n=5*5*25*10) )
>>
>> ...this is to say, each of 10 ids has scores for 5 different levels of
>> var1 and 5 different levels of var2...across 25 trials. Basically, a
>> three-way crossed repeated measures design...where tapply does what I want
>> for a two-way design, it does not quite suit my purposes for a 3-way or
>> n-way for n > 2.
>>
>> The goal is to predict score from var1 and var2. The straightforward guess
>> of what to do would be to simply have the AOV function aggregate across
>> trials:
>>
>> aov(score ~ var1*var2 + Error(id/(var1*var2)), data=df)
>>
>> (or lm with defined contrasts)
>>
>> ...however, there are missing data on some trials for some people, which
>> makes this design unbalanced (i.e., it introduces a correlation between var1
>> and var2). Because my colleague knows (from a theoretical standpoint) that
>> he wants to analyze the mean, his ANOVA on the aggregated trial means WOULD
>> be balanced, which is to say, the analysis he wants to run would produce
>> different output from the above.
>>
>> So, what he needs is a data frame with four variables instead of five: var1,
>> var2, id, and mscore (mean score), which has been averaged across trials.
>>
>> Clearly (to me, it seems), the way to do this is with tapply:
>>
>> x <- tapply(df$score, list(df$var1,df$var2,df$id), mean, na.rm=TRUE)
>>
>> ...which returns a var1*var2 matrix for each ID, when what I want is an
>> observation-per-row data frame.
>>
>> So, my question: How do I end up with what I'm looking for?
>>
>> My current process involves setting df2 <- data.frame(mscore=c(x), ...)
>> where ... is a bunch of factor(rep) columns that would specify the var1 var2
>> and id levels. My problem with this approach is that it seems like a hack;
>> it is not a general solution because I must use knowledge of the process by
>> which x was generated in order to "get it right," and there's a decent
>> amount of room for unnoticed error on my part.
>>
>> I suppose what I'm looking for is either a way to take by or tapply and have
>> it return a set of index variable columns based on the list of indices I
>> provide to it...or a way to collapse an n-way table into a single data frame
>> with index variables. Any suggestions?
>>
>> Cordially,
>>
>> Adam D. I. Kramer
>> Ph.D. Candidate, Social Psychology
>> University of Oregon
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
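
The original question also asked for a way to collapse an n-way table into a single data frame with index variables. A minimal sketch of one way this might be done in base R: if the tapply index list is named, the resulting array carries named dimnames, and as.data.frame() applied to as.table() of that array produces one row per cell with one factor column per dimension. The fixed seed, the injected missing values, and the names x, long and mscore are illustrative assumptions, not part of the thread.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Same layout as the data set in the original question
df <- data.frame(
  var1  = factor(rep(rep(1:5, 25 * 5), 10)),
  var2  = factor(rep(rep(1:5, each = 25 * 5), 10)),
  trial = rep(rep(1:25, 25), 10),
  id    = factor(rep(1:10, each = 5 * 5 * 25)),
  score = rnorm(n = 5 * 5 * 25 * 10)
)

# Assumption: knock out a few scores to mimic the missing trials
df$score[sample(nrow(df), 50)] <- NA

# Three-way array of cell means; naming the index list names the dimensions
x <- tapply(df$score,
            list(var1 = df$var1, var2 = df$var2, id = df$id),
            mean, na.rm = TRUE)

# Collapse the array to one row per cell, with index columns taken
# from the dimnames rather than rebuilt by hand with rep()
long <- as.data.frame(as.table(x), responseName = "mscore")
head(long)
```

Because the index columns come from the array's own dimnames, this avoids relying on knowledge of how x was generated, which was the concern with the data.frame(mscore=c(x), ...) approach.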


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Sun 04 May 2008 - 00:30:34 GMT.
