Re: [R] slow computation of mixed ANOVA using aov

From: Peter Dalgaard <p.dalgaard_at_biostat.ku.dk>
Date: Sat 19 Mar 2005 - 21:19:19 EST

"Steven Lacey" <slacey@umich.edu> writes:

> Dear R-help list,
>
> I am trying to do a mixed ANOVA on a 8960 x 5 dataframe. I have 3 factors
> for which I want to test all main effects and interactions : f1 (40 levels),
> f2 (7 levels), and f3 (4 levels). I also have a subject factor, subject, and
> a dependent variable, dv.
>
> Some more information about the factors:
> f2 is a between-subject factor. That is, for each level of f2 there are 8
> nested levels of the subject factor. For example, levels 1-8 of subject are
> nested in level 1 of f2. Levels 9-16 of subject are nested in level 2 of f2.
> In other words, the subjects that participated in any level of f2 are
> different from the subjects that participated in any other level of f2.
>
> In contrast, f1 and f3 are within-subject factors. That is, for any one of
> the 56 subjects, we have a 160 medians corresponding to each condition from
> a complete crossing of factors f1 and f2. While it is true that we do have
> replicate observations for any subject in each of these conditions, we take
> the median of those values and operate as if there is only a single
> observation for each subject in each of the 160 within-subject conditions.
>
> Below is code that will generate dataframe d, which is just like the one I
> am working with:
>
> f1<-gl(40,1,8960,ordered=T)
> f2<-gl(7,1280,8960,ordered=T)
> f3<-gl(4,40,8960,ordered=T)
> subject<-gl(56,160,8960,ordered=T)
> dv<-rnorm(8960,mean=500,sd=50)
> d <- data.frame(f1,f2,f3,f4,dv)
>
> To run the mixed ANOVA I use the following call (modeled after J. Baron):
> aov(dv~f1*f2*f3+Error(subject/(f1*f2)),data=d)

[You mean subject/(f1*f3), right? "f2" is a coarsening of "subject"]

> WARNING: Exert caution when running the aov command. I have run the exact
> same command on Windows and Unix machines (each with 1GB of RAM; allocated
> up to 3 or 4GB of memory for the analysis ) and it has taken many, many
> hours to finish. That said, this is not a new problem posted on the R-help
> list. There are several posts where analysts have run into similar problems.
> My general impression of these posts, and correct me if I am wrong, is that
> because aov is a wrapper around lm, the extra time is required to build and
> manipulate a design matrix (via qr decomposition) that is 8960 x several
> thousand columns large! Is that so? It seems fitting because if I call aov
> with only a single factor, then it returns in a few seconds.

Yes, this is basically correct. The main issue is the calculation of the projection onto the terms in the Error model, which is done using the generic lm algorithm even though the design is typicaly orthogonal so that there are much faster ways to get to the result.

To paraphrase: if you have double determinations and an error term at the level of each pair, the algorithm fits a model with N/2 parameters in order to calculate the mean and difference within each pair. For large designs, this becomes an issue...

This is in some sense a tradeoff of generality for speed, but the results for non-orthogonal designs are typically very hard to interpret.

The topic does come up off and on, and we have been considering the option of adding an algorithm where it is assumed that the Error model consists of mutually orthogonal and balanced terms (in the sense that all factor levels appear equally frequently). Just a "simple matter of programming"...

For the near term, there are a couple of things that you can do:

> In order to test if the computational slowness was something unique to R, I
> ran the same analysis, including all 3 factors, in SPSS. To my surprise SPSS
> returned almost instantaneously. I do not know much about the algorithm in
> SPSS, but I suspect it may be calculating condition means and sums of
> squares rather than generating a design matrix. Does that sound plausible?
>
> At this point I am a dedicated R user. However, I do the kind of analysis
> described above quite often. It is important that my statistical package be
> able to handle it more efficiently than what I have been able to get R to do
> at this point. Am I doing anything obviously wrong? Is there a method in R
> that more closely resembles the algorithm used in SPSS? If not, are there
> any other methods R has to do these kind of analyses? Could I split up the
> analysis in some principled way to ease the processing demand on R?
>
> Thanks in advanvce for any further insight into this issue,
> Steve Lacey
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Sat Mar 19 21:33:02 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:30:51 EST