# Re: [R] Lack of independence in anova()

From: Douglas Bates <dmbates_at_gmail.com>
Date: Tue 05 Jul 2005 - 06:44:19 EST

I wrote up a note on how to do a more efficient simulation of the p-values from anova(), then discovered to my surprise that the chi-square test for independence of the significance of the F-tests indicated that they were not independent. I was stumped by this, but fortunately Thomas Lumley came to my rescue with an explanation: there is no reason why the results of the F-tests should be independent. The numerators are independent, but the denominator is the same for both tests. When, due to random variation, the denominator is small, the p-values for both tests will tend to be small. If, instead of F-tests, you use chi-square tests, then you do see independence.
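A minimal sketch (my illustration, not part of the original note) of why a shared denominator induces dependence: build two F statistics from independent chi-square numerators over one common chi-square denominator and tabulate joint significance.

```r
## Two F-tests that share a denominator are dependent even though their
## numerators are independent: a small residual SS inflates both F
## statistics at once, so small p-values tend to occur together.
set.seed(1)
n    <- 10000
num1 <- rchisq(n, 1)        # numerator SS for one test (1 df)
num2 <- rchisq(n, 2)        # numerator SS for the other test (2 df)
den  <- rchisq(n, 12)       # residual SS (12 df), shared by both tests
p1   <- pf((num1/1)/(den/12), 1, 12, lower.tail = FALSE)
p2   <- pf((num2/2)/(den/12), 2, 12, lower.tail = FALSE)
table(p1 < 0.16, p2 < 0.16)              # positive association
chisq.test(table(p1 < 0.16, p2 < 0.16))  # strongly rejects independence
```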

Here is the note on the simulation.

There are several things that could be done to speed the simulation of the p-values of an anova like this under the null distribution.

If you examine the structure of a fitted lm object (use the function str()) you will see that there are components called `qr', `effects' and `assign'. You can verify by experimentation that `qr' and `assign' depend only on the experimental design. Furthermore, the `effects' vector can be reproduced as qr.qty(qrstr, y), where qrstr is the `qr' component of the fitted model.
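For instance, a small check of these claims (my sketch, using the same design as below):

```r
## `qr' and `assign' depend only on the design; `effects' is Q'y,
## recomputable with qr.qty().
set.seed(42)
vv <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18))
y  <- rnorm(18)
f1 <- lm(y ~ c*r, vv)
f2 <- lm(rnorm(18) ~ c*r, vv)   # different response, identical design
stopifnot(identical(f1$assign, f2$assign))          # design only
stopifnot(isTRUE(all.equal(f1$qr$qr, f2$qr$qr)))    # design only
stopifnot(isTRUE(all.equal(unname(qr.qty(f1$qr, y)),
                           unname(f1$effects))))    # effects reproduced
```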

> set.seed(1234321)
> vv <- data.frame(c = gl(3,3,18), r = gl(2,9,18))
> vv

```
   c r
1  1 1
2  1 1
3  1 1
4  2 1
5  2 1
6  2 1
7  3 1
8  3 1
9  3 1
10 1 2
11 1 2
12 1 2
13 2 2
14 2 2
15 2 2
16 3 2
17 3 2
18 3 2
```
> fm1 <- lm(rnorm(18) ~ c*r, vv)
> fm1$assign

 0 1 1 2 3 3
> asgn <- c(fm1$assign, rep(4, 12))
> system.time(res <- replicate(10000, tapply(qr.qty(fm1$qr, rnorm(18))^2, asgn, sum)))
 20.61 0.01 20.61 0.00 0.00
> res[,1:6]

```
0 0.4783121 0.3048634 0.713689 0.6937838 0.03649023  2.63392426
1 0.5825213 1.4756395 1.127018 0.5209751 1.18697199  3.32972093
2 0.2612723 3.6396106 0.547506 1.1641910 0.37843963  0.03411672
3 2.6259806 3.5504584 1.645215 0.1197238 0.85361018  4.53895212
4 9.1942755 8.2122693 4.863392 5.4413571 2.03715439 22.94815118

```

The rows of that array correspond to the sums of squares for the Intercept (which we generally ignore), the factor 'c', the factor 'r', their interaction, and the residuals.
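As a sanity check (a self-contained sketch I am adding, with the same design), the five components partition the total uncorrected sum of squares, because multiplying by Q' preserves the squared length of y:

```r
## Orthogonal Q preserves the norm, so the grouped SS add to sum(y^2).
set.seed(42)
vv   <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18))
fm1  <- lm(rnorm(18) ~ c*r, vv)
asgn <- c(fm1$assign, rep(4, 12))
y    <- rnorm(18)
ss   <- tapply(qr.qty(fm1$qr, y)^2, asgn, sum)
stopifnot(isTRUE(all.equal(sum(ss), sum(y^2))))
```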

As you can see, that took about 21 seconds on my system, which I expect is a good bit faster than your simulation.

Because I set the seed to a known value I can reproduce the results for the first few simulations to check that the sums of squares are correct. Remember that the original fit (fm1) is not included in the table.

> set.seed(1234321)
> fm1 <- lm(rnorm(18) ~ c*r, vv)
> anova(fm2 <- lm(rnorm(18) ~ c*r, vv))
Analysis of Variance Table

Response: rnorm(18)

```
          Df Sum Sq Mean Sq F value Pr(>F)
c          2 0.5825  0.2913  0.3801 0.6917
r          1 0.2613  0.2613  0.3410 0.5701
c:r        2 2.6260  1.3130  1.7137 0.2215
Residuals 12 9.1943  0.7662

```

You can continue this process if you wish further verification that the results correspond to the fitted models.
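For example, replaying the seed from scratch (a self-contained sketch), the second simulated column should match the anova table of a third fit:

```r
## Replay the simulation stream: after resetting the seed, the first
## rnorm(18) reproduces fm1's response, and the next draws give the
## successive columns of res.
set.seed(1234321)
vv   <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18))
fm1  <- lm(rnorm(18) ~ c*r, vv)
asgn <- c(fm1$assign, rep(4, 12))
res  <- replicate(5, tapply(qr.qty(fm1$qr, rnorm(18))^2, asgn, sum))
set.seed(1234321)
fm1  <- lm(rnorm(18) ~ c*r, vv)   # consumes the first rnorm(18)
fm2  <- lm(rnorm(18) ~ c*r, vv)   # column 1 of res
fm3  <- lm(rnorm(18) ~ c*r, vv)   # column 2 of res
stopifnot(isTRUE(all.equal(unname(res[2:5, 2]),
                           anova(fm3)[["Sum Sq"]])))
```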

You can get the p-values for the F-tests as

```
> pvalsF <- data.frame(pc = pf((res[2,]/2)/(res[5,]/12), 2, 12, low = FALSE),
+                      pr = pf((res[3,]/1)/(res[5,]/12), 1, 12, low = FALSE),
+                      pint = pf((res[4,]/2)/(res[5,]/12), 2, 12, low = FALSE))

```

Again you can check this for the first few by hand.

> pvalsF[1:5,]

```
          pc         pr      pint
1 0.69171238 0.57006574 0.2214847
2 0.37102129 0.03975286 0.1158059
3 0.28634939 0.26771167 0.1740633
4 0.57775850 0.13506561 0.8775828
5 0.06363138 0.16124100 0.1224806
```

If you wish, you could then check the marginal distributions using, for example, an empirical density plot.

> library(lattice)
> densityplot(~ pc, pvalsF)
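A numeric complement to the plot (my sketch; under the null each marginal p-value should be uniform on (0,1)) is a Kolmogorov-Smirnov test:

```r
## Rebuild a modest simulation and test uniformity of the 'c' p-values.
set.seed(1234321)
vv   <- data.frame(c = gl(3, 3, 18), r = gl(2, 9, 18))
fm1  <- lm(rnorm(18) ~ c*r, vv)
asgn <- c(fm1$assign, rep(4, 12))
res  <- replicate(2000, tapply(qr.qty(fm1$qr, rnorm(18))^2, asgn, sum))
pc   <- pf((res[2,]/2)/(res[5,]/12), 2, 12, lower.tail = FALSE)
ks.test(pc, "punif")   # should not reject uniformity
```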

At this point I would recommend checking the joint distribution, but if you want to choose a specific level and check the contingency table, that could be done as

> xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF)

```
             I(pint < 0.16)
I(pr < 0.16)  FALSE TRUE
       FALSE   7204 1240
       TRUE    1215  341
```

The summary method for an xtabs object provides a test of independence:

> summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsF))
Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsF)
Number of cases in table: 10000
Number of factors: 2
Test for independence of all factors:

for which you can see the puzzling result: the test rejects independence. However, if you use the chi-square test based on the known residual variance of 1, you get

```
> pvalsC <- data.frame(pc = pchisq(res[2,], 2, low = FALSE),
+                      pr = pchisq(res[3,], 1, low = FALSE),
+                      pint = pchisq(res[4,], 2, low = FALSE))

> pvalsC[1:5,]

         pc         pr      pint
1 0.7473209 0.60924741 0.2690144
2 0.4781553 0.05642013 0.1694446
3 0.5692081 0.45933855 0.4392846
4 0.7706757 0.28059805 0.9418946
5 0.5523983 0.53843951 0.6525907
```
> xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC)

```
             I(pint < 0.16)
I(pr < 0.16)  FALSE TRUE
       FALSE   7121 1319
       TRUE    1324  236
```

> summary(xtabs(~ I(pr < 0.16) + I(pint < 0.16), pvalsC))
Call: xtabs(formula = ~I(pr < 0.16) + I(pint < 0.16), data = pvalsC)
Number of cases in table: 10000
Number of factors: 2
Test for independence of all factors:

On 7/4/05, Phillip Good <pigood@verizon.net> wrote:
> To my surprise, the R functions I employed did NOT verify the independence
> property. I've no quarrel with the theory--it's the R functions that worry
> me.
>
> pG
>
> -----Original Message-----
> From: Douglas Bates [mailto:dmbates@gmail.com]
> Sent: Monday, July 04, 2005 9:13 AM
> To: pigood@verizon.net; r-help
> Subject: Re: [R] Lack of independence in anova()
>
>
> I have already had email exchanges off-list with Phillip Good pointing
> out that the independence property that he wishes to establish by
> simulation is a consequence of orthogonality of the column span of the
> row contrasts and the interaction contrasts. If you set the contrasts
> option to a set of orthogonal contrasts such as contr.helmert or
> contr.sum, which has no effect on the results of the anova, this is
> easily established.
>
> > build
> function(size, v = rnorm(sum(size))) {
> col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
> rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
> row=c(rep(0,size[1]+size[2]+size[3]+size[4]),
> rep(1,size[5]+size[6]+size[7]+size[8]))
> return(data.frame(c=factor(col), r=factor(row),yield=v))
> }
> > size
>  3 3 3 0 3 3 3 0
> > set.seed(1234321)
> > vv <- build(size)
> > vv
> c r yield
> 1 0 0 1.2369081
> 2 0 0 1.5616230
> 3 0 0 1.8396185
> 4 1 0 0.3173245
> 5 1 0 1.0715115
> 6 1 0 -1.1459955
> 7 2 0 0.2488894
> 8 2 0 0.1158625
> 9 2 0 2.6200816
> 10 0 1 1.2624048
> 11 0 1 -0.9862578
> 12 0 1 -0.3235653
> 13 1 1 0.2039706
> 14 1 1 -1.4574576
> 15 1 1 1.9158713
> 16 2 1 -2.0333909
> 17 2 1 1.0050272
> 18 2 1 0.6789184
> > options(contrasts = c('contr.helmert', 'contr.poly'))
> > crossprod(model.matrix(~c*r, vv))
> (Intercept) c1 c2 r1 c1:r1 c2:r1
> (Intercept) 18 0 0 0 0 0
> c1 0 12 0 0 0 0
> c2 0 0 36 0 0 0
> r1 0 0 0 18 0 0
> c1:r1 0 0 0 0 12 0
> c2:r1 0 0 0 0 0 36
>
>
> On 7/4/05, Phillip Good <pigood@verizon.net> wrote:
> > If the observations are normally distributed and the 2xk design is
> > balanced, theory requires that the tests for interaction and row effects be
> > independent. In my program, appended below, this would translate to cntT
> > (approx)= cntR*cntI/N if all R routines were functioning correctly. They
> > aren't.
> >
> > sim2=function(size,N,p){
> > cntR=0
> > cntC=0
> > cntI=0
> > cntT=0
> > cntP=0
> > for(i in 1:N){
> > #generate data
> > v=gendata(size)
> > #analyze after build(ing) design containing data
> > lm.out=lm(yield~c*r,build(size,v))
> > av.out=anova(lm.out)
> > #if column effect is significant, increment cntC
> > if (av.out[1,5]<=p)cntC=cntC+1
> > #if row effect is significant, increment cntR
> > if (av.out[2,5]<=p){
> > cntR=cntR+1
> > tmp = 1
> > }
> > else tmp =0
> > if (av.out[3,5]<=p){
> > #if interaction is significant, increment cntI
> > cntI=cntI+1
> > #if both interaction and row effect are significant, increment cntT
> > cntT=cntT + tmp
> > }
> > }
> > list(cntC=cntC, cntR=cntR, cntI=cntI, cntT=cntT)
> > }
> >
> > build=function(size,v){
> > #size is a vector containing the sample sizes
> > col=c(rep(0,size[1]),rep(1,size[2]),rep(2,size[3]),rep(3,size[4]),
> > rep(0,size[5]),rep(1,size[6]),rep(2,size[7]),rep(3,size[8]))
> > row=c(rep(0,size[1]+size[2]+size[3]+size[4]),
> > rep(1,size[5]+size[6]+size[7]+size[8]))
> > return(data.frame(c=factor(col), r=factor(row),yield=v))
> > }
> >
> > gendata=function(size){
> > ssize=sum(size);
> > return (rnorm(ssize))
> > }
> >
> > #Example
> > size=c(3,3,3,0,3,3,3,0)
> > sim2(size,10000,.16)
> >
> >
> >
> > Phillip Good
> > Huntington Beach CA
> >
> >
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > http://www.R-project.org/posting-guide.html
>