[R] chisq test and fisher exact test

From: Weiwei Shi <helprhelp_at_gmail.com>
Date: Thu 23 Jun 2005 - 01:30:06 EST

I have a text mining project and am currently working on the feature generation/selection part.
My plan is to select a set of words or word combinations that discriminate better than other words between the group ids (2 classes in this case) for a dataset of 2,000,000 documents.

One approach is to use "contrast-set association rule mining"; the other is to use the chi-squared test or Fisher's exact test.

Here is an example with 3 contingency tables for 3 words (words are coded by number):
> tab[,,1:3]

, , 1

      [,1]    [,2]
[1,] 11266 2151526
[2,]   125   31734

, , 2

      [,1]    [,2]
[1,] 43571 2119221
[2,]    52   31807

, , 3

     [,1]    [,2]
[1,]  427 2162365
[2,]    5   31854

I have some questions on this:
1. What is the rule of thumb for using the chi-squared test instead of Fisher's exact test? I have a vague memory that the count in each cell needs to be over 50 for the chi-squared test to be used instead of Fisher's exact test. For word 3, I think I should use Fisher's test. However, running chisq.test as below completes without complaint:
> tab[,,3]

     [,1]    [,2]
[1,]  427 2162365
[2,]    5   31854
> chisq.test(tab[,,3])

        Pearson's Chi-squared test with Yates' continuity correction

data: tab[, , 3]
X-squared = 0.0963, df = 1, p-value = 0.7564
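
For reference, a minimal sketch of how the chi-squared approximation is usually checked for word 3. The commonly cited rule of thumb concerns the expected counts under independence (roughly 5 or more in each cell), not the observed counts. The name tab3 is introduced here only for illustration; the counts are copied from the table above.

```r
# Counts for word 3, copied from the table above (tab3 is an illustrative name).
# matrix() fills column by column, so this reproduces the printed layout.
tab3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(tab3), colSums(tab3)) / sum(tab3)
print(expected)

# chisq.test() stores the same matrix in its $expected component.
res <- suppressWarnings(chisq.test(tab3))
print(all.equal(unname(res$expected), unname(expected)))

# Smallest expected count; values below about 5 are what the rule of thumb flags.
print(min(expected))
```

Here the smallest expected count is a little above 6, even though one observed cell is only 5.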

but running it on the whole set of words (14240 words) gives the following warnings:
> p.chisq<-as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
There were 50 or more warnings (use warnings() to see the first 50)
> warnings()

Warning messages:

1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])

2. So my second question is: does this warning appear because I am violating the assumptions of the chi-squared test? If so, why is word 3 fine? And how can I trace the warnings to see which words caused them?
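
One possible way to trace the warnings, sketched under stated assumptions: wrap each call in tryCatch() and record which indices raise a warning. As far as I know, chisq.test() issues this warning when any expected count is below 5, so the same set can also be found from the expected counts directly. In the sketch, slices 1 and 2 hold the counts for words 1 and 3 from the post; slice 3 is a HYPOTHETICAL rare word, invented only so that one table triggers the warning. The helper name warns is also made up.

```r
# Slices 1-2: counts for words 1 and 3 from the post.
# Slice 3: HYPOTHETICAL rare word, added only to trigger the warning.
tab <- array(c(11266, 125, 2151526, 31734,
               427,   5,   2162365, 31854,
               40,    1,   2162752, 31858),
             dim = c(2, 2, 3))

# Run chisq.test() on slice i and report whether it raised a warning.
warns <- function(i) {
  tryCatch({
    chisq.test(tab[, , i])
    FALSE
  }, warning = function(w) TRUE)
}

flagged <- which(vapply(seq_len(dim(tab)[3]), warns, logical(1)))
print(flagged)

# Alternative: smallest expected count per slice, computed without chisq.test().
min_expected <- apply(tab, 3, function(m) {
  min(outer(rowSums(m), colSums(m)) / sum(m))
})
print(which(min_expected < 5))
```

Both approaches should flag only the hypothetical third slice; on the real data, the flagged indices can then be mapped back to words.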

3. My result looks like this (after mapping the number ids back to words; some words are stemmed, e.g. ACCID is accident):
> of[1:50,]

      map...2.      p.fisher
21       ACCID  0.000000e+00
30          CD  0.000000e+00
67        ROCK  0.000000e+00
104      CRACK  0.000000e+00
111       CHIP  0.000000e+00
179      GLASS  0.000000e+00
84        BACK 4.199878e-291
395   DRIVEABL 5.335989e-287
60         CAP 9.405235e-285
262 WINDSHIELD 2.691641e-254
13          IV 3.905186e-245
110         HZ 2.819713e-210
11        CAMP 9.086768e-207
2      SHATTER 5.273994e-202
297        ALP 1.678521e-177
162        BED 1.822031e-173
249        BCD 1.398391e-160
493       RACK 4.178617e-156
59        CAUS 7.539031e-147

3.1 Question: should I use a two-sided test instead of a one-sided one for Fisher's test? I have read some material that suggests using the two-sided test.
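
To make the choice concrete, here is a small sketch of how the sidedness is selected through the alternative argument of fisher.test(). It uses the word 3 counts from above; tab3 and the p_* names are illustrative only.

```r
# Counts for word 3, copied from the post (tab3 is an illustrative name).
tab3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)

# "two.sided" is the default; "less" and "greater" refer to the odds ratio
# being below or above 1.
p_two     <- fisher.test(tab3, alternative = "two.sided")$p.value
p_less    <- fisher.test(tab3, alternative = "less")$p.value
p_greater <- fisher.test(tab3, alternative = "greater")$p.value

print(c(two.sided = p_two, less = p_less, greater = p_greater))
```

A one-sided test only makes sense if the direction of the effect was fixed before looking at the data; when screening thousands of words in both directions, that is usually not the case.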

3.2 A big question: the result looks very promising, since this is a case of classifying fraud cases and the words selected by this approach make sense. However, I think the p-values here only indicate the strength of evidence against the null hypothesis, not the strength of association between a word and the class of a document. So what kind of statistic should I use to evaluate the strength of association? The odds ratio?
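
As an illustration of the odds-ratio idea, the sketch below computes the sample odds ratio for word 1 with a Wald-type 95% interval on the log scale, and compares it with the conditional estimate that fisher.test() reports. This is one possible effect-size measure, not necessarily the best one for this task; the variable names are made up for the example.

```r
# Counts for word 1, copied from the post (tab1 is an illustrative name).
tab1 <- matrix(c(11266, 125, 2151526, 31734), nrow = 2)

# Sample odds ratio: (a * d) / (b * c).
or_hat <- (tab1[1, 1] * tab1[2, 2]) / (tab1[1, 2] * tab1[2, 1])

# Wald standard error of log(OR): sqrt(1/a + 1/b + 1/c + 1/d).
se_log <- sqrt(sum(1 / tab1))
ci <- exp(log(or_hat) + c(-1, 1) * qnorm(0.975) * se_log)
print(c(sample_or = or_hat, lower = ci[1], upper = ci[2]))

# fisher.test() reports a conditional maximum likelihood estimate of the
# odds ratio, together with a confidence interval.
ft <- fisher.test(tab1)
print(ft$estimate)
print(ft$conf.int)
```

Unlike the p-value, the odds ratio does not grow simply because the sample is large, so ranking words by it (or by its confidence bound) separates effect size from sample size.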

Any suggestions are welcome!


Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

R-help@stat.math.ethz.ch mailing list
Received on Thu Jun 23 01:46:23 2005
