From: Weiwei Shi <helprhelp_at_gmail.com>

Date: Thu 23 Jun 2005 - 09:08:04 EST

Is it because my question was too long that no one has answered it? I should have split it up. :(

On 6/22/05, Kjetil Brinchmann Halvorsen <kjetil@acelerate.com> wrote:

> Weiwei Shi wrote:

*>
**> >Hi,
**> >I have a text mining project and I am currently working on the feature
**> >generation/selection part.
**> >My plan is to select a set of words or word combinations which
**> >discriminate better than other words between the group
**> >ids (2 classes in this case) for a dataset which has 2,000,000
**> >documents.
**> >
**> >One approach is using "contrast-set association rule mining" while the
**> >other is using chisqr or fisher exact test.
**> >
**> >An example with 3 contingency tables for 3 words follows
**> >(words coded by number):
**> >
**> >
**> >>tab[,,1:3]
**> >>
**> >>
**> >, , 1
**> >
**> >      [,1]    [,2]
**> >[1,] 11266 2151526
**> >[2,]   125   31734
**> >
**> >, , 2
**> >
**> >      [,1]    [,2]
**> >[1,] 43571 2119221
**> >[2,]    52   31807
**> >
**> >, , 3
**> >
**> >     [,1]    [,2]
**> >[1,]  427 2162365
**> >[2,]    5   31854
**> >
**> >
**> >I have some questions on this:
**> >1. What's the rule of thumb for using the chisq test instead of Fisher's
**> >exact test? I have a vague memory that the count in each cell needs
**> >to be over 50 if chisq is going to be used instead of Fisher's exact
**> >test. In the case of word 3, I think I should use the Fisher test.
**> >However, running chisq as below works fine:
**> >
**> >
**> >>tab[,,3]
**> >>
**> >>
**> >     [,1]    [,2]
**> >[1,]  427 2162365
**> >[2,]    5   31854
**> >
**> >
**> >>chisq.test(tab[,,3])
**> >>
**> >>
**> >
**> > Pearson's Chi-squared test with Yates' continuity correction
**> >
**> >data: tab[, , 3]
**> >X-squared = 0.0963, df = 1, p-value = 0.7564
**> >
**> >but running it on the whole set of words (14240 words) gives the
**> >following warnings:
**> >
**> >
**> >>p.chisq<-as.double(lapply(1:N, function(i) chisq.test(tab[,,i])$p.value))
**> >>
**> >>
**> >There were 50 or more warnings (use warnings() to see the first 50)
**> >
**> >
**> >>warnings()
**> >>
**> >>
**> >Warning messages:
**> >1: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
**> >2: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
**> >3: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
**> >4: Chi-squared approximation may be incorrect in: chisq.test(tab[, , i])
**> >
**> >
**> >2. So, my second question is: is this warning because I am violating the
**> >assumptions of the chisq test? But why is word 3 fine? How can I trace
**> >the warning to see which word caused it?
**> >
**> >3. My result looks like this (after mapping from number id to word;
**> >some words are stemmed here, e.g. ACCID is accident):
**> > > of[1:50,]
**> >      map...2.      p.fisher
**> >21       ACCID  0.000000e+00
**> >30          CD  0.000000e+00
**> >67        ROCK  0.000000e+00
**> >104      CRACK  0.000000e+00
**> >111       CHIP  0.000000e+00
**> >179      GLASS  0.000000e+00
**> >84        BACK 4.199878e-291
**> >395   DRIVEABL 5.335989e-287
**> >60         CAP 9.405235e-285
**> >262 WINDSHIELD 2.691641e-254
**> >13          IV 3.905186e-245
**> >110         HZ 2.819713e-210
**> >11        CAMP 9.086768e-207
**> >2      SHATTER 5.273994e-202
**> >297        ALP 1.678521e-177
**> >162        BED 1.822031e-173
**> >249        BCD 1.398391e-160
**> >493       RACK 4.178617e-156
**> >59        CAUS 7.539031e-147
**> >
**> >3.1 Question: Should I use a two-sided test instead of a one-sided one
**> >for the Fisher test? I read some material which suggests using two-sided.
**> >
**> >3.2 A big question: The result looks very promising, since this is a
**> >case of classifying fraud cases and the words selected by this
**> >approach make sense. However, I think the p-values here just indicate
**> >the strength of evidence against the null hypothesis, not the strength
**> >of association between word and class of document. So, what kind of
**> >statistic should I use here to evaluate the strength of association?
**> >Odds ratio?
**> >
**> >Any suggestions are welcome!
**> >
**> >Thanks!
**> >
**> >
**> You can use chisq.test with sim=TRUE (short for simulate.p.value=TRUE),
**> or call it as usual first, see if there is a warning, and then call it
**> again with sim=TRUE.
**>
**> Kjetil
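
A minimal sketch of the approach Kjetil describes, for illustration only: call chisq.test() as usual, catch the warning, and call it again with simulate.p.value = TRUE only for the tables that warned. Recording which tables warned also gives one way to trace the warning back to a word. The helper name chisq_p and the value B = 2000 are assumptions made for this example, and the array is rebuilt from the three tables printed in the original post. As far as I know, chisq.test() issues this warning when an expected (not observed) cell count is below 5, which would explain why word 3 passes despite its observed count of 5.

```r
# The three 2 x 2 tables from the original post, as a 2 x 2 x 3 array
# (R fills arrays column by column).
tab <- array(c(11266, 125, 2151526, 31734,
               43571,  52, 2119221, 31807,
                 427,   5, 2162365, 31854),
             dim = c(2, 2, 3))

# Hypothetical helper: run chisq.test(), remember whether it warned,
# and fall back to a simulated p-value only in that case.
chisq_p <- function(m, B = 2000) {
  warned <- FALSE
  res <- withCallingHandlers(
    chisq.test(m),
    warning = function(w) {
      warned <<- TRUE                 # flag this table
      invokeRestart("muffleWarning")  # keep the console quiet
    }
  )
  if (warned) {
    res <- chisq.test(m, simulate.p.value = TRUE, B = B)
  }
  c(p.value = res$p.value, warned = as.numeric(warned))
}

# One row per word: p-value and whether the plain test warned.
out <- t(apply(tab, 3, chisq_p))

# Indices of the words whose tables triggered the warning.
which(out[, "warned"] == 1)
```

Note that a simulated p-value cannot be smaller than 1/(B + 1), so it is of limited use for ranking thousands of words by p-value.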
**>
**> --
**>
**> Kjetil Halvorsen.
**>
**> Peace is the most effective weapon of mass construction.
**> -- Mahdi Elmandjra
**>
**>
**>
**>
**>
**>
*
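
On question 3.2 in the quoted message above, here is a hedged sketch of how the odds ratio could be read off alongside the p-value, using the table for word 3 from the post. fisher.test() on a 2 x 2 table reports a conditional maximum likelihood estimate of the odds ratio and a confidence interval; the plain cross-product ratio is computed next to it for comparison. The variable names (tab3, ft, sample_or) are illustrative, and this is only one possible measure of association, not a recommendation from the thread.

```r
# Contingency table for word 3, as printed in the original post.
tab3 <- matrix(c(427, 5, 2162365, 31854), nrow = 2)

# Fisher's exact test; the alternative is two-sided by default.
ft <- fisher.test(tab3)

ft$p.value    # evidence against independence
ft$estimate   # conditional MLE of the odds ratio
ft$conf.int   # confidence interval for the odds ratio (95% by default)

# Sample (cross-product) odds ratio, for comparison with ft$estimate.
sample_or <- (tab3[1, 1] * tab3[2, 2]) / (tab3[1, 2] * tab3[2, 1])
sample_or
```

Unlike the p-value, the odds ratio does not shrink toward zero simply because the dataset is large, so the two can be reported side by side when ranking words.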

--
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Thu Jun 23 09:12:34 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:32:57 EST