From: Kjetil Brinchmann Halvorsen <kjetil_at_acelerate.com>

Date: Sun 10 Jul 2005 - 22:51:54 EST

Date: Sun 10 Jul 2005 - 22:51:54 EST

>Dear all:

*>I still need some further help since I think the question itself might
**>be very interesting (i hope so:) :
**>the question is on chisq.test, my concern is which criteria should be
**>used here to evaluate the independence. The reason i use this old
**>subject of the email is, b/c I think the problem here is about how to
**>look at p.value, which evaluate the strength of evidence instead of
**>association. If i am wrong, please correct me.
**>
**>the result looks like this:
**> index word.comb id in.class0 in.class1 p.value odds.ratio
**>1 1 TOTAL|LAID 54|241 2 4 0.0004997501 0.00736433
**>2 2 THEFT|RECOV 52|53 40751 146 0.0004997501 4.17127643
**>3 3 BOLL|ACCID 10|21 36825 1202 0.0004997501 0.44178546
**>4 4 LAB|VANDAL 8|55 24192 429 0.0004997501 0.82876099
**>5 5 VANDAL|CAUS 55|59 801 64 0.0004997501 0.18405918
**>6 6 AI|TOTAL 9|54 1949 45 0.0034982509 0.63766766
**>7 7 AI|RECOV 9|53 2385 61 0.0004997501 0.57547012
**>8 8 THEFT|TOTAL 52|54 33651 110 0.0004997501 4.56174408
**>
**>the target is to look for best subset of word.comb to differentiate
**>between class0 and class1. p.value is obtained via
**>p.chisq.sim[i] <- as.double(chisq.test(tab, sim=TRUE, B=myB)$p.value)
**>and B=20000 (I increased B and it won't help. the margin here is
**>class0=2162792
**>class1=31859
**>)
**>
**>So, in conclusion, which one I should use first, odds.ratio or p.value
**>to find the best subset.
**>
**>
**>
*

If your goal is to discriminate between two different classes, why not
calculate directly

a measure of discriminative ability, such as probability of correct
classification?

Kjetil

>I read some on feature selection in text categorization (A comparative

*>study on feature selection in text categorization might be a good
**>ref.). Anyone here has other suggestions?
**>
**>thanks,
**>
**>weiwei
**>
**>
**>On 6/24/05, Gabor Grothendieck <ggrothendieck@gmail.com> wrote:
**>
**>
**>>On 6/24/05, Kjetil Brinchmann Halvorsen <kjetil@acelerate.com> wrote:
**>>
**>>
**>>>Weiwei Shi wrote:
**>>>
**>>>
**>>>
**>>>>Hi,
**>>>>I asked this question before, which was hidden in a bunch of
**>>>>questions. I repharse it here and hope I can get some help this time:
**>>>>
**>>>>I have 2 contingency tables which have the same group variable Y. I
**>>>>want to compare the strength of association between X1/Y and X2/Y. I
**>>>>am not sure if comparing p-values IS the way even though the
**>>>>probability of seeing such "weird" observation under H0 defines
**>>>>p-value and it might relate to the strength of association somehow.
**>>>>But I read the following statement from Alan Agresti's "An
**>>>>Introduction to Categorical Data Analysis" :
**>>>>"Chi-squared tests simply indicate the degree of EVIDENCE for an
**>>>>association....It is sensible to decompose chi-squared into
**>>>>components, study residuals, and estimate parameters such as odds
**>>>>ratios that describe the STRENGTH OF ASSOCIATION".
**>>>>
**>>>>
**>>>>
**>>>>
**>>>>
**>>>Here are some things you can do:
**>>>
**>>> > tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
**>>>
**>>> > tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
**>>> > library(epitools) # on CRAN
**>>> > ?odds.ratio
**>>>Help for 'odds.ratio' is shown in the browser
**>>> > library(help=epitools) # on CRAN
**>>> > tab1
**>>> [,1] [,2]
**>>>[1,] 11266 2151526
**>>>[2,] 125 31734
**>>> > odds.ratio(11266, 125, 2151526, 31734)
**>>>Error in fisher.test(tab) : FEXACT error 40.
**>>>Out of workspace. # so this are evidently for tables
**>>>with smaller counts
**>>> > library(vcd) # on CRAN
**>>>
**>>> > ?oddsratio
**>>>Help for 'oddsratio' is shown in the browser
**>>> > oddsratio( tab1) # really is logodds ratio
**>>>[1] 0.2807548
**>>> > plot(oddsratio( tab1) )
**>>> > library(help=vcd) # on CRAN Read this for many nice functions.
**>>> > fourfoldplot(tab1)
**>>> > mosaicplot(tab1) # not really usefull for this table
**>>>
**>>>Also has a look at function Crosstable in package gmodels.
**>>>
**>>>To decompose the chisqure you can program yourselves:
**>>>
**>>>decomp.chi <- function(tab) {
**>>> rows <- rowSums(tab)
**>>> cols <- colSums(tab)
**>>> N <- sum(rows)
**>>> E <- rows %o% cols / N
**>>> contrib <- (tab-E)^2/E
**>>> contrib }
**>>>
**>>>
**>>> > decomp.chi(tab1)
**>>> [,1] [,2]
**>>>[1,] 0.1451026 0.0007570624
**>>>[2,] 9.8504915 0.0513942218
**>>> >
**>>>
**>>>So you can easily see what cell contributes most to the overall chisquared.
**>>>
**>>>Kjetil
**>>>
**>>>
**>>>
**>>>
**>>>
**>>>
**>>>
**>>>>Can I do this "decomposition" in R for the following example including
**>>>>2 contingency tables?
**>>>>
**>>>>
**>>>>
**>>>>
**>>>>
**>>>>>tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
**>>>>>tab1
**>>>>>
**>>>>>
**>>>>>
**>>>>>
**>>>> [,1] [,2]
**>>>>[1,] 11266 2151526
**>>>>[2,] 125 31734
**>>>>
**>>>>
**>>>>
**>>>>
**>>>>
**>>>>>tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
**>>>>>tab2
**>>>>>
**>>>>>
**>>>>>
**>>>>>
**>>>> [,1] [,2]
**>>>>[1,] 43571 2119221
**>>>>[2,] 52 31807
**>>>>
**>>>>
**>>>>
**>>Here are a few more ways of doing this using chisq.test,
**>>glm and assocplot:
**>>
**>>
**>>
**>>>## chisq.test ###
**>>>
**>>>
**>>>tab1.chisq <- chisq.test(tab1)
**>>>
**>>>
**>>># decomposition of chisq
**>>>resid(tab1.chisq)^2
**>>>
**>>>
**>> [,1] [,2]
**>>[1,] 0.1451026 0.0007570624
**>>[2,] 9.8504915 0.0513942218
**>>
**>>
**>>
**>>># same
**>>>with(tab1.chisq, (observed - expected)^2/expected)
**>>>
**>>>
**>> [,1] [,2]
**>>[1,] 0.1451026 0.0007570624
**>>[2,] 9.8504915 0.0513942218
**>>
**>>
**>>
**>>
**>>># Pearson residuals
**>>>resid(tab1.chisq)
**>>>
**>>>
**>> [,1] [,2]
**>>[1,] 0.3809234 -0.02751477
**>>[2,] -3.1385493 0.22670294
**>>
**>>
**>>
**>>># same
**>>>with(tab1.chisq, (observed - expected)/sqrt(expected))
**>>>
**>>>
**>> [,1] [,2]
**>>[1,] 0.3809234 -0.02751477
**>>[2,] -3.1385493 0.22670294
**>>
**>>
**>>
**>>
**>>>### glm ###
**>>># Pearson residuals via glm
**>>>
**>>>
**>>>tab1.df <- data.frame(count = c(tab1), A = gl(2,2), B = gl(2,1,4))
**>>>tab1.glm <- glm(count ~ ., tab1.df, family = poisson())
**>>>resid(tab1.glm, type = "pearson")
**>>>
**>>>
**>> 1 2 3 4
**>> 0.38092339 -3.13854927 -0.02751477 0.22670294
**>>
**>>
**>>>plot(tab1.glm)
**>>>
**>>>
**>>>### assocplot ###
**>>># displaying Pearson residuals via an assocplot
**>>>assocplot(t(tab1))
**>>>
**>>>
**>
**>
**>
**>
*

-- Kjetil Halvorsen. Peace is the most effective weapon of mass construction. -- Mahdi Elmandjra -- No virus found in this outgoing message. Checked by AVG Anti-Virus. ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.htmlReceived on Sun Jul 10 22:57:57 2005

*
This archive was generated by hypermail 2.1.8
: Fri 03 Mar 2006 - 03:33:27 EST
*