Re: [R] comparing strength of association instead of strength of evidence?

From: Weiwei Shi <helprhelp_at_gmail.com>
Date: Mon 11 Jul 2005 - 01:19:27 EST

For this step, my purpose is feature construction and feature subsetting. In this project, I am using contrast association rule mining to build word-combinations; in the previous example, 2-itemset was created from CAR and tested for their "dependency" on class for feature subsetting/selection. Other methods like mutual information might be used too. Any methods that won't replicate "mechanisms" in my following step (see below) can be tried here. Could you detail what you meant by "a discriminative measure"?

The whole datasets also contain many non-words variables. To combine both data mining and text categorization is the focus and interests of this project. Decision tree, Bayesian network or SVM/LSI might be candidates.

Thanks for further suggestion,

Weiwei

On 7/10/05, Kjetil Brinchmann Halvorsen <kjetil@acelerate.com> wrote:
> Weiwei Shi wrote:
>
> >Dear all:
> >I still need some further help since I think the question itself might
> >be very interesting (i hope so:) :
> >the question is on chisq.test, my concern is which criteria should be
> >used here to evaluate the independence. The reason i use this old
> >subject of the email is, b/c I think the problem here is about how to
> >look at p.value, which evaluate the strength of evidence instead of
> >association. If i am wrong, please correct me.
> >
> >the result looks like this:
> > index word.comb id in.class0 in.class1 p.value odds.ratio
> >1 1 TOTAL|LAID 54|241 2 4 0.0004997501 0.00736433
> >2 2 THEFT|RECOV 52|53 40751 146 0.0004997501 4.17127643
> >3 3 BOLL|ACCID 10|21 36825 1202 0.0004997501 0.44178546
> >4 4 LAB|VANDAL 8|55 24192 429 0.0004997501 0.82876099
> >5 5 VANDAL|CAUS 55|59 801 64 0.0004997501 0.18405918
> >6 6 AI|TOTAL 9|54 1949 45 0.0034982509 0.63766766
> >7 7 AI|RECOV 9|53 2385 61 0.0004997501 0.57547012
> >8 8 THEFT|TOTAL 52|54 33651 110 0.0004997501 4.56174408
> >
> >the target is to look for best subset of word.comb to differentiate
> >between class0 and class1. p.value is obtained via
> >p.chisq.sim[i] <- as.double(chisq.test(tab, sim=TRUE, B=myB)$p.value)
> >and B=20000 (I increased B and it won't help. the margin here is
> >class0=2162792
> >class1=31859
> >)
> >
> >So, in conclusion, which one I should use first, odds.ratio or p.value
> >to find the best subset.
> >
> >
> >
> If your goal is to discriminate between two different classes, why not
> calculate directly
> a measure of discriminative ability, such as probability of correct
> classification?
>
> Kjetil
>
> >I read some on feature selection in text categorization (A comparative
> >study on feature selection in text categorization might be a good
> >ref.). Anyone here has other suggestions?
> >
> >thanks,
> >
> >weiwei
> >
> >
> >On 6/24/05, Gabor Grothendieck <ggrothendieck@gmail.com> wrote:
> >
> >
> >>On 6/24/05, Kjetil Brinchmann Halvorsen <kjetil@acelerate.com> wrote:
> >>
> >>
> >>>Weiwei Shi wrote:
> >>>
> >>>
> >>>
> >>>>Hi,
> >>>>I asked this question before, which was hidden in a bunch of
> >>>>questions. I repharse it here and hope I can get some help this time:
> >>>>
> >>>>I have 2 contingency tables which have the same group variable Y. I
> >>>>want to compare the strength of association between X1/Y and X2/Y. I
> >>>>am not sure if comparing p-values IS the way even though the
> >>>>probability of seeing such "weird" observation under H0 defines
> >>>>p-value and it might relate to the strength of association somehow.
> >>>>But I read the following statement from Alan Agresti's "An
> >>>>Introduction to Categorical Data Analysis" :
> >>>>"Chi-squared tests simply indicate the degree of EVIDENCE for an
> >>>>association....It is sensible to decompose chi-squared into
> >>>>components, study residuals, and estimate parameters such as odds
> >>>>ratios that describe the STRENGTH OF ASSOCIATION".
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>Here are some things you can do:
> >>>
> >>> > tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
> >>>
> >>> > tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
> >>> > library(epitools) # on CRAN
> >>> > ?odds.ratio
> >>>Help for 'odds.ratio' is shown in the browser
> >>> > library(help=epitools) # on CRAN
> >>> > tab1
> >>> [,1] [,2]
> >>>[1,] 11266 2151526
> >>>[2,] 125 31734
> >>> > odds.ratio(11266, 125, 2151526, 31734)
> >>>Error in fisher.test(tab) : FEXACT error 40.
> >>>Out of workspace. # so this are evidently for tables
> >>>with smaller counts
> >>> > library(vcd) # on CRAN
> >>>
> >>> > ?oddsratio
> >>>Help for 'oddsratio' is shown in the browser
> >>> > oddsratio( tab1) # really is logodds ratio
> >>>[1] 0.2807548
> >>> > plot(oddsratio( tab1) )
> >>> > library(help=vcd) # on CRAN Read this for many nice functions.
> >>> > fourfoldplot(tab1)
> >>> > mosaicplot(tab1) # not really usefull for this table
> >>>
> >>>Also has a look at function Crosstable in package gmodels.
> >>>
> >>>To decompose the chisqure you can program yourselves:
> >>>
> >>>decomp.chi <- function(tab) {
> >>> rows <- rowSums(tab)
> >>> cols <- colSums(tab)
> >>> N <- sum(rows)
> >>> E <- rows %o% cols / N
> >>> contrib <- (tab-E)^2/E
> >>> contrib }
> >>>
> >>>
> >>> > decomp.chi(tab1)
> >>> [,1] [,2]
> >>>[1,] 0.1451026 0.0007570624
> >>>[2,] 9.8504915 0.0513942218
> >>> >
> >>>
> >>>So you can easily see what cell contributes most to the overall chisquared.
> >>>
> >>>Kjetil
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>Can I do this "decomposition" in R for the following example including
> >>>>2 contingency tables?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
> >>>>>tab1
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> [,1] [,2]
> >>>>[1,] 11266 2151526
> >>>>[2,] 125 31734
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
> >>>>>tab2
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> [,1] [,2]
> >>>>[1,] 43571 2119221
> >>>>[2,] 52 31807
> >>>>
> >>>>
> >>>>
> >>Here are a few more ways of doing this using chisq.test,
> >>glm and assocplot:
> >>
> >>
> >>
> >>>## chisq.test ###
> >>>
> >>>
> >>>tab1.chisq <- chisq.test(tab1)
> >>>
> >>>
> >>># decomposition of chisq
> >>>resid(tab1.chisq)^2
> >>>
> >>>
> >> [,1] [,2]
> >>[1,] 0.1451026 0.0007570624
> >>[2,] 9.8504915 0.0513942218
> >>
> >>
> >>
> >>># same
> >>>with(tab1.chisq, (observed - expected)^2/expected)
> >>>
> >>>
> >> [,1] [,2]
> >>[1,] 0.1451026 0.0007570624
> >>[2,] 9.8504915 0.0513942218
> >>
> >>
> >>
> >>
> >>># Pearson residuals
> >>>resid(tab1.chisq)
> >>>
> >>>
> >> [,1] [,2]
> >>[1,] 0.3809234 -0.02751477
> >>[2,] -3.1385493 0.22670294
> >>
> >>
> >>
> >>># same
> >>>with(tab1.chisq, (observed - expected)/sqrt(expected))
> >>>
> >>>
> >> [,1] [,2]
> >>[1,] 0.3809234 -0.02751477
> >>[2,] -3.1385493 0.22670294
> >>
> >>
> >>
> >>
> >>>### glm ###
> >>># Pearson residuals via glm
> >>>
> >>>
> >>>tab1.df <- data.frame(count = c(tab1), A = gl(2,2), B = gl(2,1,4))
> >>>tab1.glm <- glm(count ~ ., tab1.df, family = poisson())
> >>>resid(tab1.glm, type = "pearson")
> >>>
> >>>
> >> 1 2 3 4
> >> 0.38092339 -3.13854927 -0.02751477 0.22670294
> >>
> >>
> >>>plot(tab1.glm)
> >>>
> >>>
> >>>### assocplot ###
> >>># displaying Pearson residuals via an assocplot
> >>>assocplot(t(tab1))
> >>>
> >>>
> >
> >
> >
> >
>
>
> --
>
> Kjetil Halvorsen.
>
> Peace is the most effective weapon of mass construction.
> -- Mahdi Elmandjra
>
>
>
>
> --
> No virus found in this outgoing message.
> Checked by AVG Anti-Virus.
> Version: 7.0.323 / Virus Database: 267.8.11/44 - Release Date: 08/07/2005
>
>

-- 
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Mon Jul 11 01:26:00 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:33:27 EST