From: Berton Gunter <gunter.berton_at_gene.com>

Date: Sat 09 Jul 2005 - 03:23:02 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Sat Jul 09 03:28:49 2005

Date: Sat 09 Jul 2005 - 03:23:02 EST

But isn't this the old "chestnut" that any "effect" will be found
"significant" given enough data; and with too few data, not even a large one
can be distinguished from noise?

If so, it's a good question that has more to do with the philosophy of science than statistics. Bayesians, of course, would disagree (but they don't have p-values).

- Bert Gunter Genentech Non-Clinical Statistics South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box

> -----Original Message-----

*> From: r-help-bounces@stat.math.ethz.ch
**> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Weiwei Shi
**> Sent: Friday, July 08, 2005 10:06 AM
**> To: Gabor Grothendieck
**> Cc: R-help@stat.math.ethz.ch; Kjetil Brinchmann Halvorsen
**> Subject: Re: [R] comparing strength of association instead of
**> strength ofevidence?
**>
**> Dear all:
**> I still need some further help since I think the question itself might
**> be very interesting (i hope so:) :
**> the question is on chisq.test, my concern is which criteria should be
**> used here to evaluate the independence. The reason i use this old
**> subject of the email is, b/c I think the problem here is about how to
**> look at p.value, which evaluate the strength of evidence instead of
**> association. If i am wrong, please correct me.
**>
**> the result looks like this:
**> index word.comb id in.class0 in.class1 p.value
**> odds.ratio
**> 1 1 TOTAL|LAID 54|241 2 4 0.0004997501
**> 0.00736433
**> 2 2 THEFT|RECOV 52|53 40751 146 0.0004997501
**> 4.17127643
**> 3 3 BOLL|ACCID 10|21 36825 1202 0.0004997501
**> 0.44178546
**> 4 4 LAB|VANDAL 8|55 24192 429 0.0004997501
**> 0.82876099
**> 5 5 VANDAL|CAUS 55|59 801 64 0.0004997501
**> 0.18405918
**> 6 6 AI|TOTAL 9|54 1949 45 0.0034982509
**> 0.63766766
**> 7 7 AI|RECOV 9|53 2385 61 0.0004997501
**> 0.57547012
**> 8 8 THEFT|TOTAL 52|54 33651 110 0.0004997501
**> 4.56174408
**>
**> the target is to look for best subset of word.comb to differentiate
**> between class0 and class1. p.value is obtained via
**> p.chisq.sim[i] <- as.double(chisq.test(tab, sim=TRUE, B=myB)$p.value)
**> and B=20000 (I increased B and it won't help. the margin here is
**> class0=2162792
**> class1=31859
**> )
**>
**> So, in conclusion, which one I should use first, odds.ratio or p.value
**> to find the best subset.
**>
**> I read some on feature selection in text categorization (A comparative
**> study on feature selection in text categorization might be a good
**> ref.). Anyone here has other suggestions?
**>
**> thanks,
**>
**> weiwei
**>
**>
**> On 6/24/05, Gabor Grothendieck <ggrothendieck@gmail.com> wrote:
**> > On 6/24/05, Kjetil Brinchmann Halvorsen
**> <kjetil@acelerate.com> wrote:
**> > > Weiwei Shi wrote:
**> > >
**> > > >Hi,
**> > > >I asked this question before, which was hidden in a bunch of
**> > > >questions. I repharse it here and hope I can get some
**> help this time:
**> > > >
**> > > >I have 2 contingency tables which have the same group
**> variable Y. I
**> > > >want to compare the strength of association between X1/Y
**> and X2/Y. I
**> > > >am not sure if comparing p-values IS the way even though the
**> > > >probability of seeing such "weird" observation under H0 defines
**> > > >p-value and it might relate to the strength of
**> association somehow.
**> > > >But I read the following statement from Alan Agresti's "An
**> > > >Introduction to Categorical Data Analysis" :
**> > > >"Chi-squared tests simply indicate the degree of EVIDENCE for an
**> > > >association....It is sensible to decompose chi-squared into
**> > > >components, study residuals, and estimate parameters such as odds
**> > > >ratios that describe the STRENGTH OF ASSOCIATION".
**> > > >
**> > > >
**> > > >
**> > > Here are some things you can do:
**> > >
**> > > > tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
**> > >
**> > > > tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
**> > > > library(epitools) # on CRAN
**> > > > ?odds.ratio
**> > > Help for 'odds.ratio' is shown in the browser
**> > > > library(help=epitools) # on CRAN
**> > > > tab1
**> > > [,1] [,2]
**> > > [1,] 11266 2151526
**> > > [2,] 125 31734
**> > > > odds.ratio(11266, 125, 2151526, 31734)
**> > > Error in fisher.test(tab) : FEXACT error 40.
**> > > Out of workspace. # so this are evidently
**> for tables
**> > > with smaller counts
**> > > > library(vcd) # on CRAN
**> > >
**> > > > ?oddsratio
**> > > Help for 'oddsratio' is shown in the browser
**> > > > oddsratio( tab1) # really is logodds ratio
**> > > [1] 0.2807548
**> > > > plot(oddsratio( tab1) )
**> > > > library(help=vcd) # on CRAN Read this for many nice functions.
**> > > > fourfoldplot(tab1)
**> > > > mosaicplot(tab1) # not really usefull for this table
**> > >
**> > > Also has a look at function Crosstable in package gmodels.
**> > >
**> > > To decompose the chisqure you can program yourselves:
**> > >
**> > > decomp.chi <- function(tab) {
**> > > rows <- rowSums(tab)
**> > > cols <- colSums(tab)
**> > > N <- sum(rows)
**> > > E <- rows %o% cols / N
**> > > contrib <- (tab-E)^2/E
**> > > contrib }
**> > >
**> > >
**> > > > decomp.chi(tab1)
**> > > [,1] [,2]
**> > > [1,] 0.1451026 0.0007570624
**> > > [2,] 9.8504915 0.0513942218
**> > > >
**> > >
**> > > So you can easily see what cell contributes most to the
**> overall chisquared.
**> > >
**> > > Kjetil
**> > >
**> > >
**> > >
**> > >
**> > >
**> > > >Can I do this "decomposition" in R for the following
**> example including
**> > > >2 contingency tables?
**> > > >
**> > > >
**> > > >
**> > > >>tab1<-array(c(11266, 125, 2151526, 31734), dim=c(2,2))
**> > > >>tab1
**> > > >>
**> > > >>
**> > > > [,1] [,2]
**> > > >[1,] 11266 2151526
**> > > >[2,] 125 31734
**> > > >
**> > > >
**> > > >
**> > > >>tab2<-array(c(43571, 52, 2119221, 31807), dim=c(2,2))
**> > > >>tab2
**> > > >>
**> > > >>
**> > > > [,1] [,2]
**> > > >[1,] 43571 2119221
**> > > >[2,] 52 31807
**> > > >
**> >
**> >
**> > Here are a few more ways of doing this using chisq.test,
**> > glm and assocplot:
**> >
**> > > ## chisq.test ###
**> >
**> > > tab1.chisq <- chisq.test(tab1)
**> >
**> > > # decomposition of chisq
**> > > resid(tab1.chisq)^2
**> > [,1] [,2]
**> > [1,] 0.1451026 0.0007570624
**> > [2,] 9.8504915 0.0513942218
**> >
**> > > # same
**> > > with(tab1.chisq, (observed - expected)^2/expected)
**> > [,1] [,2]
**> > [1,] 0.1451026 0.0007570624
**> > [2,] 9.8504915 0.0513942218
**> >
**> >
**> > > # Pearson residuals
**> > > resid(tab1.chisq)
**> > [,1] [,2]
**> > [1,] 0.3809234 -0.02751477
**> > [2,] -3.1385493 0.22670294
**> >
**> > > # same
**> > > with(tab1.chisq, (observed - expected)/sqrt(expected))
**> > [,1] [,2]
**> > [1,] 0.3809234 -0.02751477
**> > [2,] -3.1385493 0.22670294
**> >
**> >
**> > > ### glm ###
**> > > # Pearson residuals via glm
**> >
**> > > tab1.df <- data.frame(count = c(tab1), A = gl(2,2), B = gl(2,1,4))
**> > > tab1.glm <- glm(count ~ ., tab1.df, family = poisson())
**> > > resid(tab1.glm, type = "pearson")
**> > 1 2 3 4
**> > 0.38092339 -3.13854927 -0.02751477 0.22670294
**> > > plot(tab1.glm)
**> >
**> > > ### assocplot ###
**> > > # displaying Pearson residuals via an assocplot
**> > > assocplot(t(tab1))
**> >
**>
**>
**> --
**> Weiwei Shi, Ph.D
**>
**> "Did you always know?"
**> "No, I did not. But I believed..."
**> ---Matrix III
**>
**> ______________________________________________
**> R-help@stat.math.ethz.ch mailing list
**> https://stat.ethz.ch/mailman/listinfo/r-help
**> PLEASE do read the posting guide!
**> http://www.R-project.org/posting-guide.html
*

>

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Sat Jul 09 03:28:49 2005

*
This archive was generated by hypermail 2.1.8
: Fri 03 Mar 2006 - 03:33:26 EST
*