[R] coding to generate a matrix to prepare for chi-sqr test for text mining

From: Weiwei Shi <helprhelp_at_gmail.com>
Date: Thu 16 Jun 2005 - 07:10:27 EST

Hi, there:
I have a dataset like the following:



The first column is always id_num and the last one is the class label. I want to run a chi-square test of the dependency between a word (or, later, a word combination) and the class label.

For example, my goal is to build a table like the following, ready for a chi-square test:

class label    ACCID (Yes)    ACCID (No)
     1              10             15
     0               5              9

Each number is the number of lines (observations). Later I want to use a word combination such as ACCID & WINDOW instead of ACCID only (this combination was generated by association analysis in another program of mine).
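
A minimal sketch of how such a table could be built in R for a single word and for a word combination. The toy data, the object names (words, label) and the helper functions (has_all, word_table) are assumptions made for illustration; the real dataset is not shown in this post.

```r
# Hypothetical toy data: one character vector of words per observation,
# plus the class label of each observation.
words <- list(
  c("ACCID", "WINDOW", "BROKEN"),
  c("ACCID", "DOOR"),
  c("WINDOW", "GLASS"),
  c("ACCID", "WINDOW"),
  c("DOOR", "LOCK")
)
label <- c(1, 1, 0, 0, 0)

# TRUE when every word in 'target' occurs in the observation
has_all <- function(obs, target) all(target %in% obs)

# Cross-tabulate the class label against presence of the word(s)
word_table <- function(target) {
  present <- vapply(words, has_all, logical(1), target = target)
  table(
    class = factor(label, levels = c(1, 0)),
    word  = factor(present, levels = c(TRUE, FALSE), labels = c("Yes", "No"))
  )
}

tab_accid <- word_table("ACCID")               # single word
tab_combo <- word_table(c("ACCID", "WINDOW"))  # word combination
```

Fixing the factor levels keeps the table 2 x 2 even when a word never occurs in one of the classes.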

My first question is how to do this automatically in R, that is, how to build a data structure (a data frame) that represents the table above for each word. I am learning R programming and I don't want to do it in Python. (Don't worry if a word appears twice in one observation; I have another version of the data set which lists only unique words.)
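
One possible way to do this for every word at once is sketched below. It assumes each input line is whitespace-separated, with id_num first, the words next, and the class label (1 or 0) last. The sample lines and all object names (raw, fields, incidence, counts) are hypothetical.

```r
# Hypothetical input lines: id_num, words, class label (separator assumed)
raw <- c(
  "101 ACCID WINDOW BROKEN 1",
  "102 ACCID DOOR 1",
  "103 WINDOW GLASS 0",
  "104 ACCID WINDOW 0",
  "105 DOOR LOCK 0"
)

fields <- strsplit(raw, "[[:space:]]+")

# Last field is the class label; everything between first and last is a word
label <- as.integer(vapply(fields, function(f) f[length(f)], character(1)))
words <- lapply(fields, function(f) unique(f[-c(1, length(f))]))

# Observation-by-word incidence matrix: TRUE when the word occurs in the line
obs_id    <- factor(rep(seq_along(words), lengths(words)),
                    levels = seq_along(words))
incidence <- unclass(table(obs_id, unlist(words))) > 0

# One row per word, holding the four cells of its 2 x 2 table
yes_1 <- colSums(incidence & label == 1)
yes_0 <- colSums(incidence & label == 0)
counts <- data.frame(
  word  = colnames(incidence),
  yes_1 = yes_1,
  no_1  = sum(label == 1) - yes_1,
  yes_0 = yes_0,
  no_0  = sum(label == 0) - yes_0,
  row.names = NULL,
  stringsAsFactors = FALSE
)
```

Each row of counts can be turned back into the 2 x 2 table for that word with matrix().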

My goal is to get a p-value for each word/class-label pair from the chi-square test and use it to evaluate the significance of each feature for later text mining. I am not sure whether this is a good idea, and I am reading some papers on it.
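
A rough sketch of the p-value step using chisq.test() from base R. The ACCID counts are the ones in the example table above. The WINDOW row and the column names of counts are made up for illustration.

```r
# Counts from the example table (rows: class 1, class 0; columns: Yes, No)
tab <- matrix(c(10, 5, 15, 9), nrow = 2,
              dimnames = list(class = c("1", "0"), ACCID = c("Yes", "No")))

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default
res <- chisq.test(tab)
res$p.value

# Hypothetical per-word counts; only the ACCID row comes from the post
counts <- data.frame(
  word  = c("ACCID", "WINDOW"),
  yes_1 = c(10, 12), no_1 = c(15, 13),
  yes_0 = c(5, 2),   no_0 = c(9, 12),
  stringsAsFactors = FALSE
)

# One chi-square test per word, keeping only the p-value
counts$p_value <- mapply(function(y1, n1, y0, n0) {
  chisq.test(matrix(c(y1, y0, n1, n0), nrow = 2))$p.value
}, counts$yes_1, counts$no_1, counts$yes_0, counts$no_0)
```

For words with very small expected counts, chisq.test() warns that the approximation may be incorrect; fisher.test() is one alternative in that case.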


Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

