From: Huntsinger, Reid <reid_huntsinger_at_merck.com>

Date: Thu 16 Jun 2005 - 08:27:45 EST

...

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide!

http://www.R-project.org/posting-guide.html

Received on Thu Jun 16 08:37:57 2005

I would compile a table of all the words in the dataset (maybe you have it
already), then create a list where each component is an integer vector of
indices of words. That is, replace words by their positions in the table.
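A minimal sketch of that first step, assuming the records are pipe-delimited
lines as in the original question (id first, class label last). The names
"lines", "fields", "tokens" and "vocab" are illustrative only; "words" and
"class" are built to match the description further down.

```r
# Sample records in the format of the original question (id|word|...|label).
lines <- c("1412|WINDOW|SHATTER|TORN|0",
           "2412|ACCID|REAREND|IV|1",
           "2415|ACCID|SINGL|IV|0")

fields <- strsplit(lines, "|", fixed = TRUE)

# Drop the id (first field) and the class label (last field).
tokens <- lapply(fields, function(f) f[-c(1, length(f))])

# Class labels recoded from 0/1 to 1/2 so they can serve as array indices.
class <- sapply(fields, function(f) as.integer(f[length(f)])) + 1L

# Table of all distinct words, then each observation as positions in it.
vocab <- sort(unique(unlist(tokens)))
words <- lapply(tokens, function(w) match(w, vocab))
```

With this layout, vocab[words[[i]]] recovers the original words of
observation i.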

From that sparse form you could create binary features to use with standard
classification methods, or for example compute the X'X matrix for linear
regression directly (you would probably want to throw out infrequently
occurring words to keep the matrix small enough to work with in memory). For
your specific question, say "words" is the list of integer vectors as above,
and "class" is the vector of class labels (1 or 2 to make it a valid index)
corresponding to a given vector. Then you can fill in the "present" (==1)
parts of the table class x presence x word via

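n <- length(words)               # number of observations
nwords <- max(unlist(words))     # size of the word table (third dimension)
tab <- array(0L, dim = c(2, 2, nwords))
for (i in seq_len(n)) {
  # unique() so a word repeated within one observation is counted once
  for (word in unique(words[[i]]))
    tab[class[i], 1, word] <- tab[class[i], 1, word] + 1L
}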

and the "absent" (==2) parts are then easy:

tab[1,2,] <- sum(class == 1) - tab[1,1,]
tab[2,2,] <- sum(class == 2) - tab[2,1,]

so now you can use chisq.test on each of the 2 x 2 tables tab[,,i] for i a word index, all at once using apply() if convenient.
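As a sketch of that last step, here is apply() over the word dimension with a
small made-up "tab" (the counts are illustrative only, not taken from the data
set; "pvals" is a name chosen here):

```r
# Made-up class x presence x word counts for 3 words (illustrative only).
# array() fills column-major, so the first slice is rows (10, 15) and (5, 9).
tab <- array(c(10L, 5L, 15L, 9L,
               20L, 4L, 5L, 16L,
               12L, 12L, 13L, 2L), dim = c(2, 2, 3))

# One chi-square test per 2 x 2 slice tab[,,i]; keep only the p-value.
# suppressWarnings() hides the small-expected-count warning chisq.test may give.
pvals <- apply(tab, 3, function(m) suppressWarnings(chisq.test(m)$p.value))
```

For words with very small counts the chi-square approximation may be poor;
fisher.test() on the same 2 x 2 slices is one possible alternative.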

Reid Huntsinger

-----Original Message-----

From: r-help-bounces@stat.math.ethz.ch

[mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Weiwei Shi
Sent: Wednesday, June 15, 2005 5:10 PM

To: R-help@stat.math.ethz.ch

Subject: [R] coding to generate a matrix to prepare for chi-sqr test for
text mining

Hi, there:

I have a dataset like the following:

1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0
1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0
2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1
2414|IV|REAREND|CD|COG|LAB|ADVERS|1
2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0
2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1
2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1
2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1
2422|THEFT|RECOV|TOTAL|THEFT|0

...

The first column is always id_num; the last one is the class label. I want to run chi-square tests of the dependency between a word (or, later, a word combination) and the class label.

For example, my goal is to build a table like the following, ready for a chi-square test:

                 ACCID (Yes)   ACCID (No)
class label 1        10            15
class label 0         5             9

The numbers are counts of lines (observations). Later I want to use word combinations such as ACCID & WINDOW (these come from an association analysis done in another program of mine) instead of ACCID only.

My first question is how to do this automatically in R, that is, how to build a data structure (a data frame) representing the table above for each word. I am learning R programming and would rather not do it in Python. (Don't worry if a word appears twice in one observation; I have another version of the data set that lists only unique words.)

My target is to find a p-value for each word/class label from a chi-square test and to evaluate the significance of each feature for later text mining. I am not sure whether this is a good idea, and I am reading some papers on it.

Thanks,

--

Weiwei Shi, Ph.D

"Did you always know?"

"No, I did not. But I believed..."

---Matrix III


This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:32:43 EST