[R] text mining - text comparing

From: Matevž Pavlič <matevz.pavlic_at_gi-zrmk.si>
Date: Wed, 25 May 2011 22:49:15 +0200


Hi all,  

I'll try to explain what I would like to achieve.

I have two problems that I need help with, if anyone has a clue.

1.) I have a TXT file containing two fields: USCS and Description.

For each USCS value I have a Description field that contains a lot of words describing that particular USCS type. What I would like to do is mine the text using the tm package in order to find which words in the Description field are the most frequent for each USCS value.

Now, I don't think I will have problems with that part; the problem is importing the data. There are around 300 different USCS - Description combinations, which is of course too many to sort out by hand. I would have to create a Corpus of around 300 texts which I could later analyze. Here is where I get stuck: I cannot find a way to import the data into a Corpus so that each text is named after its USCS value and contains the strings (words) of the Description field.

Attached (temp.txt) is a small dataset.  
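
To make the import step concrete, here is a minimal sketch of one possible route, not a definitive solution. The layout of temp.txt is not shown in this message, so the sketch assumes a tab-separated file with a header row naming the columns USCS and Description, and it uses a tiny made-up stand-in for the data so that it runs without the attachment. The tm part assumes a recent tm release in which DataframeSource() expects columns named doc_id and text; it is skipped if tm is not installed.

```r
# Assumption: temp.txt is tab-separated with header columns USCS and Description.
# The rows below are made up for illustration only; with the real file one
# would use: dat <- read.delim("temp.txt", stringsAsFactors = FALSE)
dat <- read.delim(text = "USCS\tDescription
GW\twell graded gravel with sand
GW\tgravel sandy dense
CL\tlean clay stiff brown
CL\tclay with sand", stringsAsFactors = FALSE)

# One text per USCS value: paste all descriptions of that value together.
texts <- tapply(dat$Description, dat$USCS, paste, collapse = " ")

# One row per document, in the column layout assumed for DataframeSource().
docs <- data.frame(doc_id = names(texts),
                   text   = as.vector(texts),
                   stringsAsFactors = FALSE)

# Base-R word counts per USCS value (works without tm).
word_freq <- lapply(split(docs$text, docs$doc_id), function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  sort(table(words[words != ""]), decreasing = TRUE)
})

# The same idea with tm: one document per USCS value, named by doc_id.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::DataframeSource(docs))
  dtm    <- tm::DocumentTermMatrix(corpus)
  freq   <- as.matrix(dtm)  # rows: documents (USCS values), columns: words
  # Most frequent word in each document.
  print(apply(freq, 1, function(r) names(which.max(r))))
}
```

With the real data, word_freq$GW (or the corresponding row of the document-term matrix) would list the words of the GW descriptions ordered by frequency. Collapsing the descriptions before building the Corpus keeps the number of documents equal to the number of USCS values, which is what naming each text after its USCS value requires.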

2.) The second thing is about comparing text. I have some problems with typos in the text, so I would like to find words that are similar (but spelled incorrectly), much like the suggested words you get when typing into the Google search engine. Has anyone had any experience with that?

I hope I explained it OK; otherwise I'll try again.
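
One way to picture the typo problem is as a nearest-word lookup by edit distance. The sketch below is only an illustration under stated assumptions: the vocabulary of correctly spelled words and the typed words are both made up, and it relies on adist() from the utils package, which is only available in newer versions of R (agrep() is an older base alternative for approximate matching).

```r
# Hypothetical vocabulary of correctly spelled words (an assumption).
vocab <- c("gravel", "sand", "clay", "silt")

# Hypothetical words as they appear in the text, some with typos.
typed <- c("gravle", "snad", "clay", "sitl", "gravel")

# Matrix of Levenshtein edit distances: one row per typed word,
# one column per vocabulary word.
d <- adist(typed, vocab)

# For each typed word, take the closest vocabulary word and its distance.
nearest  <- vocab[apply(d, 1, which.min)]
distance <- apply(d, 1, min)

suggestions <- data.frame(typed, suggestion = nearest, distance,
                          stringsAsFactors = FALSE)
print(suggestions)
```

A distance of 0 means the word is already in the vocabulary. In practice one would probably accept a suggestion only below some distance threshold, since every word gets a nearest neighbour even when nothing in the vocabulary is really close.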

Tnx, m



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Wed 25 May 2011 - 21:05:51 GMT
