Re: [R] Essay identification

From: Ted Harding <Ted.Harding_at_nessie.mcc.ac.uk>
Date: Mon 13 Jun 2005 - 09:47:05 EST

On 12-Jun-05 Berton Gunter wrote:
> I assume that you know the usual procedure is to 'score'
> each essay by a vector that gives the frequency of occurrence
> of commonly used (sometimes adding subject matter specific)
> words and phrases. This multivariate response is then fed in
> as a "training set" into your favorite supervised
> learning/classification procedure. R has many of these -- trees,
> logistic regression, boosting, Random Forests, SVMs, LDA, SOMs
> (whoops -- that's an unsupervised one), ... . Try
> RSiteSearch('Classification', restrict='functions').
>
> The devil is in the details as to what works best, I believe.
> With only 78 exemplars in 10 groups, unless there is a lot of
> separation (disparate styles that you could probably detect
> manually) it may be difficult. It also depends on how large
> each group is (balance is generally better).
>
> Cheers,
> Bert
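
As a rough illustration of the scoring step Bert describes, here is a minimal sketch in Python (not R) that turns an essay into a vector of relative word frequencies. The vocabulary `VOCAB`, the helper names and the sample sentence are assumptions made up for the example; the resulting vectors would then be passed, with the known group labels, to whichever classifier you prefer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in vocabulary]

# Hypothetical vocabulary of common function words (an assumption for illustration).
VOCAB = ["the", "of", "and", "to", "a", "in", "that", "is"]

essay = "The cat sat on the mat and the dog lay in the sun."
vec = frequency_vector(essay, VOCAB)
print(vec)
```

Dividing by the total token count keeps essays of different lengths comparable, at the price of discarding length itself as a feature.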

I would add to Berton's list such scores as numbers of different words used, sentence lengths, relative frequencies of verbs, nouns, adjectives, adverbs, and so on, perhaps scaled by overall length. Length of Essay might even be a discriminant!
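
A few of these scores are easy to compute directly. The following is only a sketch in Python, assuming a crude tokenizer and sentence splitter (both regular expressions are my own simplifications); part-of-speech frequencies are left out because they would need a tagger.

```python
import re

def style_features(text):
    """Simple length and vocabulary scores for one essay."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Crude sentence split on runs of ., ! or ? (an assumption; abbreviations will confuse it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_tokens = len(tokens)
    n_types = len(set(tokens))  # number of different words used
    return {
        "n_tokens": n_tokens,
        "n_types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "mean_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
    }

features = style_features("The cat sat. The dog barked loudly! Why?")
print(features)
```

The type/token ratio is one way of scaling the number of different words by overall length, though it is itself known to depend on essay length.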

You could also look at more subtle characteristics such as "Zipf bins"[*] -- the relative numbers of different words which occur once only, twice, three times, ... (though I'm not sure how you would score such a thing for classification purposes).
[*] A term I've just invented, inspired by the original instance
    of this by the linguist Zipf, later giving rise to the
    logarithmic distribution in the historic paper by Fisher,
    Corbet & Williams on the "Numbers of Species and Numbers
    of Individuals" in butterfly traps.
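
One possible way to score the "Zipf bins", offered as a sketch in Python rather than a recommendation: count how many different words occur exactly once, twice, and so on, pool everything at or above a cutoff into the last bin, and divide by the number of different words. The cutoff `max_bin` and the function name are assumptions for illustration.

```python
import re
from collections import Counter

def zipf_bins(text, max_bin=5):
    """Proportion of different words occurring exactly 1, 2, ..., max_bin - 1 times;
    the last bin collects words occurring max_bin or more times."""
    tokens = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(tokens)
    bins = [0] * max_bin
    for count in word_counts.values():
        bins[min(count, max_bin) - 1] += 1
    n_types = len(word_counts) or 1  # avoid division by zero on empty text
    return [b / n_types for b in bins]

bins = zipf_bins("a a a b b c d e", max_bin=3)
print(bins)
```

The result is a fixed-length vector of proportions, so it can sit alongside the other scores as input to a classifier.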

If you really want to go to town you can try things related to grammatical complexity, e.g. numbers of subordinate clauses per sentence, relative clauses, the "reach" of relative pronouns (how far from the referring pronoun is the thing referred to) and so on.

There's quite an extensive literature on this sort of thing, though it's not as fashionable as it used to be.

The real problem is that you can get carried away by "good ideas" of things to try!

The other factor to bear in mind is that if the Essays can be grouped by subject this is likely to influence many of the scores (such as the above).

Hoping this helps and does not distract!
Ted.



E-Mail: (Ted Harding) <Ted.Harding@nessie.mcc.ac.uk> Fax-to-email: +44 (0)870 094 0861
Date: 13-Jun-05                                       Time: 00:43:10
------------------------------ XFMail ------------------------------

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Mon Jun 13 10:09:07 2005
