Re: [R] RandomForest vs. bayes & svm classification performance

From: Jameson C. Burt
Date: Fri 28 Jul 2006 - 05:22:35 EST

Remiss of me, I haven't tried these R tools. However, I have tried a dozen Naive Bayes-like programs of the kind often used to filter email, where the serious problem of spam has resulted in many innovations.
The most touted of the worldwide Naive Bayes programs seems to be CRM114 (not in R, I expect, since its programming is peculiar), which comes with 275 pages of documentation. However, unless you have several weeks and some flexible programming skills, don't consider it.
It took me about 3 months to find that CRM114 worked best, then another month to break thru its author's documentation and control the program from a single Perl program with no external parameter files. CRM114 can form features from groups of words rather than single words, taking combinations of the words within every run of 5 consecutive words in a document.
Using 5-word groups produced better results than any of the filters I tried; e.g., filtering/altering car manufacturers' standard form prompts like

   Fire? Yes_ No_
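
To make the 5-word grouping concrete, here is a minimal Python sketch of sliding-window word-group features. It is only an illustration of the idea, not CRM114's own code: the function name window_features and the "<skip>" marker are my assumptions, and a real filter would typically hash such features rather than keep them as strings.

```python
from itertools import combinations

def window_features(text, window=5):
    """Build word-group features from every run of `window` consecutive words.

    For each position, the newest word is always kept, and each earlier word
    in the window is either kept or replaced by a "<skip>" placeholder
    (the placeholder is an assumption made for this sketch).
    """
    words = text.split()
    features = []
    for end in range(len(words)):
        start = max(0, end - window + 1)
        earlier = words[start:end]
        newest = words[end]
        # Choose which of the earlier words to keep; the rest become skips.
        for k in range(len(earlier) + 1):
            for keep in combinations(range(len(earlier)), k):
                parts = [w if i in keep else "<skip>"
                         for i, w in enumerate(earlier)]
                features.append(" ".join(parts + [newest]))
    return features

if __name__ == "__main__":
    for feature in window_features("Fire? Yes_ No_"):
        print(feature)
```

Each full 5-word window yields 16 features (the newest word plus every keep/skip choice for the 4 words before it), so a prompt like the one above becomes a handful of distinctive features instead of three common words.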

Initially, I expected correct results of 99% or better, as with my use of Naive Bayes to filter my email. However, spam email must accomplish some goal (get you to go to their webpage or see their low cost), so Naive Bayes approaches work very well on email; the reports described below turned out to be harder.

The U.S. Department of Transportation (DOT), defects investigation, contracted with me to try what I'd successfully used for email (others' programs). They were accumulating 50,000 early warning reports a quarter, yet their engineers had read only 3,000. DOT contracted for a dozen people to slog thru the accumulated 300,000 reports, identifying those that might portend the necessity of a recall. But these contractors (probably costing $1 million a year) agreed with the engineers no more than 50% of the time.

After 2 months, I was able to correctly identify only 30% of reports. Then I read that Naive Bayes was, after all, "naive". It presumed independence between words.
There's an old statistical saying,
  "Do you prefer to perfectly solve the wrong problem,
   or wrongly solve the correct problem?"
People using Naive Bayes use many heuristics, as the CRM114 documents mention, including,

a. TOE, "Train on Error"
   for which you retrain any document that Naive Bayes classifies
   incorrectly.
   Statistically, this is somewhat like having a learned population
   with more than one of the same document.

b. SSTTT, "Single Sided Thick Threshold Training"
   for which you retrain a document when it doesn't identify correctly
   with a sufficiently high probability.

c. TUNE, "Train Until No Error"
   for which you recycle thru your known records until you
   reach perfection, although I often forced a stop when no improvement
   resulted after 12 cycles.

All these techniques improved correct identification and concentration (proportion of "flagged" reports that are correctly flagged) to about 67%.
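
As a rough illustration of how TOE and TUNE fit together, here is a minimal Python sketch. It is not CRM114 and not the Perl driver I used: the TinyNaiveBayes class, the train_until_no_error function and the toy reports are all hypothetical, and the hard cap of 12 cycles is a simplification of the stopping rule described above.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words (add-one smoothing)."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.word_counts = {label: Counter() for label in self.labels}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = self.labels[0], -math.inf
        for label in self.labels:
            # Smoothed log prior plus a sum of per-word log likelihoods;
            # summing per word is where independence between words is assumed.
            score = math.log((self.doc_counts[label] + 1)
                             / (total_docs + len(self.labels)))
            denom = sum(self.word_counts[label].values()) + len(self.vocab) + 1
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_until_no_error(model, examples, max_cycles=12):
    """TUNE around TOE: retrain only misclassified documents until a clean pass.

    Returns the number of passes made over the examples.
    """
    for cycle in range(1, max_cycles + 1):
        errors = 0
        for text, label in examples:
            if model.predict(text) != label:
                model.train(text, label)  # TOE: train on error only
                errors += 1
        if errors == 0:
            return cycle
    return max_cycles


if __name__ == "__main__":
    # Toy, made-up reports purely for illustration.
    reports = [
        ("seat caught fire", "flag"),
        ("smoke and fire under hood", "flag"),
        ("radio knob loose", "ok"),
        ("paint chipped on door", "ok"),
    ]
    model = TinyNaiveBayes(["flag", "ok"])
    print("passes:", train_until_no_error(model, reports))
    print(model.predict("fire under seat"))
```

SSTTT would change only the test inside the loop: instead of retraining when the predicted label is wrong, you would also retrain when the winning label's margin over the alternative falls below a chosen threshold.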

Then the engineers (gearheads) did the inexplicable -- they read about 20,000 reports, which jumped the correctness of the CRM114 Naive Bayes approach with the above heuristics to about 88%. Suddenly, CRM114 Naive Bayes "flagged" reports were fun to read. For example, a report no one had yet identified described a fellow's vehicle modified with airbags to lift it high off the ground, using canisters of air in the back of his pickup. Driving down the road, he noticed a warning light flashing on his air supply.
Soon afterwards, the passenger seat caught fire. Even though his pickup was moving down the road, the flashing warning light and flaming passenger seat prompted him to open his driver's door and leap from his moving pickup.

While I worked the Bayesian approach and contractors read reports, as two approaches to slog thru 300,000 reports, big software/contractor companies hovered over the spending and potential spending.
But their approaches were all judged foolish -- expensively foolish.

So, if you really have a problem worthy of solving well, some time, and some programming skills,
you can integrate a Naive Bayes procedure with some heuristic procedures, and probably obtain good correct identification and a high concentration of correctly "flagged" documents among the Bayes-flagged documents.
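
For anyone wanting to track the two figures I quote, here is a small Python sketch. "Concentration" is as defined above (the proportion of flagged reports that are correctly flagged); treating "correct identification" as the proportion of truly flag-worthy reports that get flagged is an assumption made for this sketch, and the function name flag_metrics is hypothetical.

```python
def flag_metrics(predicted, actual, positive="flag"):
    """Return (identification, concentration) for paired label lists.

    concentration:  share of flagged reports that truly deserve the flag.
    identification: share of flag-worthy reports that were flagged
                    (an assumed reading of "correct identification").
    """
    pairs = list(zip(predicted, actual))
    flagged = [a for p, a in pairs if p == positive]
    worthy = [p for p, a in pairs if a == positive]
    concentration = (sum(a == positive for a in flagged) / len(flagged)
                     if flagged else 0.0)
    identification = (sum(p == positive for p in worthy) / len(worthy)
                      if worthy else 0.0)
    return identification, concentration


if __name__ == "__main__":
    # Made-up labels purely for illustration.
    predicted = ["flag", "flag", "flag", "ok"]
    actual = ["flag", "ok", "ok", "flag"]
    identification, concentration = flag_metrics(predicted, actual)
    print(f"identification={identification:.2f} concentration={concentration:.2f}")
```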

On Mon, Jul 24, 2006 at 06:59:31PM +0100, Eleni Rapsomaniki wrote:
> Hi
> This is a question regarding classification performance using different methods.
> So far I've tried NaiveBayes (klaR package), svm (e1071 package) and
> randomForest (randomForest). What has puzzled me is that randomForest seems to
> perform far better (32% classification error) than svm and NaiveBayes, which
> have similar classification errors (45%, 48% respectively). A similar
> difference in performance is observed with different combinations of
> parameters, priors and size of training data.
> Because I was expecting to see little difference in the performance of these
> methods I am worried that I may have made a mistake in my randomForest call:
> my.rf=randomForest(x=train.df[,-response_index], y=train.df[,response_index],
> xtest=test.df[,-response_index], ytest=test.df[,response_index],
> importance=TRUE,proximity=FALSE, keep.forest=FALSE)
> (where train.df and test.df are my train and test data.frames and response_index
> is the column number specifying the class)
> My main question is: could there be a legitimate reason why random forest would
> outperform the other two models (e.g. maybe one
> method is more reliable with Gaussian data, handles categorical data
> better etc)? Also, is there a way of evaluating the predictive ability of each
> parameter in the bayesian model as it can be done for random Forests (through
> the importance table)?
> I would appreciate any of your comments and suggestions on these.
> Many thanks
> Eleni Rapsomaniki
> ______________________________________________
> mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.

Jameson C. Burt, NJ9L   Fairfax, Virginia, USA
(202) 690-0380 (work)  magic "mysterious and awe-inspiring even though
                  we know they are real and not supernatural"

Received on Fri Jul 28 05:30:01 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 28 Jul 2006 - 06:17:24 EST.
