[R] randomForests and Y-scrambling on a small synthetic dataset


From: clayton.springer@pharma.novartis.com
Date: Sat 08 May 2004 - 01:26:04 EST


Message-id: <OFAD8068AD.FEBBB0E6-ON85256E8D.004DC559-85256E8D.00547806@EU.novartis.net>


Dear r-help,

The following dataset (generated with perl) has 20 observations, split
evenly between two classes (10 each), of 100 predictor variables
(integers drawn uniformly from 1 to 9).
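
The perl script itself is not shown. As a rough sketch only, a dataset of the same shape could be built directly in R. This assumes 20 rows in total (10 per class), which is what the confusion matrix and the "out of 20" counts further down imply; the seed and the helper names n_obs and n_var are arbitrary choices made for this illustration.

```r
set.seed(1)    # arbitrary seed, only so the sketch is reproducible
n_obs <- 20    # assumed: 10 observations in each of the two classes
n_var <- 100   # number of predictor columns

# 100 columns of integers drawn uniformly from 1..9
x <- matrix(sample(1:9, n_obs * n_var, replace = TRUE), nrow = n_obs)

# class labels 1 and 2, ten of each, generated independently of the predictors
data <- data.frame(x, rep(1:2, each = n_obs / 2))

# V1..V101, the names read.table() would assign to a header-less file
names(data) <- paste0("V", 1:(n_var + 1))
```

By construction the class column V101 carries no information about V1..V100, so any apparent signal found later has to come from chance or from the evaluation protocol.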

First I show some work, and then ask two questions at the end.

> data <- read.table ("rf_input.dat")
> library (randomForest)
# if we do randomForest one time it looks like this:

> rf <- randomForest (factor(V101) ~. ,data=data)
> rf$confusion
  1 2 class.error
1 5 5 0.5
2 4 6 0.4

# now we do it 100 times

> tnum <- numeric()
> for (i in 1:100) { MT <- data$V101
+    MT.rf <- randomForest (factor(MT) ~ . ,data =data[-c(101)])
+    number <- as.integer (summary ( predict(MT.rf) == MT)[3] )
+    tnum <- c(tnum,number)
+ }

# and get this distribution of results (median 13 correct out of 20)
> quantile (tnum,probs = seq (0,1,0.1),na.rm = T)
  0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
   9 11 12 12 13 13 13 14 14 15 17

# now let's permute (re-randomize?) the classes and repeat 1000 times:

> library (gregmisc)
> tnum <- numeric()
> for (i in 1:1000) { MT <- permute (data$V101)
+    MT.rf <- randomForest (factor(MT) ~ . ,data =data[-c(101)])
+    number <- as.integer (summary ( predict(MT.rf) == MT)[3] )
+    tnum <- c(tnum,number)
+ }
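
The scrambling loop can also be written with base R only. The following is a minimal sketch, not the code used for the results in this post: y_scramble and oob_correct are hypothetical helper names, sample() stands in for permute(), and sum() replaces the summary(...)[3] idiom, which yields NA when no prediction is correct (presumably why na.rm = T is needed in the quantile() calls).

```r
# y: class labels; n_perm: number of scrambles;
# count_correct: function(y_perm) returning the number of correct predictions
y_scramble <- function(y, n_perm, count_correct) {
  vapply(seq_len(n_perm), function(i) {
    y_perm <- sample(y)                 # random permutation of the labels
    as.numeric(count_correct(y_perm))   # one count per scrambled label vector
  }, numeric(1))
}

# Possible use with the session above (needs the randomForest package):
# oob_correct <- function(y) {
#   fit <- randomForest(x = data[-101], y = factor(y))
#   sum(predict(fit) == y)   # predict() without newdata gives out-of-bag predictions
# }
# tnum <- y_scramble(data$V101, 1000, oob_correct)
```

Passing the scoring step in as a function keeps the permutation logic separate from the model, and vapply() avoids growing tnum inside the loop.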

# I get these results: the average is about 8 correct (out of 20), with 13
# correct at about the 95th percentile (i.e. roughly the 95% confidence level)

> quantile (tnum,probs = seq (0,1,0.1),na.rm = T)
  0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
   1 4 5 6 7 8 8 9 10 12 18
> quantile (tnum,probs = seq (0.9,1,0.01),na.rm = T)
 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 100%
  12 12 12 12 12 13 13 14 14 15 18
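
Where an observed count falls within the scrambled distribution can also be expressed as an empirical p-value. The sketch below assumes the common (1 + b) / (1 + n) convention, which is not something used in this post; perm_pvalue is a hypothetical helper, and the numbers in the example are toy values, not results from the session above.

```r
# One-sided empirical p-value: the share of scrambled runs that score at
# least as well as the observed run, with the usual +1 correction
perm_pvalue <- function(observed, null_counts) {
  null_counts <- null_counts[!is.na(null_counts)]   # drop failed runs, if any
  (1 + sum(null_counts >= observed)) / (1 + length(null_counts))
}

# Toy illustration only (made-up counts)
toy_null <- c(4, 6, 7, 8, 8, 9, 10, 12, 13)
p <- perm_pvalue(13, toy_null)   # (1 + 1) / (1 + 9)
```

With the session's own objects the call would be perm_pvalue(13, tnum). Note that the unscrambled count is itself variable (9 to 17 over the 100 repeated fits above), so comparing a single value such as 13 against the scrambled distribution ignores that spread.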

--------

My two questions:

Question 1: Naively I would have expected to get 10/20 for the Y-scrambled
examples, but instead I got 8/20. Why is that?
(Presumably it has something to do with randomForest training each tree on
only about 2/3 of the examples.)

Question 2: With my Y-scrambling exercise I seem to have demonstrated that
the original dataset is not random. Yet it is random by construction. Is
this just a fluke, or is something wrong with my protocol?

thanks in advance,

Clayton



______________________________________________
R-help@stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


