# [R] randomForests and Y-scrambling on a small synthetic dataset

From: clayton.springer@pharma.novartis.com
Date: Sat 08 May 2004 - 01:26:04 EST

Dear r-help,

The following dataset (generated with perl) has 20 observations (10 in each
of two classes) of 100 independent variables (integers drawn uniformly
from [1:9]).
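
For readers without the perl script, a rough R stand-in for such a dataset might look like the sketch below. It is not the original generator: the seed is arbitrary, the 20 rows are taken from the confusion matrix totals further down, and the column names V1..V101 simply follow the `V101` used in the transcript.

```r
# Sketch of a comparable synthetic dataset (assumed layout, not the perl original)
set.seed(42)                      # arbitrary seed, only for reproducibility
n_obs  <- 20                      # 10 per class, per the confusion matrix totals
n_pred <- 100                     # number of predictor columns

# Predictors: integers drawn uniformly from 1..9, unrelated to the class
data <- as.data.frame(
  matrix(sample(1:9, n_obs * n_pred, replace = TRUE), nrow = n_obs)
)

# Class label in column V101, split evenly between classes 1 and 2
data$V101 <- rep(1:2, each = n_obs / 2)
```

Since the predictors are drawn without reference to `V101`, any apparent predictive skill on this data is chance.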

First I show some work, and then ask two questions at the end.

> library(randomForest)
# if we run randomForest once it looks like this:

> rf <- randomForest(factor(V101) ~ ., data = data)
> rf$confusion
  1 2 class.error
1 5 5         0.5
2 4 6         0.4

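# now we do it 100 times

> tnum <- numeric(100)
> for (i in 1:100) {
+   MT <- data$V101
+   MT.rf <- randomForest(factor(MT) ~ ., data = data[-101])
+   # predict() with no newdata returns the out-of-bag predictions;
+   # sum() counts the matches directly, whereas summary(...)[3] is NA
+   # whenever the comparison is all TRUE or all FALSE
+   tnum[i] <- sum(predict(MT.rf) == MT)
+ }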

# and this distribution of results (about 13 correct out of 20)
> quantile (tnum,probs = seq (0,1,0.1),na.rm = T)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   9   11   12   12   13   13   13   14   14   15   17

# now let's permute (re-randomize?) the classes and repeat 1000 times:

> library(gregmisc)
> tnum <- numeric(1000)
> for (i in 1:1000) {
+   MT <- permute(data$V101)   # Y-scrambling: shuffle the class labels
+   MT.rf <- randomForest(factor(MT) ~ ., data = data[-101])
+   tnum[i] <- sum(predict(MT.rf) == MT)   # count of out-of-bag matches
+ }
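
If the gregmisc bundle is not available, the scrambling step itself can probably be done with base R alone. The sketch below assumes that `permute()` does nothing more than shuffle its argument without replacement, which is what `sample()` does for a vector of length greater than one. The `labels` vector is a hypothetical stand-in for `data$V101`.

```r
# Y-scrambling with base R: sample() on a vector returns a random permutation
labels <- rep(1:2, each = 10)   # hypothetical stand-in for data$V101
set.seed(7)                     # arbitrary seed, only for reproducibility
scrambled <- sample(labels)     # same labels, random order

# The class balance is unchanged; only the pairing with the rows is broken
table(scrambled)
```

Because a shuffle only reorders the labels, each scrambled run still has 10 observations per class.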

# I get these results: the average is about 8 correct (out of 20), with 13 at
# the 95th percentile

> quantile (tnum,probs = seq (0,1,0.1),na.rm = T)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   1    4    5    6    7    8    8    9   10   12   18
> quantile (tnum,probs = seq (0.9,1,0.01),na.rm = T)
 90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100%
  12   12   12   12   12   13   13   14   14   15   18
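
One way to turn the scrambled distribution into a single number is an empirical p-value for an observed score. The helper below is a sketch with a hypothetical name (`perm_p_value`). The toy input is only the eleven decile values printed above, not the 1000 raw counts, so its result illustrates the arithmetic and nothing more.

```r
# Empirical one-sided p-value: the share of scrambled runs scoring at least
# as well as the observed run. The +1 terms count the observed run itself,
# so the estimate can never be exactly zero.
perm_p_value <- function(observed, null_counts) {
  (1 + sum(null_counts >= observed)) / (1 + length(null_counts))
}

# Toy input: the decile values from the first quantile() call, NOT real data
toy_null <- c(1, 4, 5, 6, 7, 8, 8, 9, 10, 12, 18)
p_toy <- perm_p_value(13, toy_null)
```

With the real results one would call `perm_p_value(13, tnum)` on the vector from the scrambled loop, using whichever unscrambled score is being tested in place of 13.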

--------

My two questions:

Question 1: Naively I might have expected to get 10/20 for the Y-scrambled
examples, but instead I got 8/20. Why is that?
(Presumably it has something to do with randomForest training on only about
2/3 of the examples.)
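
One candidate mechanism can be sketched without randomForest at all. When an observation is out of bag, its tree was trained on the other 19 rows, of which only 9 share its class and 10 belong to the other class. A learner with no real signal then leans toward the wrong class. The toy below is an assumption-laden stand-in, not randomForest: each "tree" ignores the predictors and votes for the majority class of its bootstrap sample. The function name `oob_correct` is hypothetical.

```r
# Toy stand-in for a forest: each "tree" ignores the predictors and votes
# for the majority class of its bootstrap sample (ties broken at random).
oob_correct <- function(y, ntree = 500) {
  n <- length(y)
  classes <- sort(unique(y))
  votes <- matrix(0, nrow = n, ncol = length(classes))
  for (t in seq_len(ntree)) {
    inbag  <- sample.int(n, n, replace = TRUE)          # bootstrap sample
    counts <- tabulate(match(y[inbag], classes), nbins = length(classes))
    top    <- which(counts == max(counts))
    winner <- top[sample.int(length(top), 1)]           # random tie-break
    oob    <- setdiff(seq_len(n), inbag)                # rows left out of the bag
    votes[oob, winner] <- votes[oob, winner] + 1        # vote only on OOB rows
  }
  pred <- classes[max.col(votes, ties.method = "random")]
  sum(pred == y)                                        # OOB matches out of n
}

set.seed(1)                               # arbitrary seed
y <- rep(1:2, each = 10)                  # balanced labels, as in the post
null_counts <- replicate(50, oob_correct(y))
mean(null_counts)                         # average OOB matches out of 20
```

This toy exaggerates the effect because it has nothing but class frequencies to go on. It only suggests why an out-of-bag score on small balanced classes might fall below 10/20, and does not account for the specific value of 8.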

Question 2: With my Y-scrambling exercise I seem to have demonstrated that
the original dataset was not random. Yet it is random by construction.
Is this just a fluke, or is something wrong with my protocol?