**From:** *clayton.springer@pharma.novartis.com*

**Date:** Sat 08 May 2004 - 01:26:04 EST

Message-id: <OFAD8068AD.FEBBB0E6-ON85256E8D.004DC559-85256E8D.00547806@EU.novartis.net>

Dear r-help,

The following dataset (generated with perl) has 20 observations (10 in each of two classes) of 100 predictor variables (integers drawn uniformly from 1 to 9).
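For readers without the perl-generated file, a comparable dataset can be simulated directly in R. This is only a sketch under assumptions: 20 rows (10 per class), the class label stored in a column named V101, and an arbitrary seed. It will not reproduce the exact numbers in this post.

```r
# Sketch: simulate a dataset like the one described (not the original perl output).
set.seed(1)    # arbitrary seed, assumed here only for reproducibility
n_obs  <- 20   # assumed: 10 observations per class
n_pred <- 100

# integers drawn uniformly from 1..9, one row per observation
x <- matrix(sample(1:9, n_obs * n_pred, replace = TRUE), nrow = n_obs)

sim <- as.data.frame(x)                  # columns are named V1 ... V100
sim$V101 <- rep(1:2, each = n_obs / 2)   # class labels, unrelated to the predictors

dim(sim)
table(sim$V101)
```

Writing `sim` out with `write.table()` and reading it back would mimic the `read.table()` step, but the data frame can also be used directly.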

First I show some work, and then ask two questions at the end.

> data <- read.table("rf_input.dat")

> library(randomForest)

# if we do randomForest one time it looks like this:

> rf <- randomForest(factor(V101) ~ ., data = data)

> rf$confusion

  1 2 class.error
1 5 5         0.5
2 4 6         0.4
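As a quick check on the arithmetic, the overall out-of-bag accuracy can be read off the confusion matrix, since the diagonal holds the correct predictions. The sketch below re-enters the printed counts by hand (the object name `conf` is made up for illustration) rather than using `rf$confusion`, which also carries the `class.error` column.

```r
# Counts copied from the confusion matrix printed above (rows: true class).
conf <- matrix(c(5, 4, 5, 6), nrow = 2,
               dimnames = list(true = c("1", "2"), predicted = c("1", "2")))

n_correct   <- sum(diag(conf))               # correctly classified observations
accuracy    <- n_correct / sum(conf)         # overall out-of-bag accuracy
class_error <- 1 - diag(conf) / rowSums(conf)  # per-class error, as in class.error

n_correct
accuracy
class_error
```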

# now we do it 100 times

tnum <- numeric(100)
for (i in 1:100) {
  MT <- data$V101
  MT.rf <- randomForest(factor(MT) ~ ., data = data[-101])
  tnum[i] <- sum(predict(MT.rf) == MT)  # out-of-bag predictions matching the true class
}

# and this distribution of results (about 13 correct out of 20)

> quantile(tnum, probs = seq(0, 1, 0.1), na.rm = TRUE)

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   9   11   12   12   13   13   13   14   14   15   17

# now let's permute (re-randomize?) the classes and repeat 1000 times:

> library(gregmisc)

tnum <- numeric(1000)
for (i in 1:1000) {
  MT <- permute(data$V101)  # randomly shuffle the class labels
  MT.rf <- randomForest(factor(MT) ~ ., data = data[-101])
  tnum[i] <- sum(predict(MT.rf) == MT)  # out-of-bag predictions matching the shuffled labels
}

# I get these results: the median is about 8 correct (out of 20), with 13 correct falling at about the 95th percentile

> quantile(tnum, probs = seq(0, 1, 0.1), na.rm = TRUE)

  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   1    4    5    6    7    8    8    9   10   12   18

> quantile(tnum, probs = seq(0.9, 1, 0.01), na.rm = TRUE)

 90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100%
  12   12   12   12   12   13   13   14   14   15   18
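The comparison above can be condensed into an empirical p-value: the share of permuted runs that do at least as well as the unpermuted result. A minimal sketch, assuming the permuted counts are kept in their own vector (the names `perm_pvalue`, `observed`, and `null_counts` are hypothetical; the loops above reuse `tnum` for both runs). The toy vector is made up and is not the data from this post.

```r
# Empirical one-sided p-value with the common +1 correction,
# so the estimate is never exactly zero.
perm_pvalue <- function(observed, null_counts) {
  (sum(null_counts >= observed) + 1) / (length(null_counts) + 1)
}

null_counts <- c(6, 8, 8, 9, 7, 13, 10, 5, 12)  # made-up toy values
perm_pvalue(13, null_counts)
```

Because the unpermuted result itself varies from run to run (9 to 17 correct above), it may be more informative to compare a typical value such as its median than a single run.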

--------

My two questions:

Question 1: Naively I might have expected to get 10/20 for the Y-scrambled examples, but instead I got 8/20. Why is that? (Presumably it has something to do with randomForest training each tree on only about 2/3 of the examples.)
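The "2/3" figure can be checked for this sample size. With its default settings randomForest draws a bootstrap sample of the rows for each tree, so the expected share of distinct rows a tree sees is 1 - (1 - 1/n)^n. The sketch below computes this for n = 20 and compares it with a small simulation; the seed and the number of replicates are arbitrary choices, and it only illustrates the in-bag fraction, not an answer to the question.

```r
n <- 20

# expected fraction of distinct rows in a bootstrap sample of size n
expected_inbag <- 1 - (1 - 1 / n)^n   # about 0.64 for n = 20

# simulation: draw n rows with replacement and count the distinct ones
set.seed(42)  # arbitrary seed
simulated_inbag <- mean(replicate(
  5000,
  length(unique(sample(n, replace = TRUE))) / n
))

c(expected = expected_inbag, simulated = simulated_inbag)
```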

Question 2: With my Y-scrambling exercise I seem to have demonstrated that the original dataset was not random. Yet it is random by construction. Is this just a fluke, or is something wrong with my protocol?
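One rough way to judge the "fluke" possibility is to ask how often pure guessing would reach 13 or more correct out of 20. The sketch below assumes each prediction is an independent coin flip with success probability 0.5, which is only an approximation: out-of-bag predictions are not independent, and the permuted runs above centre near 8 rather than 10.

```r
# P(X >= 13) for X ~ Binomial(20, 0.5), i.e. the upper tail beyond 12
p_chance <- pbinom(12, size = 20, prob = 0.5, lower.tail = FALSE)

# the same probability by summing the individual point masses
p_chance_check <- sum(dbinom(13:20, size = 20, prob = 0.5))

c(p_chance, p_chance_check)
```

Under this simplified model, 13 of 20 is not especially rare, so a single dataset scoring that well would not by itself be strong evidence of structure.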

thanks in advance,

Clayton

