From: Eleni Rapsomaniki <e.rapsomaniki_at_mail.cryst.bbk.ac.uk>

Date: Sat 29 Jul 2006 - 23:14:55 EST

R-help@stat.math.ethz.ch mailing list: https://stat.ethz.ch/mailman/listinfo/r-help
Received on Sat Jul 29 23:23:24 2006

Hello again,

The reason I thought the order in which rows are passed to randomForest affects the error rate is that I get different results depending on how I split my positive/negative data.

# First get the data (attached with this email)
pos.df=read.table("C:/Program Files/R/rw2011/pos.df", header=T)
neg.df=read.table("C:/Program Files/R/rw2011/neg.df", header=T)
library(randomForest)

# The first 2 columns are explanatory variables (which incidentally are not
# discriminative at all if one looks at their distributions); the 3rd is the
# class (pos or neg)

train2test.ratio=8/10

min_len=min(nrow(pos.df), nrow(neg.df))

class_index=which(names(pos.df)=="class") #is the same for neg.df
train_size=as.integer(min_len*train2test.ratio)

############ Way 1

train.indicesP=sample(seq_len(nrow(pos.df)), size=train_size, replace=FALSE)
train.indicesN=sample(seq_len(nrow(neg.df)), size=train_size, replace=FALSE)

trainP=pos.df[train.indicesP,]
trainN=neg.df[train.indicesN,]
testP=pos.df[-train.indicesP,]
testN=neg.df[-train.indicesN,]

mydata.rf <- randomForest(x=rbind(trainP, trainN)[,-class_index],
                          y=rbind(trainP, trainN)[,class_index],
                          xtest=rbind(testP, testN)[,-class_index],
                          ytest=rbind(testP, testN)[,class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion

############## Way 2

ind <- sample(2, min(nrow(pos.df), nrow(neg.df)), replace=TRUE,
              prob=c(train2test.ratio, 1-train2test.ratio))

trainP=pos.df[ind==1,]
trainN=neg.df[ind==1,]
testP=pos.df[ind==2,]
testN=neg.df[ind==2,]

mydata.rf <- randomForest(x=rbind(trainP, trainN)[,-class_index],
                          y=rbind(trainP, trainN)[,class_index],
                          xtest=rbind(testP, testN)[,-class_index],
                          ytest=rbind(testP, testN)[,class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion

########### Way 3

subset_start=1
subset_end=subset_start+train_size-1

# Note: seq(subset_start:subset_end) collapses to 1:length(subset_start:subset_end),
# which is not the intended range; use the : operator directly
train_index=subset_start:subset_end

trainP=pos.df[train_index,]
trainN=neg.df[train_index,]
testP=pos.df[-train_index,]
testN=neg.df[-train_index,]

mydata.rf <- randomForest(x=rbind(trainP, trainN)[,-class_index],
                          y=rbind(trainP, trainN)[,class_index],
                          xtest=rbind(testP, testN)[,-class_index],
                          ytest=rbind(testP, testN)[,class_index],
                          importance=TRUE, proximity=FALSE, keep.forest=FALSE)
mydata.rf$test$confusion

########### end

The first two methods give me an abnormally low error rate (compared with what I get from a naiveBayes method on the same data), while the last one seems more realistic; the difference in error rates is very large. I need to use the last method to cross-validate subsets of my data sequentially (the first two methods draw random rows from throughout the data), unless there is a better way to do it (?). Something must be fundamentally different between the first two methods and the last, but which one is correct?
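To make the question concrete, here is the kind of randomised, stratified k-fold split I am considering as an alternative to the sequential blocks (a sketch only, untested; the fold count, seed, and error extraction are my assumptions, building on pos.df, neg.df and class_index as defined above):

```r
library(randomForest)

# Sketch: stratified k-fold cross-validation with randomised fold labels,
# so every fold mixes rows from the whole data set instead of one contiguous block.
k <- 5
set.seed(1)  # arbitrary seed, for reproducibility only
foldsP <- sample(rep(1:k, length.out=nrow(pos.df)))  # one random fold label per positive row
foldsN <- sample(rep(1:k, length.out=nrow(neg.df)))  # likewise for negatives

err <- numeric(k)
for (i in 1:k) {
  train <- rbind(pos.df[foldsP != i,], neg.df[foldsN != i,])
  test  <- rbind(pos.df[foldsP == i,], neg.df[foldsN == i,])
  fit <- randomForest(x=train[,-class_index], y=train[,class_index],
                      xtest=test[,-class_index], ytest=test[,class_index],
                      keep.forest=FALSE)
  # test error at the final number of trees
  err[i] <- fit$test$err.rate[nrow(fit$test$err.rate), "Test"]
}
mean(err)  # average held-out error across the k folds
```

If the data files happen to be sorted (e.g. by some property correlated with the predictors), a contiguous block as in Way 3 would train and test on systematically different rows, which could by itself explain the gap in error rates.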

I would greatly appreciate any suggestions on this!

Many Thanks

Eleni Rapsomaniki


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.1.8, at Sun 30 Jul 2006 - 02:16:58 EST.
