Re: [R] memory problems when combining randomForests

From: Eleni Rapsomaniki <e.rapsomaniki_at_mail.cryst.bbk.ac.uk>
Date: Sat 29 Jul 2006 - 23:14:55 EST


Hello again,

The reason I thought the order in which rows are passed to randomForest affects the error rate is that I get different results for different ways of splitting my positive/negative data.

First get the data (attached with this email):

pos.df = read.table("C:/Program Files/R/rw2011/pos.df", header = TRUE)
neg.df = read.table("C:/Program Files/R/rw2011/neg.df", header = TRUE)
library(randomForest)

# The first 2 columns are explanatory variables (which incidentally are not
# discriminative at all if one looks at their distributions); the 3rd is the
# class (pos or neg).
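
Just to show what I mean about the columns, a quick look at the data (these checks are only for illustration):

str(pos.df)           # 2 explanatory columns plus the "class" column
table(pos.df$class)   # pos.df should contain only the "pos" class
table(neg.df$class)   # neg.df should contain only the "neg" class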

train2test.ratio=8/10
min_len=min(nrow(pos.df), nrow(neg.df))
class_index = which(names(pos.df) == "class")   # is the same for neg.df
train_size = as.integer(min_len * train2test.ratio)

############ Way 1

train.indicesP = sample(seq_len(nrow(pos.df)), size = train_size, replace = FALSE)
train.indicesN = sample(seq_len(nrow(neg.df)), size = train_size, replace = FALSE)

trainP=pos.df[train.indicesP,]
trainN=neg.df[train.indicesN,]
testP=pos.df[-train.indicesP,]
testN=neg.df[-train.indicesN,]

mydata.rf <- randomForest(x = rbind(trainP, trainN)[, -class_index],
                          y = rbind(trainP, trainN)[, class_index],
                          xtest = rbind(testP, testN)[, -class_index],
                          ytest = rbind(testP, testN)[, class_index],
                          importance = TRUE, proximity = FALSE,
                          keep.forest = FALSE)
mydata.rf$test$confusion

############## Way 2

ind <- sample(2, min_len, replace = TRUE,
              prob = c(train2test.ratio, 1 - train2test.ratio))

trainP=pos.df[ind==1,]
trainN=neg.df[ind==1,]
testP=pos.df[ind==2,]
testN=neg.df[ind==2,]

mydata.rf <- randomForest(x = rbind(trainP, trainN)[, -class_index],
                          y = rbind(trainP, trainN)[, class_index],
                          xtest = rbind(testP, testN)[, -class_index],
                          ytest = rbind(testP, testN)[, class_index],
                          importance = TRUE, proximity = FALSE,
                          keep.forest = FALSE)
mydata.rf$test$confusion

########### Way 3

subset_start=1
subset_end=subset_start+train_size
train_index = subset_start:subset_end

trainP=pos.df[train_index,]
trainN=neg.df[train_index,]
testP=pos.df[-train_index,]
testN=neg.df[-train_index,]

mydata.rf <- randomForest(x = rbind(trainP, trainN)[, -class_index],
                          y = rbind(trainP, trainN)[, class_index],
                          xtest = rbind(testP, testN)[, -class_index],
                          ytest = rbind(testP, testN)[, class_index],
                          importance = TRUE, proximity = FALSE,
                          keep.forest = FALSE)
mydata.rf$test$confusion

########### end

The first two methods give me an abnormally low error rate (compared to what I get using the same data with a naiveBayes method), while the last one seems more realistic, and the difference in error rates is very significant. I need the last method so that I can cross-validate subsets of my data sequentially (the first two methods draw random rows from the whole length of the data), unless there is a better way to do it (?). Something must be very different between the first two methods and the last, but which one is correct?
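
To make the sequential version concrete, this is roughly what I have in mind (only a sketch, nothing I have tested; the number of folds k below is arbitrary): split each class into k contiguous blocks, hold one block out at a time, and average the test error.

k <- 5   # arbitrary number of folds, just for illustration

# contiguous (sequential) blocks of roughly equal size within each class
foldsP <- cut(seq_len(nrow(pos.df)), breaks = k, labels = FALSE)
foldsN <- cut(seq_len(nrow(neg.df)), breaks = k, labels = FALSE)

err <- numeric(k)
for (i in 1:k) {
  train <- rbind(pos.df[foldsP != i, ], neg.df[foldsN != i, ])
  test  <- rbind(pos.df[foldsP == i, ], neg.df[foldsN == i, ])
  fit <- randomForest(x = train[, -class_index], y = train[, class_index],
                      xtest = test[, -class_index], ytest = test[, class_index],
                      keep.forest = FALSE)
  conf <- fit$test$confusion                               # class counts + class.error
  err[i] <- 1 - sum(diag(conf[, 1:2])) / sum(conf[, 1:2])  # misclassification rate on the held-out block
}
mean(err)   # average test error over the k folds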

I would greatly appreciate any suggestions on this!

Many Thanks
Eleni Rapsomaniki



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
