From: Joshua Knowles <j.knowles_at_manchester.ac.uk>

Date: Wed, 19 Dec 2007 11:43:59 +0000

I have observed that when using the randomForest package to do regression, the predicted values of the dependent variable given by a trained forest are not centred and have the wrong slope when plotted against the true values.

This means that the R^2 value obtained by squaring the Pearson correlation is higher than the one obtained by computing the coefficient of determination directly. The squared-Pearson value can, however, be reproduced exactly by the coefficient of determination if the predicted values are first linearly transformed (using lm() to find the required intercept and slope).
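A minimal sketch of this relationship on made-up numbers, with no randomForest involved. The vectors and the helper r2() are illustrative assumptions, and r2() uses the textbook sum-of-squares definition (1 - SSE/SST) rather than a ratio of var() calls. Predictions that are shrunk toward the mean give a lower direct R^2 than the squared Pearson correlation, and the two agree once the predictions are passed through lm():

```r
# Hypothetical data: predictions shrunk toward the mean of the true values,
# plus a small perturbation (the perturbation sums to zero).
actual <- c(10, 12, 15, 18, 20, 22, 25, 30)
noise  <- c(0.4, -0.3, 0.2, -0.1, 0.3, -0.4, 0.1, -0.2)
pred   <- mean(actual) + 0.5 * (actual - mean(actual)) + noise

# Coefficient of determination: 1 - SSE/SST (illustrative helper)
r2 <- function(obs, est) {
  1 - sum((obs - est)^2) / sum((obs - mean(obs))^2)
}

r2_pearson <- cor(pred, actual)^2      # squared Pearson correlation
r2_direct  <- r2(actual, pred)         # raw predictions against the truth

# Linearly recalibrate the predictions, then recompute R^2
fit           <- lm(actual ~ pred)
r2_calibrated <- r2(actual, fitted(fit))

cat("Pearson squared             =", r2_pearson, "\n")
cat("Coeff of determination      =", r2_direct, "\n")
cat("Coeff of det. after lm()    =", r2_calibrated, "\n")
```

The recalibrated value matches the squared correlation because, for a simple linear regression with an intercept, the R^2 of the fit equals the squared correlation between the response and the single predictor. The direct value can never exceed it, since the least-squares line minimises the residual sum of squares over all straight lines, including the identity line.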

Does anyone know why randomForest behaves in this way, producing offset predictions? Does anyone know a fix for the problem?

(By the way, the effect persists even if the dependent variable is first standardised to zero mean and unit variance.)

As an example, here is some simple R code that uses the available swiss dataset to show the effect I am observing.

Thanks for any help.

--Received on Wed 19 Dec 2007 - 11:47:36 GMT

#### EXAMPLE OF RANDOM FOREST REGRESSION

library(randomForest)
data(swiss)
swiss

#Build the random forest to predict Infant Mortality

rf.rf<-randomForest(Infant.Mortality ~ ., data=swiss)

#And predict the training set again

pred<-c(predict(rf.rf,swiss))
actual<-swiss$Infant.Mortality

#Plotting predicted against actual values shows the effect (uncomment to see this)

#plot(pred,actual)

#abline(0,1)

# calculate R^2 as pearson coefficient squared

R2one<-cor(pred,actual)^2

# calculate R^2 value as fraction of variance explained

residOpt<-(actual-pred)
residnone<-(actual-mean(actual))
R2two<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)

# now fit a line through the predicted and true values and

# use this to normalize the data before calculating R^2

fit<-lm(actual ~ pred)
coef(fit)
pred2<-pred*coef(fit)[2]+coef(fit)[1]
residOpt<-(actual-pred2)
R2three<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
cat("Pearson squared = ",R2one,"\n")
cat("Coeff of determination = ", R2two, "\n")
cat("Coeff of determination after linear fitting = ", R2three, "\n")

## END

--
Joshua Knowles .. j.knowles_at_manchester.ac.uk
BBSRC David Phillips Fellow
School of Computer Science
The University of Manchester
http://dbkgroup.org/knowles/

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
