[R] logistic model cross validation resolved

From: Trevor Wiens <twiens_at_interbaun.com>
Date: Fri 18 Mar 2005 - 15:59:44 EST

This post is NOT a question, but an answer. For readers please disregard all earlier posts by myself about this question.

I'm posting for two reasons. First to say thanks, especially to Dimitris, for suggesting the use of errorest in the ipred library. Second, so that the solution to this problem is in the archives in case it gets asked again.  

If one wants to run a k-fold cross-validation using specified folds, and get misclassification error and root mean squared error this is what you do.

Below is a script that will do this returning the results of two errorest results combined into a data frame. Now this script assumes that the variable you are going to fold with is an integer. If you want to have the predicted values or models returned from each fold, the calls to errorest, can be modified. Please see the help page for control.errorest on details.


Trevor Wiens 

myglm <- function(formula, family, data){
ret <- glm(formula, family=binomial, data=data)

myfacpred <- function(object, newdata) {
ret <- as.factor(ifelse(predict.glm(object, newdata, type='response') < 0.5, 0, 1))

# logerrorest takes four arguments
# mdata is a data frame holding the data to be modeled
# form is the output of the is.formula function
# rvar is the response variable as a string, for example 'birdx'
# fvar is the fold variable, for example 'recordyear'
logerrorest <- function(mdata, form, rvar, fvar) { require(Hmisc) require(ipred)
# determine index of variables
rpos <- match(rvar, names(mdata)) fpos <- match(fvar, names(mdata))
# get fold values and count for each group
vardesc <- describe(mdata[[fpos]])$values fvarlist <- as.integer(dimnames(vardesc)[[2]]) k <- length(fvarlist) countlist <- vardesc[1,1] for (i in 2:k) { countlist[i] <- vardesc[1,i] } n <- length(mdata[[fpos]])
# get index list for each fold
cc <- list() for (i in 1:k) { cc[[i]] <- as.integer(rownames(mdata[mdata[[fpos]] == fvarlist[i], ])) }
# determine root mean squared error
ee <- errorest(form, mdata, estimator='cv', model=myglm, est.para=control.errorest(list.tindx = cc))
# determine misclassification error
# first convert to factor
width = length(mdata) response <- as.factor(as.integer(mdata[[rpos]])) newmatrix <- data.frame(cbind(mdata[1:(rpos-1)], response, mdata[(rpos+1):width])) newform <- as.formula(paste('response ~ ', as.character(form)[[3]])) me <- errorest(newform, newmatrix, estimator='cv', model=myglm, predict=myfacpred, est.para=control.errorest(list.tindx = cc)) ret <- data.frame(cbind(ee, me)) return(ret) } ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Fri Mar 18 16:04:04 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:30:51 EST