[R] Cross validation, one more time (hopefully the last)

From: Trevor Wiens <twiens_at_interbaun.com>
Date: Thu 17 Mar 2005 - 11:59:01 EST

I apologize for posting on this question again, but unfortunately, I don't have and can't get access to MASS for at least three weeks. I have found some code on the web however which implements the prediction error algorithm in cv.glm.


Now I've tried to adapt it to my purposes, but since I'm not deeply familiar with R programming, I don't know why it doesn't work. Now checking the r-help list faq it seems this is an appropriate question.

I've included my attempted function below. The error I get is:

logcv(basp.data, form, 'basp', 'recordyear') Error in order(na.last, decreasing, ...) :

        Argument 1 is not a vector

My questions are, why doesn't this work, and how do I fix it.

I'm using the formula function to create the formula that I'm sending to my function. And the mdata is a data.frame. I'm assumed that if I passed the column names as strings (response variable - rvar, fold variable - fvar) this would work. Apparently however it doesn't.

Lastly, since I don't have access to MASS and there are apparently many examples of doing this kind of thing in MASS, could someone tell me if this function looks approximately correct?



logcv <- function(mdata, formula, rvar, fvar) { require(Hmisc)

# sort by fold variable
sorted <- mdata[order(mdata$fvar), ]

# get fold values and count for each group vardesc <- describe(sorted$fvar)$values
fvarlist <- as.integer(dimnames(vardesc)[[2]]) k <- length(fvarlist)
countlist <- vardesc[1,1]
for (i in 2:k)
countlist[i] <- vardesc[1,i]

n <- length(sorted$fvar)

# fit to all the mdata
fit.all <- glm(formula, sorted, family=binomial) pred.all <- ifelse( predict(fit.all, type="response") < 0.5, 0, 1)

pred.c <- list()
error.i <- vector(length=k)

for (i in 1:k)
fit.i <- glm(formula, subset(sorted, sorted$fvar != fvarlist[i]), family=binomial)

pred.i <- ifelse(predict(fit.i, newdata=subset(sorted, sorted$fvar == fvarlist[i]), type="response") < 0.5, 0, 1)
pred.c[[i]] = pred.i
pred.all.i <- ifelse(predict(fit.i, newdata=sorted, type="response") < 0.5, 0, 1)
error.i[i] <- sum(sorted$rvar != pred.all.i)/n

pred.cc <- unlist(pred.c)
delta.cv.k <- sum(sorted$rvar != pred.cc)/n p.k <- countlist/n
delta.app <- mean(sorted$rvar != pred.all)/n

delta.acv.k <- delta.cv.k + delta.app - sum(p.k*error.i)


Trevor Wiens 

R-help@stat.math.ethz.ch mailing list
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Thu Mar 17 12:04:21 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:30:50 EST