[R] Cross validation, one more time (hopefully the last)

From: Trevor Wiens <twiens_at_interbaun.com>
Date: Thu 17 Mar 2005 - 11:59:01 EST


I apologize for posting on this question again, but unfortunately, I don't have and can't get access to MASS for at least three weeks. I have found some code on the web however which implements the prediction error algorithm in cv.glm.

http://www.bioconductor.org/workshops/NGFN03/modelsel-exercise.pdf

Now I've tried to adapt it to my purposes, but since I'm not deeply familiar with R programming, I don't know why it doesn't work. Now checking the r-help list faq it seems this is an appropriate question.

I've included my attempted function below. The error I get is:

logcv(basp.data, form, 'basp', 'recordyear') Error in order(na.last, decreasing, ...) :

        Argument 1 is not a vector

My questions are, why doesn't this work, and how do I fix it.

I'm using the formula function to create the formula that I'm sending to my function. And the mdata is a data.frame. I'm assumed that if I passed the column names as strings (response variable - rvar, fold variable - fvar) this would work. Apparently however it doesn't.

Lastly, since I don't have access to MASS and there are apparently many examples of doing this kind of thing in MASS, could someone tell me if this function looks approximately correct?

Thanks

T


logcv <- function(mdata, formula, rvar, fvar) { require(Hmisc)

# sort by fold variable
sorted <- mdata[order(mdata$fvar), ]

# get fold values and count for each group vardesc <- describe(sorted$fvar)$values
fvarlist <- as.integer(dimnames(vardesc)[[2]]) k <- length(fvarlist)
countlist <- vardesc[1,1]
for (i in 2:k)
{
countlist[i] <- vardesc[1,i]
}

n <- length(sorted$fvar)

# fit to all the mdata
fit.all <- glm(formula, sorted, family=binomial) pred.all <- ifelse( predict(fit.all, type="response") < 0.5, 0, 1)

#setup
pred.c <- list()
error.i <- vector(length=k)

for (i in 1:k)
{
fit.i <- glm(formula, subset(sorted, sorted$fvar != fvarlist[i]), family=binomial)

pred.i <- ifelse(predict(fit.i, newdata=subset(sorted, sorted$fvar == fvarlist[i]), type="response") < 0.5, 0, 1)
pred.c[[i]] = pred.i
pred.all.i <- ifelse(predict(fit.i, newdata=sorted, type="response") < 0.5, 0, 1)
error.i[i] <- sum(sorted$rvar != pred.all.i)/n
}

pred.cc <- unlist(pred.c)
delta.cv.k <- sum(sorted$rvar != pred.cc)/n p.k <- countlist/n
delta.app <- mean(sorted$rvar != pred.all)/n

delta.acv.k <- delta.cv.k + delta.app - sum(p.k*error.i)

print(delta.acv.k)
}

-- 
Trevor Wiens 
twiens@interbaun.com

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Thu Mar 17 12:04:21 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:30:50 EST