[R] Cross-validation in R

From: Luis Orlindo Tedeschi <luis.tedeschi_at_hotmail.com>
Date: Mon, 09 Jun 2008 14:43:03 -0500

Folks, I am having a problem with cv.glm and would appreciate someone shedding some light on it. The answer is probably obvious, but I cannot see it. I did read the manual, but it gave me no further insight. The database contains 3363 records, and I am running a cross-validation to understand the process.

When using cv.glm (code below), I get a mean perr1 of 0.2336 and an SD of 0.000139. When using a home-made cross-validation (code below), I get a mean perr2 of 0.2338 and an SD of 0.02184. The means are similar, but the SDs are very different.

Questions are:

(1) How is $delta computed in cv.glm? In the home-made version, I simply use sum((Yobs - Ypred)^2)/n. The equation is presumably correct, because the means are similar.

(2) In cv.glm, I have the impression that the system is using glm0.dmi, which was fitted using all the data points, whereas in my home-made version I fit the model on the "train" subset only and evaluate it on the "test" subset. I am not sure whether cv.glm fits a new glm model for each fold or uses the one provided.

(3) Does cv.glm sample with replacement (replace = TRUE) or not?
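
To make the formula in question (1) concrete, here is a minimal sketch with made-up numbers (y_obs and y_pred are hypothetical, not taken from data.dmi). It only illustrates that the sum of squared errors divided by n, the mean of the squared errors, and the matrix-product form used in the home-made loop give the same value; it says nothing about how cv.glm itself computes $delta.

```r
# Hypothetical observed and predicted values, for illustration only
y_obs  <- c(10.2, 9.8, 11.5, 10.9)
y_pred <- c(10.0, 10.1, 11.0, 11.2)

err <- y_obs - y_pred

# Mean squared error written three equivalent ways
mse_sum    <- sum(err^2) / length(err)
mse_mean   <- mean(err^2)
# Matrix-product form; drop() turns the 1x1 matrix into a plain number
mse_matrix <- drop(t(err) %*% err / length(err))

print(c(mse_sum, mse_mean, mse_matrix))
```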

Thanks in advance.



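library(boot)  # provides cv.glm

# Simulation for 50 re-samplings...
perr1.vect <- vector()  # accumulator, assumed by analogy with perr2.vect
for (j in 1:50) {
   cv.dmi <- cv.glm(data.dmi, glm0.dmi, K = 10)
   perr1 <- cv.dmi$delta[2]
   perr1.vect <- c(perr1.vect, perr1)
}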


# Brute-force cross-validation. This should be similar to the cv.glm
perr2.vect <- vector()
for (j in 1:50) {
   select.dmi <- sample(1:nrow(data.dmi), 0.9*nrow(data.dmi))
   train.dmi <- data.dmi[select.dmi,]  # Selecting 90% of the data for training purpose
   test.dmi <- data.dmi[-select.dmi,]  # Selecting 10% (remaining) of the data for testing purpose

   glm1.dmi <- glm(DMI_kg~Sex+DOF+Avg_Nem+In_Wt, na.action=na.omit, data = train.dmi)

   # Create fitted values using test.dmi data
   dmi_pred <- predict.glm(glm1.dmi, test.dmi)
   dmi_obs <- test.dmi[,"DMI_kg"]

   # Get the prediction error = MSE
   perr2 <- t(dmi_obs - dmi_pred)%*%(dmi_obs - dmi_pred)/nrow(test.dmi)
   perr2.vect <- c(perr2.vect, perr2)
}
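
For comparison with the single 90/10 split above, here is a sketch of a generic 10-fold scheme in which every record is held out exactly once and the fold errors are averaged. It is a sketch under stated assumptions, not necessarily what cv.glm does internally: the data frame sim, the model y ~ x and all variable names are made up, because data.dmi is not available here. Averaging over ten folds that together cover the whole data set may be one reason why a K-fold estimate varies less from run to run than the MSE of one random 10% hold-out.

```r
set.seed(1)

# Synthetic stand-in data; the real data.dmi is not shown in the post
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 2 + 3 * sim$x + rnorm(n)

K <- 10
# Assign every row to exactly one of K folds (no replacement)
fold <- sample(rep(1:K, length.out = n))

fold.mse <- numeric(K)
for (k in 1:K) {
   train <- sim[fold != k, ]
   test  <- sim[fold == k, ]
   fit   <- glm(y ~ x, data = train)       # refit on the other K-1 folds
   pred  <- predict(fit, newdata = test)   # predict the held-out fold
   fold.mse[k] <- mean((test$y - pred)^2)
}

# Average of the fold errors, weighted by fold size
fold.size <- as.numeric(table(fold))
cv.mse <- sum(fold.mse * fold.size / n)
print(cv.mse)
```

Repeating this with different seeds and taking sd() of the resulting cv.mse values gives a spread that can be set against the spread of perr2.vect.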


R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Mon 09 Jun 2008 - 19:48:27 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 10 Jun 2008 - 12:30:38 GMT.