[R] loops & sampling

From: <Garth.Warren_at_csiro.au>
Date: Thu, 1 Nov 2007 17:33:22 +1100


I'm new to R (and statistics) and my boss has thrown me in the deep end with the following task:

We want to evaluate how sample size affects our ability to build a robust model, that is, how robust the model is to sample size, for the purpose of cross-validation. In our current project we have collected independent data at 250 locations, from which we have built a predictive model. We want to know whether we could get away with collecting fewer samples and still build a decent model, for the obvious operational reasons of cost, time spent in the field, etc.

Our thinking was that we could apply a bootstrap type procedure:  

We would remove 10 records (samples) from the total n = 250 and replace them with copies drawn from the remaining 240. We would apply our model to this new data frame and calculate an r². We would repeat this 1000 times in a loop and then take the mean of the 1000 r² values. After that we would start the process again, removing 20 samples and replacing them with copies from the remaining 230 records, and so on...
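As a rough illustration of the procedure just described, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration only: the simulated data frame `dat`, the helper names `replace_k`, `r2_once` and `mean_r2`, and the use of `summary(fit)$r.squared` (on the transformed scale) as the r². It is not the real project code.

```r
# Sketch only: simulated data stand in for the real 250-location data set.
set.seed(1)
n  <- 250
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
# Hypothetical response, clamped to (0, 100) so asin(sqrt(y / 100)) is defined.
y  <- pmin(pmax(2 + 0.5 * x1^2 + 3 * x2 + rnorm(n, sd = 5), 0.1), 99.9)
dat <- data.frame(y, x1, x2)

# Drop k records, then pad back to n rows with copies drawn
# (with replacement) from the records that were kept.
# Assumes at least two records are kept (n - k >= 2).
replace_k <- function(dat, k) {
  n    <- nrow(dat)
  keep <- sample(n, n - k)                   # row indices that stay
  fill <- sample(keep, k, replace = TRUE)    # copies standing in for the dropped rows
  dat[c(keep, fill), ]
}

# Fit the model once and return its r-squared (on the transformed scale).
r2_once <- function(d) {
  fit <- lm(asin(sqrt(y / 100)) ~ I(x1^2) + x2, data = d)
  summary(fit)$r.squared
}

# Repeat nboot times for a given k and average the r-squared values.
mean_r2 <- function(dat, k, nboot = 1000) {
  mean(replicate(nboot, r2_once(replace_k(dat, k))))
}

# Mean r-squared for k = 10, 20, 30 (nboot kept small here so it runs quickly).
ks  <- c(10, 20, 30)
res <- sapply(ks, function(k) mean_r2(dat, k, nboot = 200))
names(res) <- ks
res
```

The number of records removed is set by `k` inside `replace_k`, and the number of repetitions by `nboot`; the two are independent of each other.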

Below is a simplified version of the real code, which contains most of the basic elements. My main problem is that I'm not sure what the 'for(i in 1:nboot)' line is doing. Originally I thought it removed 1 sample (record) from the data and replaced it with a copy of one of the remaining records, so that 'for(i in 10:nboot)', used in the context of the code below, would remove 10 samples and replace them as described above. I'm almost positive that this isn't happening. If it isn't, how can I make the code below do what we want?



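a <- c(5.5, 2.3, 8.5, 9.1, 8.6, 5.1)
b <- c(5.2, 2.2, 8.6, 9.1, 8.8, 5.7)
c <- c(5.0, 14.6, 8.9, 9.0, 9.1, 5.5)

abc <- data.frame(a, b, c)

# set column names
# (the assignment is missing from the post; y, x1, x2 are assumed from the model formula)
names(abc) <- c("y", "x1", "x2")

abc2 <- abc

# transpose so that each record becomes a column
abc3 <- as.data.frame(t(as.matrix(data.frame(abc2))))

# length() of a data frame is its number of columns (3 here), not its number of records (6)
n <- length(abc2)

npboot.function <- function(nboot) {
  boot.cor <- vector(length = nboot)
  for (i in 1:nboot) {
    # sample() on a data frame draws columns, i.e. n records of abc3, with replacement
    rdata <- sample(abc3, n, replace = TRUE)
    # transpose back: one record per row, columns y, x1, x2
    abc4 <- as.data.frame(t(as.matrix(data.frame(rdata))))
    model <- lm(asin(sqrt(abc4$y/100)) ~ I(abc4$x1^2) + abc4$x2)
    boot.cor[i] <- cor(abc4$y, model$fit)
  }
  boot.cor  # return the nboot correlations
}

bt.cor <- npboot.function(nboot = 10)
bootmean <- mean(bt.cor)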
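To make the question about the 'for(i in 1:nboot)' line concrete, here is a small sketch of what that construct does in R. The data frame `d` and the names `idx`, `boot` and `cols` are made up for illustration. The loop range only sets how many times the body runs; how many records are drawn is set by the size argument of `sample()`.

```r
# The range after `in` only controls how many times the loop body runs.
nboot <- 12
length(1:nboot)    # 12 iterations: i = 1, 2, ..., 12
length(10:nboot)   # 3 iterations:  i = 10, 11, 12

# Toy data frame (hypothetical column names y, x1, x2).
d <- data.frame(y  = c(5.5, 2.3, 8.5, 9.1, 8.6, 5.1),
                x1 = c(5.2, 2.2, 8.6, 9.1, 8.8, 5.7),
                x2 = c(5.0, 14.6, 8.9, 9.0, 9.1, 5.5))

# Resampling records: draw row indices with replacement, then index the rows.
set.seed(42)
idx  <- sample(nrow(d), nrow(d), replace = TRUE)
boot <- d[idx, ]
dim(boot)          # 6 rows, 3 columns

# sample() applied directly to a data frame draws columns, because a
# data frame is a list of columns.
cols <- sample(d, 2)
ncol(cols)         # 2
nrow(cols)         # 6
```

Indexing rows with sampled row numbers, as in `d[idx, ]`, avoids transposing the data frame before and after sampling.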

Any assistance would be greatly appreciated, and the sooner the better, as we are under pressure to reach a conclusion.




R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu 01 Nov 2007 - 06:38:18 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 01 Nov 2007 - 20:30:10 GMT.
