Re: [R] Problem to generate training data set and test data set

From: Jim Lemon <jim_at_bitwrit.com.au>
Date: Tue 26 Dec 2006 - 00:16:20 GMT

Aimin Yan wrote:
> I have a full data set like this:
>
>    aa bas    aas bms   ams bcu        acu     omega       y
> 1 ALA   0 127.71   0 69.99   0 -0.2498560  79.91470 outward
> 2 PRO   0  68.55   0 55.44   0 -0.0949008  76.60380 outward
> 3 ALA   0  52.72   0 47.82   0 -0.0396550  52.19970 outward
> 4 PHE   0  22.62   0 31.21   0  0.1270330 169.52500  inward
> 5 SER   0  71.32   0 52.84   0 -0.1312380   7.47528 outward
> 6 VAL   0  12.92   0 22.40   0  0.1728390 149.09400  inward
> ...
>
>
> aa has 19 levels, and each level has a different number of
> observations.
> I want to randomly pick 75% of the observations in each level to generate a
> training set,
> and 25% of the observations in each level to generate a testing set.
>
Hi Aimin,
I haven't tested this exhaustively, but I think it does what you want.

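# Returns a logical vector that marks, within each level of x, a random
# fraction `prob` of the observations (TRUE = selected).
get.prob.sample<-function(x,prob=0.5) {
  xlevels<-levels(as.factor(x))
  xsamp<-rep(FALSE,length(x))
  for(i in xlevels) {
    # which() drops NAs, so missing values in x are never selected
    idx<-which(x == i)
    # sample positions within idx: sample(idx,...) would draw from 1:idx
    # when a level has a single observation; the size is rounded to a
    # whole number of observations
    xsamp[idx[sample.int(length(idx),round(length(idx)*prob))]]<-TRUE
  }
  return(xsamp)
}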

get.prob.sample(mydata$aa,0.75)
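
The call above returns a logical vector with one entry per row, not the two data sets themselves. Indexing the data frame with that vector and with its negation gives the training and testing sets. Below is a minimal sketch of that step, assuming a toy data frame in place of the real one: the values in mydata are made up, the names in.train, train and test are only illustrative, and the mask is built with ave() purely so the block runs on its own.

```r
set.seed(42)  # only so that the sketch is reproducible

# Toy stand-in for the full data set (hypothetical values, 3 levels only)
mydata <- data.frame(
  aa = factor(rep(c("ALA", "PRO", "SER"), times = c(8, 4, 12))),
  omega = runif(24, 0, 180)
)

# Logical mask: within each level of aa, mark 75% of the rows as TRUE.
# ave() applies FUN to the row indices of each level separately and
# returns the results in the original row order (as 0/1 integers).
in.train <- as.logical(ave(seq_len(nrow(mydata)), mydata$aa,
  FUN = function(idx) {
    n <- length(idx)
    seq_len(n) %in% sample.int(n, round(n * 0.75))
  }))

train <- mydata[in.train, ]   # selected rows of every level
test  <- mydata[!in.train, ]  # the remaining rows of every level

table(train$aa)  # per-level counts in the training set
table(test$aa)   # per-level counts in the testing set
```

With the real data, the same two indexing lines should work with the vector returned by get.prob.sample(mydata$aa,0.75) in place of in.train.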

Jim



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Tue Dec 26 11:19:06 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
