[R] need advice on code optimization.

From: Wladimir Eremeev <wl_at_eimb.ru>
Date: Mon 04 Apr 2005 - 16:50:48 EST


Dear colleagues,

  I have the following code. This code is to 'filter' the data set.

  It works on the data frame 'whole', which has four numeric columns:
  a, b, d, and c. Every row in the data frame is treated as a point in
  3-D space: a, b, and d are the point's coordinates, and c is its
  value. The code looks at every point, builds a cube 'centered' at
  that point, selects the set of points inside the cube, calculates
  the mean and SD of their values, and drops every point whose value
  differs from the mean by more than 2 SD.

  Here is the code.



# initialization
cube.half.size<-2 # half size of a cube to be built around every point
mult.sigma<-2        # we will drop every point with value differing
                     # from mean more than mult.sigma * SD

to.drop<-data.frame() # the list of points to drop.

for(i in 1:length(whole$c)){ #look at every point...

  pv<-subset(whole,abs(a-whole$a[i])<cube.half.size &   #make the subset...
                   abs(b-whole$b[i])<cube.half.size &
                   abs(d-whole$d[i])<cube.half.size)
  if(length(pv$c)>1){ # if the subset holds more than the considered point,
    mean.c<-mean(pv$c) # calculate mean and SD
    sd.c<-sd(pv$c)

#make a list of points to drop from current subset

    td<-subset(pv,abs(c-mean.c)>sd.c*mult.sigma)
    if(length(td$c)>0){

#check which of these points are already in the list to drop

      td.index<-which(row.names(td) %in% row.names(to.drop))
      

#and replenish the list of points to drop
      to.drop<-rbind(to.drop,if(length(td.index)>0) td[-td.index,] else td)

#print a message showing we're alive (these messages will
#not appear regularly; that's OK)

      if(length(td.index)!=length(td$c))
        print(c("i=",i,"Points to drop: ",length(to.drop$c)))
    }
  }
}

# make a new data set without dropped points.
whole.flt.3<-whole[!(row.names(whole) %in% row.names(to.drop)),]
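  The row-name based removal in that last step can be checked on a tiny
  made-up data frame (the values below are invented purely for
  illustration):

```r
# Hypothetical 3-row frame; pretend row "3" was flagged for dropping.
whole   <- data.frame(a = 1:3, b = 1:3, d = 1:3, c = c(1, 2, 50))
to.drop <- whole[3, ]
# keep only the rows of 'whole' whose row names are NOT in 'to.drop'
whole.flt.3 <- whole[!(row.names(whole) %in% row.names(to.drop)), ]
nrow(whole.flt.3)
```

  Note that the `!(... %in% ...)` form also behaves correctly when
  'to.drop' is empty, which a negative index on `which()` does not.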


  

  The problem is that the 'whole' data set is large (more than 100,000
  rows), and the script runs for several hours. The running time grows
  further if I build a sphere instead of a cube.

  I would like to optimize it so that it runs faster. Is that possible?
  Will a sorting take effect?
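  One direction I have considered, as an untested sketch: sort 'whole'
  once on column a, then use findInterval() on the sorted a-values so
  that each cube test only scans the rows whose a-coordinate can
  possibly qualify, instead of all rows. The small data frame below is
  made up for illustration only; column names follow the code above.

```r
# Toy data: 11 points close together in space, one obvious outlier in c.
whole <- data.frame(a = runif(11, 0, 0.5),
                    b = runif(11, 0, 0.5),
                    d = runif(11, 0, 0.5),
                    c = c(rep(1, 10), 100))

cube.half.size <- 2
mult.sigma <- 2

whole <- whole[order(whole$a), ]   # sort once on a
a.sorted <- whole$a
drop.rows <- character(0)          # row names of points to drop

for (i in seq_len(nrow(whole))) {
  # candidate rows: only those whose a lies inside the cube's a-range
  lo <- findInterval(whole$a[i] - cube.half.size, a.sorted) + 1
  hi <- findInterval(whole$a[i] + cube.half.size, a.sorted)
  cand <- whole[lo:hi, ]
  pv <- cand[abs(cand$a - whole$a[i]) < cube.half.size &
             abs(cand$b - whole$b[i]) < cube.half.size &
             abs(cand$d - whole$d[i]) < cube.half.size, ]
  if (nrow(pv) > 1) {
    m <- mean(pv$c)
    s <- sd(pv$c)
    td <- pv[abs(pv$c - m) > s * mult.sigma, ]
    drop.rows <- union(drop.rows, row.names(td))
  }
}
whole.flt <- whole[!(row.names(whole) %in% drop.rows), ]
```

  Collecting row names with union() also replaces the rbind() and
  duplicate-checking bookkeeping in the loop above. I have not
  benchmarked this on the full 100,000-row data, so I do not know how
  much the pruning actually saves in practice.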
  Thank you for your attention and any feedback.

--
Best regards
Wladimir Eremeev                                     mailto:wl@eimb.ru

==========================================================================
Research Scientist, PhD
Russian Academy of Sciences

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Mon Apr 04 16:54:48 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:31:01 EST