Re: [R] Still help needed on embeded regression

From: Guojun Zhu <shmilylemon_at_yahoo.com>
Date: Wed 03 May 2006 - 13:57:18 EST


For my perticular purpose (run regression on trailing 60 number b on 60 a), I have writen a short function for it because no available function (runmean, rollmean) can take care of NA as I wanted. I basically want the regression ignore the NA row. I have read a little bit source code of "runmean" and developed my code. I basically put zero for every NA row, as well as keep tracking the real number of regression for the least square fit (dummy). The whole program takes less than a about 0.2 seconds for the whole 144,000 data. I was amazed when I think about the 2+ hours for my original code. My friend with SAS needs about 1 minutes, he used to beat me by 100 times. Now I beat him by 100 times!

Here is the code. I am wondering if it will be useful for others. Actually I think this kind of calculation is fairly common at least in finance for things like "stock beta". Does it worht to wrap it and put it somewhere on web? Thanks.

window.sum<-function(x,k){
na.row<-is.na(x);x[na.row]<-0;
temp<-c(sum(x[1:k]),diff(x,k)); #print(temp) cumsum(temp)
}

##################################

# function to calculate linear regression of y on x
for the lag k,
# such as stock beta for the trailing k months.
# It will return a list of two vectors with length as
length(x)-k+1;
# ( intercept,coefficency)

lag.regression<-function(x,y,k){
l<-length(x);

###########################

# correctly taken care of the NA value, we basically
set all value in a row as 0
# if one of them is NA. By that we we find the
correct answer for regression
# with this row basically takes no effect.
na.row<-is.na(x)|is.na(y);
x[na.row]<-0; y[na.row]<-0;
dummy<-rep(1,l); dummy[na.row]<-0;
###############################

mxy<-window.sum(x*y,k);mx<-window.sum(x,k);
mxx<-window.sum(x*x,k);my<-window.sum(y,k); num<-window.sum(dummy,k);
b<-(num*mxy-mx*my)/(num*mxx-mx*mx);
a<-(my-b*mx)/num;

#########
#this take care of the case when all k ys or xs are 0 and yield a none sense result
na.row1<-is.na(a)|is.na(b);
a[na.row1]<-NA; b[na.row1]<-NA;
###############

list(a,b)
}

x<-RET-riskfree; y<-sprtrn-riskfree;
s<-lag.regression(y,x,60)

> Try
> 
> 	runmean2 <- function(x, k) # k must be even
> 		(coredata(runmean(x, k-1)) * (k-1) +
> 			coredata(lag(x, -k/2, na.pad = TRUE)))/k
> 
> Also, in your code use matrices or vectors instead
> of data frames
> to avoid any overhead in using data frames.
> 
> On 5/2/06, Guojun Zhu <shmilylemon@yahoo.com> wrote:
> > Sorry to bother you guys again. This is great. 
> But
> > this is for 61 number and the second case will
> change
> > 60 to 61.  "run*" only accept odd number window.
> How
> > to get around it with 60?  Any suggestion? Thanks.
> >
> > --- Gabor Grothendieck <ggrothendieck@gmail.com>
> > wrote:
> >
> > > Using runmean from caTools the first one below
> does
> > > it in under 1 second but will not handle NAs. 
> The
> > > second one takes under 15 seconds and handles
> > > them by replacing them with linear
> approximations.
> > > Note that k must be odd.
> > >
> > > # 1
> > >
> > > library(caTools)
> > > set.seed(1)
> > > system.time({
> > >       y <- rnorm(140001)
> > >       x <- as.numeric(seq(y))
> > >       k <- 61
> > >       Mxy <- runmean(x * y, k)
> > >       Mxx <- runmean(x * x, k)
> > >       Mx <- runmean(x, k)
> > >       My <- runmean(y, k)
> > >       b <- (Mxy - Mx * My) / (Mxx - Mx * Mx)
> > >       a <- My - b * Mx
> > > })
> > >
> > > # 2
> > >
> > > library(caTools)
> > > library(zoo)
> > > set.seed(1)
> > > system.time({
> > >       y <- rnorm(140000)
> > >       x <- as.numeric(seq(y))
> > >       x[100:200] <- NA
> > >       x <- na.approx(zoo(x))
> > >       y <- zoo(y)
> > >       k <- 60
> > >       Mxy <- runmean(x * y, k)
> > >       Mxx <- runmean(x * x, k)
> > >       Mx <- runmean(x, k)
> > >       My <- runmean(y, k)
> > >       b <- (Mxy - Mx * My) / (Mxx - Mx * Mx)
> > >       a <- My - b * Mx
> > > })
> > >
> > >
> > > On 5/1/06, Guojun Zhu <shmilylemon@yahoo.com>
> wrote:
> > > > I basically has a long data.frame a.  but I
> only
> > > need
> > > > three columns x,y. Let us say the index of row
> is
> > > t.
> > > > I need to produce new column s_t as the linear
> > > > regression coefficient of
> (x_(t-60),...x_(t-1)) on
> > > > (y_(t-60),...,y_(t-1)). The data is about
> 140,000
> > > > rows.  I wrote a simple code on this which is
> > > super
> > > > slow, it takes more than 2 hours on a 2.8Ghz
> Intel
> > > Duo
> > > > Core.  My friend use SAS and his code needs
> only
> > > > couple of minutes.  I know there must be some
> more
> > > > efficient way to write it.  Can anyone help me
> on
> > > > this?  Here is the code.
> > > >
> > > > Also one line produce a complete NA temp$y and
> lm
> > > > function failed on that.  How to make it just
> > > produce
> > > > a NA instead and keep runing?
> > > >
> > > > attach(return)
> > > > betat=rep(NA,length(RET))
> > > > for (i in 61:length(RET)){cat(i," ");
> > > > if (year[[i]]>=1995){
> > > >
> > > >
> > >
> >
>

temp<-data.frame(y=RET[(i-60):(i-1)]-riskfree[(i-60):(i-1)],x=sprtrn[(i-60):(i-1)]-riskfree[(i-60):(i-1)])
> > > >
> > > >
> > >
> >
>

betat[[i]]<-lm(y~x+1,na.action=na.exclude,temp)[[1]][[2]]
> > > >  #if (i%%100==0)
> > > > cat(i," ");
> > > >
> > > >
> > > >
> > >
> >
>

return$vol.cap[[i]]=mean(VOL[(i-12):(i-1)],na.rm=TRUE)/return$cap[[i]]
> > > > }
> > > > }
> > > >
> > > > ______________________________________________
> > > > R-help@stat.math.ethz.ch mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
> > > >
> > >
> >
> >
> > __________________________________________________
> > Do You Yahoo!?

> protection around
> > http://mail.yahoo.com

> >
>

R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Wed May 03 14:03:54 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Wed 03 May 2006 - 16:10:01 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.