From: Guojun Zhu <shmilylemon_at_yahoo.com>

Date: Wed 03 May 2006 - 13:57:18 EST

# function to calculate linear regression of y on x

for the lag k,

# such as stock beta for the trailing k months.

# It will return a list of two vectors with length as

length(x)-k+1;

# ( intercept,coefficency)

# correctly taken care of the NA value, we basically

set all value in a row as 0

# if one of them is NA. By that we we find the

correct answer for regression

# with this row basically takes no effect.

na.row<-is.na(x)|is.na(y);

x[na.row]<-0; y[na.row]<-0;

dummy<-rep(1,l); dummy[na.row]<-0;

b<-(num*mxy-mx*my)/(num*mxx-mx*mx);

a<-(my-b*mx)/num;

temp<-data.frame(y=RET[(i-60):(i-1)]-riskfree[(i-60):(i-1)],x=sprtrn[(i-60):(i-1)]-riskfree[(i-60):(i-1)])

betat[[i]]<-lm(y~x+1,na.action=na.exclude,temp)[[1]][[2]]

return$vol.cap[[i]]=mean(VOL[(i-12):(i-1)],na.rm=TRUE)/return$cap[[i]]

> >

>

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Wed May 03 14:03:54 2006

Date: Wed 03 May 2006 - 13:57:18 EST

For my perticular purpose (run regression on trailing
60 number b on 60 a), I have writen a short function
for it because no available function (runmean,
rollmean) can take care of NA as I wanted. I
basically want the regression ignore the NA row. I
have read a little bit source code of "runmean" and
developed my code. I basically put zero for every NA
row, as well as keep tracking the real number of
regression for the least square fit (dummy). The
whole program takes less than a about 0.2 seconds for
the whole 144,000 data. I was amazed when I think
about the 2+ hours for my original code. My friend
with SAS needs about 1 minutes, he used to beat me by
100 times. Now I beat him by 100 times!

Here is the code. I am wondering if it will be useful for others. Actually I think this kind of calculation is fairly common at least in finance for things like "stock beta". Does it worht to wrap it and put it somewhere on web? Thanks.

window.sum<-function(x,k){

na.row<-is.na(x);x[na.row]<-0;

temp<-c(sum(x[1:k]),diff(x,k)); #print(temp)
cumsum(temp)

}

##################################

# function to calculate linear regression of y on x

for the lag k,

# such as stock beta for the trailing k months.

# It will return a list of two vectors with length as

length(x)-k+1;

# ( intercept,coefficency)

lag.regression<-function(x,y,k){

l<-length(x);

###########################

# correctly taken care of the NA value, we basically

set all value in a row as 0

# if one of them is NA. By that we we find the

correct answer for regression

# with this row basically takes no effect.

na.row<-is.na(x)|is.na(y);

x[na.row]<-0; y[na.row]<-0;

dummy<-rep(1,l); dummy[na.row]<-0;

############################### mxy<-window.sum(x*y,k);mx<-window.sum(x,k);mxx<-window.sum(x*x,k);my<-window.sum(y,k); num<-window.sum(dummy,k);

b<-(num*mxy-mx*my)/(num*mxx-mx*mx);

a<-(my-b*mx)/num;

#########

#this take care of the case when all k ys or xs are 0
and yield a none sense result

na.row1<-is.na(a)|is.na(b);

a[na.row1]<-NA; b[na.row1]<-NA;

###############

list(a,b)

}

x<-RET-riskfree; y<-sprtrn-riskfree;

s<-lag.regression(y,x,60)

- Gabor Grothendieck <ggrothendieck@gmail.com> wrote:

> Try > > runmean2 <- function(x, k) # k must be even > (coredata(runmean(x, k-1)) * (k-1) + > coredata(lag(x, -k/2, na.pad = TRUE)))/k > > Also, in your code use matrices or vectors instead > of data frames > to avoid any overhead in using data frames. > > On 5/2/06, Guojun Zhu <shmilylemon@yahoo.com> wrote: > > Sorry to bother you guys again. This is great. > But > > this is for 61 number and the second case will > change > > 60 to 61. "run*" only accept odd number window. > How > > to get around it with 60? Any suggestion? Thanks. > > > > --- Gabor Grothendieck <ggrothendieck@gmail.com> > > wrote: > > > > > Using runmean from caTools the first one below > does > > > it in under 1 second but will not handle NAs. > The > > > second one takes under 15 seconds and handles > > > them by replacing them with linear > approximations. > > > Note that k must be odd. > > > > > > # 1 > > > > > > library(caTools) > > > set.seed(1) > > > system.time({ > > > y <- rnorm(140001) > > > x <- as.numeric(seq(y)) > > > k <- 61 > > > Mxy <- runmean(x * y, k) > > > Mxx <- runmean(x * x, k) > > > Mx <- runmean(x, k) > > > My <- runmean(y, k) > > > b <- (Mxy - Mx * My) / (Mxx - Mx * Mx) > > > a <- My - b * Mx > > > }) > > > > > > # 2 > > > > > > library(caTools) > > > library(zoo) > > > set.seed(1) > > > system.time({ > > > y <- rnorm(140000) > > > x <- as.numeric(seq(y)) > > > x[100:200] <- NA > > > x <- na.approx(zoo(x)) > > > y <- zoo(y) > > > k <- 60 > > > Mxy <- runmean(x * y, k) > > > Mxx <- runmean(x * x, k) > > > Mx <- runmean(x, k) > > > My <- runmean(y, k) > > > b <- (Mxy - Mx * My) / (Mxx - Mx * Mx) > > > a <- My - b * Mx > > > }) > > > > > > > > > On 5/1/06, Guojun Zhu <shmilylemon@yahoo.com> > wrote: > > > > I basically has a long data.frame a. but I > only > > > need > > > > three columns x,y. Let us say the index of row > is > > > t. > > > > I need to produce new column s_t as the linear > > > > regression coefficient of > (x_(t-60),...x_(t-1)) on > > > > (y_(t-60),...,y_(t-1)). The data is about > 140,000 > > > > rows. I wrote a simple code on this which is > > > super > > > > slow, it takes more than 2 hours on a 2.8Ghz > Intel > > > Duo > > > > Core. My friend use SAS and his code needs > only > > > > couple of minutes. I know there must be some > more > > > > efficient way to write it. Can anyone help me > on > > > > this? Here is the code. > > > > > > > > Also one line produce a complete NA temp$y and > lm > > > > function failed on that. How to make it just > > > produce > > > > a NA instead and keep runing? > > > > > > > > attach(return) > > > > betat=rep(NA,length(RET)) > > > > for (i in 61:length(RET)){cat(i," "); > > > > if (year[[i]]>=1995){ > > > > > > > > > > > > > >

temp<-data.frame(y=RET[(i-60):(i-1)]-riskfree[(i-60):(i-1)],x=sprtrn[(i-60):(i-1)]-riskfree[(i-60):(i-1)])

> > > > > > > > > > > > > >

betat[[i]]<-lm(y~x+1,na.action=na.exclude,temp)[[1]][[2]]

> > > > #if (i%%100==0) > > > > cat(i," "); > > > > > > > > > > > > > > > > > >

return$vol.cap[[i]]=mean(VOL[(i-12):(i-1)],na.rm=TRUE)/return$cap[[i]]

> > > > } > > > > } > > > > > > > > ______________________________________________ > > > > R-help@stat.math.ethz.ch mailing list > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > PLEASE do read the posting guide! > > > http://www.R-project.org/posting-guide.html > > > > > > > > > > > > > __________________________________________________ > > Do You Yahoo!? > protection around > > http://mail.yahoo.com

> >

>

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Wed May 03 14:03:54 2006

Archive maintained by Robert King, hosted by
the discipline of
statistics at the
University of Newcastle,
Australia.

Archive generated by hypermail 2.1.8, at Wed 03 May 2006 - 16:10:01 EST.

*
Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help.
Please read the posting
guide before posting to the list.
*