Re: [R] efficiency in merging two data frames

From: Steve Miller <steve.miller_at_jhu.edu>
Date: Mon 01 May 2006 - 22:54:54 EST


I'm sure you'll get ingenious responses to help you optimize your R code. I deal with similar investment data in even larger numbers (e.g. 10 years of daily return data for each stock in the Russell 3000), and prefer reading and consolidating the data in Python using dictionaries and lists, then either piping the data to R in a read statement (read.table("pipe python...")) or using Rpy to write R data frames directly from Python. Python is more facile with these basic data manipulations for hundreds of thousands or even millions of records, and performance is generally considerably better.

Steve Miller

-----Original Message-----
From: r-help-bounces@stat.math.ethz.ch
[mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Guojun Zhu Sent: Monday, May 01, 2006 2:35 AM
To: r-help@stat.math.ethz.ch
Subject: [R] efficiency in merging two data frames

I have two data sets about lots of companies' stock and fiscal data. One is monthly data with about 144,000 lines, and the other is quaterly with about 56,000. Each data set takes different company code. I need to merge these two together. I read both ask cvs. And the other file with corresponding firm code.  Now I have three data sets. return$PERMNO, account$GVKEY. id is the data frames of the corresponding relation and has both id$PERMNO and id$GVKEY. Also, I need to convert the return's month into quarter and finally merge two data frames(return and account). I end up write a short program for this, but it runs very slow. 15+ minutes. Is there quick way to do it. Here is my original codes.

id$fy=rep(0,length(id$PERMNO))
for (i in 1:length(id$PERMNO))

id$fy[[i]]<-account$FYR[id$GVKEY[[i]]==account$GVKEY][[1]]

return$GVKEY=rep(0,length(return$PERMNO))
return$fyy=rep(0,length(return$PERMNO))
return$fyq=rep(0,length(return$PERMNO))

for (i in i:length(return$PERMNO)) {

    temp<-id$PERMNO==return$PERMNO[[i]];     tempmon<-id$fy[temp][[1]];
    if (return$month[[i]]<-tempmon) {

	return$fyy[[i]]<-return$year[[i]];
	return$fyq[[i]]<-4-(tempmon-return$month[[i]])%/%3;

}
else{ return$fyy[[i]]<-return$year[[i]]+1; return$fyq[[i]]<-(return$month[[i]]-tempmon-1)%/%3;
}

    return$GVKEY[[i]]<-id$GVKEY[temp][[1]]; }    

returnnew=merge(return,account,by.x<-c("GVKEY","fyy","fyq"),by.y<-c("GVKEY", "fyy","fyq"))



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Mon May 01 23:00:40 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Tue 02 May 2006 - 22:09:56 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.