From: Daniel Folkinshteyn <dfolkins_at_gmail.com>

Date: Fri, 06 Jun 2008 13:29:56 -0400

>> Anybody have any thoughts on this? Please? :)
>>
>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>> Hi everyone!
>>>
>>> I have a question about data processing efficiency.
>>>
>>> My data are as follows: I have a data set on quarterly institutional
>>> ownership of equities; some of them have had recent IPOs, some have
>>> not (I have a binary flag set). The total dataset size is 700k+ rows.
>>>
>>> My goal is this: For every quarter since issue for each IPO, I need
>>> to find a "matched" firm in the same industry, and close in market
>>> cap. So, e.g., for firm X, which had an IPO, i need to find a matched
>>> non-issuing firm in quarter 1 since IPO, then a (possibly different)
>>> non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing
>>> firm (there are about 8300 of these).
>>>
>>> Thus it seems to me that I need to be doing a lot of data selection
>>> and subsetting, and looping (yikes!), but the result appears to be
>>> highly inefficient and takes ages (well, many hours). What I am
>>> doing, in pseudocode, is this:
>>>
>>> 1. for each quarter of data, getting out all the IPOs and all the
>>> eligible non-issuing firms.
>>> 2. for each IPO in a quarter, grab all the non-issuers in the same
>>> industry, sort them by size, and finally grab a matching firm closest
>>> in size (the exact procedure is to grab the closest bigger firm if
>>> one exists, and just the biggest available if all are smaller)
>>> 3. assign the matched firm-observation the same "quarters since
>>> issue" as the IPO being matched
>>> 4. rbind them all into the "matching" dataset.
>>>
>>> The function I currently have is pasted below, for your reference. Is
>>> there any way to make it produce the same result but much faster?
>>> Specifically, I am guessing eliminating some loops would be very
>>> good, but I don't see how, since I need to do some fancy footwork for
>>> each IPO in each quarter to find the matching firm. I'll be doing a
>>> few things similar to this, so it's somewhat important to up the
>>> efficiency of this. Maybe some of you R-fu masters can clue me in? :)
>>>
>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>
>>> ========== my function below ===========
>>>
>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>>>         quarters_since_issue=40) {
>>>
>>>     # rbind for matrix is cheaper, so typecast the result to matrix
>>>     result = matrix(nrow=0, ncol=ncol(tfdata))
>>>
>>>     colnames = names(tfdata)
>>>
>>>     quarterends = sort(unique(tfdata$DATE))
>>>
>>>     for (aquarter in quarterends) {
>>>         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>
>>>         tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>             (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>             (tfdata_quarter$IPO.Flag == 0), ]
>>>         tfdata_quarter_ipoissuers = tfdata_quarter[
>>>             tfdata_quarter$IPO.Flag == 1, ]
>>>
>>>         for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
>>>             arow = tfdata_quarter_ipoissuers[i,]
>>>             industrypeers = tfdata_quarter_fitting_nonissuers[
>>>                 tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>             industrypeers = industrypeers[
>>>                 order(industrypeers$Market.Cap.13f), ]
>>>             if ( nrow(industrypeers) > 0 ) {
>>>                 if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0 ) {
>>>                     bestpeer =
>>>                         industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
>>>                 }
>>>                 else {
>>>                     bestpeer = industrypeers[nrow(industrypeers),]
>>>                 }
>>>                 bestpeer$Quarters.Since.IPO.Issue =
>>>                     arow$Quarters.Since.IPO.Issue
>>>
>>>                 #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
>>>                 result = rbind(result, as.matrix(bestpeer))
>>>             }
>>>         }
>>>         #result = rbind(result, tfdata_quarter)
>>>         print (aquarter)
>>>     }
>>>
>>>     result = as.data.frame(result)
>>>     names(result) = colnames
>>>     return(result)
>>>
>>> }
>>>
>>> ========= end of my function =============
>>>
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
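The matching rule in step 2 of the quoted post (closest bigger firm if one exists, otherwise the biggest available) can probably be done without sorting and re-subsetting the peers for every IPO. Below is a minimal sketch working on a plain numeric vector of peer market caps. The helper name `pick_peer` and its arguments are made up for illustration, and it assumes there are no missing market caps. When several peers share the same cap, it may pick a different one of them than the sort-based code does.

```r
# Hypothetical helper (not from the original post): return the position of the
# best peer within peer_caps, or NA if there are no peers at all.
#   peer_caps: market caps of eligible non-issuers (same industry, same quarter)
#   ipo_cap:   market cap of the IPO firm being matched
pick_peer <- function(peer_caps, ipo_cap) {
  if (length(peer_caps) == 0) return(NA_integer_)
  bigger <- which(peer_caps >= ipo_cap)
  if (length(bigger) > 0) {
    # closest bigger (or equal) firm: smallest cap among those >= ipo_cap
    bigger[which.min(peer_caps[bigger])]
  } else {
    # every peer is smaller: take the biggest one available
    which.max(peer_caps)
  }
}

peer_caps  <- c(50, 120, 80, 300)
m_bigger   <- pick_peer(peer_caps, 100)   # a bigger peer exists
m_smaller  <- pick_peer(peer_caps, 500)   # all peers are smaller
m_none     <- pick_peer(numeric(0), 100)  # no peers in this industry
```

The returned position can then be used to pull the full row out of the peer subset, so only one row of the data frame is extracted per IPO.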

R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Fri 06 Jun 2008 - 17:36:18 GMT

thanks for the tip! i'll try that and see how big a difference it makes... if i am not sure exactly what the final size will be, am i better off making 'result' larger and later stripping off the blank rows, or making it smaller and appending the missing rows?
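On the larger-versus-smaller question, one common pattern is to allocate an upper bound, count the rows actually filled, and trim once at the end, since appending rows later brings the repeated copying back. In the posted function each IPO row produces at most one matched row, so the number of IPO rows (something like `sum(tfdata$IPO.Flag == 1)`) looks like a safe upper bound. The sketch below uses toy data, and all names in it are illustrative.

```r
# Toy stand-in for the matching loop: some iterations yield a row, some do not.
n_candidates <- 10        # upper bound on the number of result rows
result <- matrix(NA_real_, nrow = n_candidates, ncol = 2,
                 dimnames = list(NULL, c("id", "value")))
k <- 0                    # number of rows filled so far

for (i in seq_len(n_candidates)) {
  if (i %% 3 == 0) next   # pretend no peer was found for this candidate
  k <- k + 1
  result[k, ] <- c(i, i^2)
}

# Strip the unused rows in a single step; drop = FALSE keeps it a matrix
# even if only one row (or none) was filled.
result <- result[seq_len(k), , drop = FALSE]
```

Trimming this way costs one copy at the end, whereas growing a too-small matrix with rbind() copies the whole thing on every append.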

on 06/06/2008 11:44 AM Patrick Burns said the following:

> One thing that is likely to speed the code significantly
> is if you create 'result' to be its final size and then
> subscript into it. Something like:
>
>     result[i, ] <- bestpeer
>
> (though I'm not sure if 'i' is the proper index).
>
> Patrick Burns
> patrick_at_burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
> Daniel Folkinshteyn wrote:

>> Anybody have any thoughts on this? Please? :)
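As a rough illustration of the suggestion above: in the posted function 'i' restarts at 1 in every quarter, so it would overwrite earlier rows if used directly as the row index. A separate running counter is one way around that. This is a sketch on toy data, with the final size known up front. The names `groups`, `total` and `k` are invented for the example.

```r
# Toy data: 3 outer groups ("quarters"), each with a few inner items ("IPOs").
groups <- list(q1 = c(1.5, 2.5), q2 = c(3.5), q3 = c(4.5, 5.5, 6.5))

total  <- sum(lengths(groups))            # final number of rows, known in advance
result <- matrix(NA_real_, nrow = total, ncol = 2)
k <- 0                                    # running row index across all groups

for (g in seq_along(groups)) {
  for (i in seq_along(groups[[g]])) {     # i restarts at 1 for every group
    k <- k + 1
    result[k, ] <- c(g, groups[[g]][i])   # subscript into the preallocated matrix
  }
}
```

seq_along() and seq_len() also give an empty loop when there is nothing to iterate over, whereas 1:nrow(x) runs over 1 and 0 when x has no rows.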


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Fri 06 Jun 2008 - 19:30:39 GMT.
