Re: [R] Improving data processing efficiency

From: Daniel Folkinshteyn <dfolkins_at_gmail.com>
Date: Fri, 06 Jun 2008 12:03:13 -0400

i did! what did i miss?

on 06/06/2008 11:45 AM Gabor Grothendieck said the following:

> Try reading the posting guide before posting.
> 
> On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn <dfolkins_at_gmail.com> wrote:

>> Anybody have any thoughts on this? Please? :)
>>
>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>> Hi everyone!
>>>
>>> I have a question about data processing efficiency.
>>>
>>> My data are as follows: I have a data set on quarterly institutional
>>> ownership of equities; some of them have had recent IPOs, some have not (I
>>> have a binary flag set). The total dataset size is 700k+ rows.
>>>
>>> My goal is this: For every quarter since issue for each IPO, I need to
>>> find a "matched" firm in the same industry, and close in market cap. So,
>>> e.g., for firm X, which had an IPO, I need to find a matched non-issuing
>>> firm in quarter 1 since IPO, then a (possibly different) non-issuing firm in
>>> quarter 2 since IPO, etc. Repeat for each issuing firm (there are about 8300
>>> of these).
>>>
>>> Thus it seems to me that I need to be doing a lot of data selection and
>>> subsetting, and looping (yikes!), but the result appears to be highly
>>> inefficient and takes ages (well, many hours). What I am doing, in
>>> pseudocode, is this:
>>>
>>> 1. for each quarter of data, get all the IPOs and all the eligible
>>> non-issuing firms.
>>> 2. for each IPO in a quarter, grab all the non-issuers in the same
>>> industry, sort them by size, and finally grab a matching firm closest in
>>> size (the exact procedure is to grab the closest bigger firm if one exists,
>>> and just the biggest available if all are smaller)
>>> 3. assign the matched firm-observation the same "quarters since issue" as
>>> the IPO being matched
>>> 4. rbind them all into the "matching" dataset.
>>>
>>> The function I currently have is pasted below, for your reference. Is
>>> there any way to make it produce the same result but much faster?
>>> Specifically, I am guessing eliminating some loops would be very good, but I
>>> don't see how, since I need to do some fancy footwork for each IPO in each
>>> quarter to find the matching firm. I'll be doing a few things similar to
>>> this, so it's somewhat important to up the efficiency of this. Maybe some of
>>> you R-fu masters can clue me in? :)
>>>
>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>
>>> ========== my function below ===========
>>>
>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>>>     quarters_since_issue=40) {
>>>
>>>   # rbind for matrix is cheaper, so typecast the result to matrix
>>>   result = matrix(nrow=0, ncol=ncol(tfdata))
>>>
>>>   colnames = names(tfdata)
>>>
>>>   quarterends = sort(unique(tfdata$DATE))
>>>
>>>   for (aquarter in quarterends) {
>>>     tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>
>>>     tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>       (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>       (tfdata_quarter$IPO.Flag == 0), ]
>>>     tfdata_quarter_ipoissuers =
>>>       tfdata_quarter[tfdata_quarter$IPO.Flag == 1, ]
>>>
>>>     # seq_len() skips the loop when a quarter has no IPOs (1:0 would not)
>>>     for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) {
>>>       arow = tfdata_quarter_ipoissuers[i,]
>>>       industrypeers = tfdata_quarter_fitting_nonissuers[
>>>         tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>       industrypeers = industrypeers[
>>>         order(industrypeers$Market.Cap.13f), ]
>>>       if ( nrow(industrypeers) > 0 ) {
>>>         if ( nrow(industrypeers[industrypeers$Market.Cap.13f >=
>>>             arow$Market.Cap.13f, ]) > 0 ) {
>>>           bestpeer = industrypeers[industrypeers$Market.Cap.13f >=
>>>             arow$Market.Cap.13f, ][1,]
>>>         } else {
>>>           bestpeer = industrypeers[nrow(industrypeers),]
>>>         }
>>>         bestpeer$Quarters.Since.IPO.Issue =
>>>           arow$Quarters.Since.IPO.Issue
>>>
>>>         #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
>>>         result = rbind(result, as.matrix(bestpeer))
>>>       }
>>>     }
>>>     #result = rbind(result, tfdata_quarter)
>>>     print (aquarter)
>>>   }
>>>
>>>   result = as.data.frame(result)
>>>   names(result) = colnames
>>>   return(result)
>>>
>>> }
>>>
>>> ========= end of my function =============
>>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
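
For reference, the matching rule in steps 1 to 4 of the quoted message can
probably be done without the inner per-IPO loop. The sketch below is only one
possible approach under some assumptions, not a tested replacement: the
function name match_ipos_sketch is hypothetical, the column names are taken
from the quoted function, and it assumes there are no missing values in
IPO.Flag, Quarters.Since.Latest.Issue or Market.Cap.13f. It sorts the peers
once per quarter/industry group, uses findInterval() to find the closest
bigger peer for all IPOs in the group at once, and calls rbind only once at
the end. It needs R 3.3.0 or later for the left.open argument.

```r
# Sketch only: match each IPO row to a non-issuer in the same quarter and
# industry. match_ipos_sketch is a hypothetical name, not from the thread.
match_ipos_sketch <- function(tfdata, quarters_since_issue = 40) {
  is_ipo  <- tfdata$IPO.Flag == 1
  is_peer <- tfdata$IPO.Flag == 0 &
    tfdata$Quarters.Since.Latest.Issue > quarters_since_issue
  ipos  <- tfdata[is_ipo, ]
  peers <- tfdata[is_peer, ]

  # One group per quarter/industry pair, so each group is sorted only once.
  ipo_groups  <- split(ipos,  paste(ipos$DATE,  ipos$HSICIG,  sep = "|"))
  peer_groups <- split(peers, paste(peers$DATE, peers$HSICIG, sep = "|"))

  pieces <- lapply(names(ipo_groups), function(key) {
    p <- peer_groups[[key]]
    if (is.null(p)) return(NULL)        # no eligible peers for this group
    g <- ipo_groups[[key]]
    p <- p[order(p$Market.Cap.13f), ]
    # Count of peers strictly smaller than each IPO, plus one, is the
    # position of the first peer with market cap >= the IPO's market cap.
    idx <- findInterval(g$Market.Cap.13f, p$Market.Cap.13f,
                        left.open = TRUE) + 1
    idx <- pmin(idx, nrow(p))           # all peers smaller: take the biggest
    out <- p[idx, ]
    out$Quarters.Since.IPO.Issue <- g$Quarters.Since.IPO.Issue
    out
  })
  do.call(rbind, pieces)                # a single rbind at the end
}
```

It would be called as matched = match_ipos_sketch(tfdata). The result is a
data frame with one row per matched IPO, ordered by quarter/industry group
rather than in the order the original loop produces, and it is NULL if
nothing matches. Because the rows never go through as.matrix(), the columns
keep their original types.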

Received on Fri 06 Jun 2008 - 16:33:53 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 06 Jun 2008 - 18:30:36 GMT.
