Re: [R] Improving data processing efficiency

From: Charles C. Berry
Date: Fri, 06 Jun 2008 17:23:32 -0700

On Fri, 6 Jun 2008, Daniel Folkinshteyn wrote:

>> install.packages("profr")
>> library(profr)
>> p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(...))
>> plot(p)
>> That should at least help you see where the slow bits are.
>> Hadley

> so profiling reveals that '[.data.frame' and '[[.data.frame' and '[' are the
> biggest timesuckers...
> i suppose i'll try using matrices and see how that stacks up (since all my
> cols are numeric, should be a problem-free approach).
> but i'm really wondering if there isn't some neat vectorized approach i could
> use to avoid at least one of the nested loops...

As for a vectorized solution, I'll bet you could do ALL the lookups of non-issuers for all issuers with a single call to findInterval() (modulo some cleanup afterwards), but the trickery needed to do that would make your code a bit opaque.

And in the end I doubt it would beat mapply() (read on...) by enough to make it worthwhile.


What you are doing is conditional on industry group and quarter.

So using

 	indus.quarter <- with(tfdat,
 		paste(as.character(DATE), as.character(HSICIG), sep="."))

and then calls like this:

 	split( <various> , indus.quarter[ relevant.subset ] )

you can create:

 	a list of all issuer market caps according to quarter and group,

 	a list of all non-issuer caps (that satisfy your 'since quarter'
 	restriction) according to quarter and group,

 	a list of all non issuer indexes (i.e. row numbers) that satisfy
 	that restriction according to quarter and group
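
A minimal sketch of building those three lists on toy data. The column names DATE, HSICIG and Market.Cap.13f come from the thread; the logical columns `issuer` and `eligible` (standing in for the 'since quarter' restriction) are hypothetical, as are all the values.

```r
# Toy data; `issuer` and `eligible` are made-up stand-ins for however the
# real data marks issuers and the 'since quarter' restriction.
tfdat <- data.frame(
  DATE           = c(1, 1, 1, 1, 2, 2, 2),
  HSICIG         = c(10, 10, 10, 10, 10, 10, 10),
  Market.Cap.13f = c(50, 40, 60, 55, 70, 65, 80),
  issuer         = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  eligible       = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# One key per row: quarter and industry group pasted together
indus.quarter <- with(tfdat,
  paste(as.character(DATE), as.character(HSICIG), sep = "."))

is.iss <- tfdat$issuer
is.non <- !tfdat$issuer & tfdat$eligible

# Three parallel lists, each keyed by quarter.group
issuer.caps    <- split(tfdat$Market.Cap.13f[is.iss], indus.quarter[is.iss])
nonissuer.caps <- split(tfdat$Market.Cap.13f[is.non], indus.quarter[is.non])
nonissuer.rows <- split(which(is.non),                indus.quarter[is.non])
```

If some quarter.group has issuers but no eligible non-issuers (or the reverse), the lists will not line up; turning indus.quarter into a factor first, so every split() shares the same levels, is one way to keep them parallel.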

Then you write a function that takes the elements of each list for a given 
quarter-industry group, looks up the matching non-issuers for each issuer, 
and returns their indexes.

findInterval() will allow you to do this lookup for all issuers in one 
industry group in a given quarter simultaneously and greatly speed this 
process (but you will need to deal with the possible non-uniqueness of the 
non-issuer caps - perhaps by adding a tiny jitter() to the values).
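
Here is a rough sketch of such a lookup for one quarter.group. The helper name `match.below` is made up, and it assumes "matching" means the non-issuer with the largest cap not exceeding the issuer's cap; adjust to whatever your matching rule really is.

```r
# Hypothetical helper: for each issuer cap, return the row index of the
# non-issuer with the largest cap that does not exceed it (NA if none).
match.below <- function(iss.caps, non.caps, non.rows) {
  ord <- order(non.caps)               # findInterval() needs sorted breakpoints
  pos <- findInterval(iss.caps, non.caps[ord])
  pos[pos == 0] <- NA                  # issuer is smaller than every non-issuer
  non.rows[ord][pos]
}

# Three issuers looked up at once against three non-issuers
res <- match.below(iss.caps = c(50, 30, 90),
                   non.caps = c(60, 40, 80),
                   non.rows = c(3L, 2L, 7L))
```

With tied non-issuer caps, findInterval() returns the last of the tied positions, so only one of the tied rows is reported; that is the non-uniqueness mentioned above.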

Then you feed the function and the lists to mapply().

The result is a list of indexes on the original data.frame. You can 
unsplit() this if you like, then use those indexes to build your final 
"result" data.frame.
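
The mapply() and unsplit() steps might look roughly like this. The three lists, the helper `match.below` and the vector `issuer.key` are all hypothetical toy stand-ins, and the matching rule (largest non-issuer cap not exceeding the issuer's) is an assumption.

```r
# Toy parallel lists keyed by quarter.group
issuer.caps    <- list("1.10" = 50,        "2.10" = c(70, 60))
nonissuer.caps <- list("1.10" = c(40, 60), "2.10" = c(65, 80))
nonissuer.rows <- list("1.10" = c(2L, 3L), "2.10" = c(6L, 7L))

# quarter.group key of each issuer, in the issuers' original order
issuer.key <- c("1.10", "2.10", "2.10")

# Hypothetical per-group lookup built on findInterval()
match.below <- function(iss.caps, non.caps, non.rows) {
  ord <- order(non.caps)
  pos <- findInterval(iss.caps, non.caps[ord])
  pos[pos == 0] <- NA                  # no non-issuer at or below this cap
  non.rows[ord][pos]
}

# One call per quarter.group; SIMPLIFY = FALSE keeps the result a list
matched <- mapply(match.below, issuer.caps, nonissuer.caps, nonissuer.rows,
                  SIMPLIFY = FALSE)

# Back to one row index per issuer, in the issuers' original order
idx <- unsplit(matched, issuer.key)
```

idx can then be used to subset the original data.frame once, rather than building small data.frames inside a loop.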



p.s. and if this all seems like too much work, you should at least avoid 
needlessly creating data.frames. Specifically

reorder things so that

 	   industrypeers = <etc>

is only done ONCE for each industry group by quarter combination and 
change stuff like

 	nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0

to

 	any( industrypeers$Market.Cap.13f >= arow$Market.Cap.13f )
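
A small check, on made-up values, that the two forms agree while the second never builds an intermediate data.frame. The names industrypeers, arow and Market.Cap.13f are from the thread; the `other` column and the numbers are invented.

```r
# Toy stand-ins for the objects in the original loop
industrypeers <- data.frame(Market.Cap.13f = c(40, 60, 80),
                            other          = c("a", "b", "c"))
arow <- data.frame(Market.Cap.13f = 55)

# Subsets the data.frame (a call to `[.data.frame`) only to count rows
slow <- nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0

# Works on the plain numeric column and a logical vector
fast <- any(industrypeers$Market.Cap.13f >= arow$Market.Cap.13f)

stopifnot(identical(slow, fast))
```

The two can differ when caps are missing: any() may return NA, while the data.frame subset keeps NA rows and counts them.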

> ______________________________________________
> mailing list
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.

Charles C. Berry                            (858) 534-2098
Dept of Family/Preventive Medicine
UC San Diego
La Jolla, San Diego 92093-0901
Received on Sat 07 Jun 2008 - 00:30:07 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 07 Jun 2008 - 03:30:39 GMT.
