Re: [R] Improving data processing efficiency

From: Daniel Folkinshteyn <dfolkins_at_gmail.com>
Date: Fri, 06 Jun 2008 13:57:39 -0400

Ok, sorry about the zip, then. :) Thanks for taking the trouble to clue me in as to the best posting procedure!

well, here's a dput-ed version of the small data subset you can use for testing. below that, an updated version of the function, with extra explanatory comments, and producing an extra column showing exactly what is matched to what.

to test, just run the function, with the dataset as sole argument.

Thanks again; i'd appreciate any input on this.

structure(list(PERMNO = c(10001L, 10001L, 10298L, 10298L, 10484L,

10484L, 10515L, 10515L, 10634L, 10634L, 11048L, 11048L, 11237L,
11294L, 11294L, 11338L, 11338L, 11404L, 11404L, 11587L, 11587L,
11591L, 11591L, 11737L, 11737L, 11791L, 11809L, 11809L, 11858L,
11858L, 11955L, 11955L, 12003L, 12003L, 12016L, 12016L, 12223L,
12223L, 12758L, 12758L, 13688L, 13688L, 16117L, 16117L, 17770L,
17770L, 21514L, 21514L, 21792L, 21792L, 21821L, 21821L, 22437L,
22437L, 22947L, 22947L, 23027L, 23027L, 23182L, 23182L, 23536L,
23536L, 23712L, 23712L, 24053L, 24053L, 24117L, 24117L, 24256L,
24256L, 24299L, 24299L, 24352L, 24352L, 24379L, 24379L, 24467L,
24467L, 24679L, 24679L, 24870L, 24870L, 25056L, 25056L, 25208L,
25208L, 25232L, 25232L, 25241L, 25590L, 25590L, 26463L, 26463L,
26470L, 26470L, 26614L, 26614L, 27385L, 27385L, 29196L, 29196L,
30411L, 30411L, 32943L, 32943L, 38893L, 38893L, 40708L, 40708L,
41005L, 41005L, 42817L, 42817L, 42833L, 42833L, 43668L, 43668L,
45947L, 45947L, 46017L, 46017L, 48274L, 48274L, 49971L, 49971L,
53786L, 53786L, 53859L, 53859L, 54199L, 54199L, 56371L, 56952L,
56952L, 57277L, 57277L, 57381L, 57381L, 58202L, 58202L, 59395L,
59395L, 59935L, 60169L, 60169L, 61188L, 61188L, 61444L, 61444L,
62690L, 62690L, 62842L, 62842L, 64290L, 64290L, 64418L, 64418L,
64450L, 64450L, 64477L, 64477L, 64557L, 64557L, 64646L, 64646L,
64902L, 64902L, 67774L, 67774L, 68910L, 68910L, 70471L, 70471L,
74406L, 74406L, 75091L, 75091L, 75304L, 75304L, 75743L, 75964L,
75964L, 76026L, 76026L, 76162L, 76170L, 76173L, 78530L, 78530L,
78682L, 78682L, 81569L, 81569L, 82502L, 82502L, 83337L, 83337L,
83919L, 83919L, 88242L, 88242L, 90852L, 90852L, 91353L, 91353L
), DATE = c(19900331, 19900630, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900630, 19900331, 19900630, 19900331, 19900630,
19900630, 19900331, 19900630, 19900331, 19900331, 19900630, 19900630,
19900331, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630,
19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900331,
19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900630,
19900331, 19900331, 19900630, 19900331, 19900630, 19900630, 19900331,
19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331,
19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331,
19900630, 19900630, 19900331, 19900331, 19900630, 19900331, 19900630,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900630, 19900331, 19900331, 19900630,
19900630, 19900331, 19900630, 19900630, 19900331, 19900630, 19900331,
19900630, 19900331, 19900331, 19900630, 19900331, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630, 19900630, 19900331, 19900331,
19900630, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900630, 19900331, 19900331, 19900630, 19900331,
19900630, 19900630, 19900331, 19900630, 19900331, 19900331, 19900630,
19900630, 19900331, 19900331, 19900630, 19900331, 19900630, 19900630,
19900331, 19900630, 19900331, 19900630, 19900630, 19900630, 19900630,
19900331, 19900630, 19900331, 19900630, 19900331, 19900630, 19900630,
19900331, 19900331, 19900630, 19900331, 19900630, 19900331, 19900630,
19900331, 19900630, 19900331, 19900630), Shares.Owned = c(50100,
50100, 250000, 293500, 3656629, 3827119, 4132439, 3566591, 2631193, 2500301, 775879, 816879, 38700, 1041600, 1070300, 533768, 558815, 61384492, 60466567, 194595, 196979, 359946, 314446, 106770, 107070, 20242, 1935098, 2099403, 1902125, 1766750, 41991, 41991, 34490, 36290, 589400, 596700, 1549395, 1759440, 854473, 762903, 156366785,
98780287, 2486389, 2635718, 122264, 122292, 25455916, 25458658,
71645490, 71855722, 30969596, 30409838, 2738576, 2814490, 20846605,
20930233, 1148299, 505415, 396388, 385714, 25239923, 24117950,
73465526, 73084616, 8096614, 7595742, 3937930, 3820215, 20884821,
19456342, 2127331, 2188276, 2334515, 2813347, 8267264, 8544084,
783277, 810742, 742048, 512956, 9659658, 9436873, 40107717, 41234384, 9111755, 9708782, 12815719, 13144148, 1146100, 8292392, 8271030, 282650, 281273, 4196126, 4273758, 2489363, 2734182, 1579681, 1369192, 51947585, 51941430, 54673, 52585, 317601, 314876, 62626258, 63341772, 8977553, 8940106, 4478872, 4315631, 1246339, 1227442, 68484747, 68041081, 22679902, 21775270, 927147, 936881, 2626449, 2245552, 14029366, 14304855, 2434123, 2184358, 77479654, 81754241, 333070, 282967, 241146, 256146, 11419, 819092, 798490, 1403179,
1326018, 238974451, 237684105, 1889699, 2317096, 4887641, 5972387,
3567239, 1024595, 993627, 5254732, 5459404, 413146, 432697, 5307595,
4813261, 7717872, 8689444, 2431341, 2372096, 909359, 868068,
2110670, 2055349, 23774859, 23573345, 4234466, 4143534, 1192314,
1255105, 3052000, 2605700, 5566270, 5972761, 1470173, 1448403,
28065345, 32961737, 1844441, 2247991, 651758, 655658, 65864806, 82392617, 1942906, 14800, 14657, 6600, 5534, 394064, 163000,
2499320, 1123624, 1227987, 198000, 241000, 3681688, 3409586,
2416988, 2407798, 55081, 48091, 480000, 785710, 1040147, 1171854,
1363994, 1555229, 199237, 192637), Shares.Outstanding.13f = c(1,
1, 7, 7, 8, 8, 8, 8, 6, 6, 8, 8, 4, 4, 4, 18, 19, 228, 228, 2,
2, 3, 3, 5, 5, 7, 9, 9, 6, 6, 2, 2, 3, 3, 7, 7, 14, 15, 3, 3,
429, 429, 17, 16, 2, 2, 43, 41, 127, 126, 86, 86, 15, 15, 51,
51, 7, 7, 3, 3, 67, 67, 211, 211, 35, 35, 14, 14, 49, 49, 12,
12, 22, 22, 31, 31, 4, 4, 4, 5, 34, 34, 64, 64, 56, 56, 27, 27,
47, 28, 28, 2, 2, 10, 10, 8, 8, 13, 13, 87, 87, 1, 1, 3, 3, 101,
101, 38, 36, 49, 56, 22, 22, 245, 247, 36, 35, 6, 6, 22, 22, 30, 30, 11, 11, 151, 151, 2, 2, 3, 3, 4, 4, 4, 10, 10, 468, 459,
10, 10, 16, 16, 27, 8, 8, 19, 19, 3, 3, 7, 7, 15, 15, 6, 6, 6,
6, 13, 13, 60, 60, 11, 11, 10, 10, 8, 8, 153, 152, 7, 7, 206,
206, 5, 5, 4, 4, 246, 299, 4, 0, 0, 13, 13, 7, 5, 10, 7, 7, 11,
11, 16, 16, 6, 6, 1, 1, 7, 7, 10, 10, 5, 5, 10, 10), Percent.Inst.Owned = c(0.0501,
0.0501, 0.0357142857142857, 0.0419285714285714, 0.457078625,
0.478389875, 0.516554875, 0.445823875, 0.438532166666667, 
0.416716833333333,
0.096984875, 0.102109875, 0.009675, 0.2604, 0.267575, 0.0296537777777778,
0.0294113157894737, 0.269230228070175, 0.26520424122807, 0.0972975,
0.0984895, 0.119982, 0.104815333333333, 0.021354, 0.021414, 
0.00289171428571429,
0.215010888888889, 0.233267, 0.317020833333333, 0.294458333333333,
0.0209955, 0.0209955, 0.0114966666666667, 0.0120966666666667,
0.0842, 0.0852428571428571, 0.110671071428571, 0.117296, 0.284824333333333,
0.254301, 0.36449134032634, 0.230257079254079, 0.146258176470588,
0.164732375, 0.061132, 0.061146, 0.591998046511628, 0.62094287804878,
0.564137716535433, 0.570283507936508, 0.360111581395349, 0.353602767441860,
0.182571733333333, 0.187632666666667, 0.408756960784314, 0.410396725490196,
0.164042714285714, 0.0722021428571429, 0.132129333333333, 
0.128571333333333,
0.376715268656716, 0.359969402985075, 0.348177848341232, 0.346372587677725,
0.231331828571429, 0.2170212, 0.281280714285714, 0.2728725, 
0.426220836734694,
0.397068204081633, 0.177277583333333, 0.182356333333333, 0.106114318181818,
0.127879409090909, 0.266685935483871, 0.275615612903226, 0.19581925,
0.2026855, 0.185512, 0.1025912, 0.284107588235294, 0.277555088235294,
0.626683078125, 0.64428725, 0.162709910714286, 0.173371107142857,
0.474656259259259, 0.486820296296296, 0.0243851063829787, 
0.296156857142857,
0.295393928571429, 0.141325, 0.1406365, 0.4196126, 0.4273758,
0.311170375, 0.34177275, 0.121513923076923, 0.105322461538462,
0.59709867816092, 0.597027931034483, 0.054673, 0.052585, 0.105867,
0.104958666666667, 0.62006196039604, 0.627146257425743, 0.236251394736842,
0.248336277777778, 0.0914055510204082, 0.0770648392857143, 
0.0566517727272727,
0.0557928181818182, 0.279529579591837, 0.275469963562753, 
0.629997277777778,
0.622150571428571, 0.1545245, 0.156146833333333, 0.119384045454545,
0.102070545454545, 0.467645533333333, 0.4768285, 0.221283909090909,
0.198578, 0.513110291390729, 0.541418814569536, 0.166535, 0.1414835,
0.080382, 0.085382, 0.00285475, 0.204773, 0.1996225, 0.1403179,
0.1326018, 0.510629168803419, 0.517830294117647, 0.1889699, 0.2317096,
0.3054775625, 0.3732741875, 0.132119962962963, 0.128074375, 0.124203375,
0.276564842105263, 0.287337052631579, 0.137715333333333, 0.144232333333333,
0.758227857142857, 0.687608714285714, 0.5145248, 0.579296266666667,
0.4052235, 0.395349333333333, 0.151559833333333, 0.144678, 
0.162359230769231,
0.158103769230769, 0.39624765, 0.392889083333333, 0.384951454545455,
0.376684909090909, 0.1192314, 0.1255105, 0.3815, 0.3257125, 
0.0363808496732026,
0.0392944802631579, 0.210024714285714, 0.206914714285714, 
0.136239538834951,
0.160008432038835, 0.3688882, 0.4495982, 0.1629395, 0.1639145,
0.267743113821138, 0.275560591973244, 0.4857265, Inf, Inf, 
0.000507692307692308,
0.000425692307692308, 0.0562948571428571, 0.0326, 0.249932, 
0.160517714285714,
0.175426714285714, 0.018, 0.0219090909090909, 0.2301055, 0.213099125,
0.402831333333333, 0.401299666666667, 0.055081, 0.048091, 
0.0685714285714286,
0.112244285714286, 0.1040147, 0.1171854, 0.2727988, 0.3110458,
0.0199237, 0.0192637), Latest.Issue.Date.ByPERMNO = c(19860108,
19860108, 19600101, 19600101, 19600101, 19600101, 19870728, 19870728,
19870501, 19870501, 19870805, 19870805, 19600101, 19600101, 19600101,
19600101, 19600101, 19730523, 19730523, 19600101, 19600101, 19870811,
19870811, 19870930, 19870930, 19600101, 19880729, 19880729, 19880225,
19880225, 19880602, 19880602, 19860610, 19860610, 19880802, 19880802,
19890629, 19890629, 19600101, 19600101, 19821109, 19821109, 19860619,
19860619, 19871117, 19871117, 19600101, 19600101, 19890308, 19890308,
19900208, 19900208, 19861120, 19861120, 19880803, 19880803, 19600101,
19600101, 19890216, 19890216, 19761202, 19761202, 19890919, 19890919,
19810623, 19810623, 19770615, 19770615, 19831004, 19831004, 19830616,
19830616, 19810519, 19810519, 19850311, 19850311, 19781130, 19781130,
19841016, 19900515, 19800904, 19800904, 19830825, 19830825, 19830601,
19830601, 19811110, 19811110, 19600101, 19890309, 19890309, 19850529,
19850529, 19881122, 19881122, 19840620, 19840620, 19740305, 19740305,
19860718, 19860718, 19600101, 19600101, 19860207, 19860207, 19891003,
19891003, 19870403, 19870403, 19600101, 19600101, 19790403, 19790403,
19850528, 19850528, 19830322, 19830322, 19761202, 19761202, 19841114,
19841114, 19800826, 19800826, 19880517, 19880517, 19860516, 19860516,
19891122, 19891122, 19600101, 19600101, 19600101, 19871119, 19871119,
19760624, 19760624, 19851206, 19851206, 19890615, 19890615, 19860805,
19860805, 19600101, 19890919, 19890919, 19860501, 19860501, 19600101,
19600101, 19890308, 19890308, 19900125, 19900125, 19890714, 19890714,
19880412, 19880412, 19890809, 19890809, 19870306, 19870306, 19751112,
19751112, 19870604, 19870604, 19810625, 19810625, 19600101, 19600101,
19860416, 19860416, 19891027, 19891027, 19890125, 19890125, 19860502,
19860502, 19600101, 19600101, 19900405, 19600101, 19600101, 19600101,
19600101, 19900412, 19900514, 19900518, 19890518, 19890518, 19600101,
19600101, 19900117, 19900117, 19891214, 19891214, 19600101, 19600101, 19600101, 19600101, 19851206, 19851206, 19851211, 19851211, 19600101, 19600101), Quarters.Since.19800101 = c(41L, 42L, 42L, 41L, 41L,
42L, 41L, 42L, 41L, 42L, 41L, 42L, 42L, 42L, 41L, 41L, 42L, 41L,
42L, 41L, 42L, 42L, 41L, 42L, 41L, 42L, 42L, 41L, 42L, 41L, 41L,
42L, 42L, 41L, 42L, 41L, 41L, 42L, 41L, 42L, 41L, 42L, 41L, 42L,
41L, 42L, 42L, 41L, 42L, 41L, 42L, 41L, 42L, 41L, 41L, 42L, 42L,
41L, 41L, 42L, 42L, 41L, 41L, 42L, 41L, 42L, 42L, 41L, 41L, 42L,
41L, 42L, 42L, 41L, 41L, 42L, 41L, 42L, 41L, 42L, 41L, 42L, 41L,
42L, 41L, 42L, 42L, 41L, 41L, 41L, 42L, 42L, 41L, 41L, 42L, 41L,
42L, 42L, 41L, 41L, 42L, 41L, 42L, 42L, 41L, 42L, 41L, 42L, 41L,
41L, 42L, 41L, 42L, 41L, 42L, 41L, 42L, 41L, 42L, 41L, 42L, 42L,
41L, 41L, 42L, 41L, 42L, 42L, 41L, 41L, 42L, 42L, 41L, 42L, 42L,
41L, 42L, 41L, 42L, 41L, 41L, 42L, 41L, 41L, 42L, 41L, 42L, 41L,
42L, 42L, 41L, 41L, 42L, 41L, 42L, 41L, 42L, 41L, 42L, 41L, 42L,
42L, 41L, 41L, 42L, 41L, 42L, 42L, 41L, 42L, 41L, 41L, 42L, 42L,
41L, 41L, 42L, 41L, 42L, 42L, 41L, 42L, 41L, 42L, 42L, 42L, 42L,
41L, 42L, 41L, 42L, 41L, 42L, 42L, 41L, 41L, 42L, 41L, 42L, 41L,
42L, 41L, 42L, 41L, 42L), Quarters.Since.Latest.Issue = c(17L,
18L, 122L, 121L, 121L, 122L, 11L, 12L, 12L, 13L, 11L, 12L, 122L,
122L, 121L, 121L, 122L, 68L, 69L, 121L, 122L, 12L, 11L, 11L,
10L, 122L, 8L, 7L, 10L, 9L, 8L, 9L, 17L, 16L, 8L, 7L, 4L, 5L,
121L, 122L, 30L, 31L, 16L, 17L, 10L, 11L, 122L, 121L, 6L, 5L,
2L, 1L, 15L, 14L, 7L, 8L, 122L, 121L, 5L, 6L, 55L, 54L, 3L, 4L,
36L, 37L, 53L, 52L, 26L, 27L, 28L, 29L, 37L, 36L, 21L, 22L, 46L,
47L, 22L, 1L, 39L, 40L, 27L, 28L, 28L, 29L, 35L, 34L, 121L, 5L,
6L, 21L, 20L, 6L, 7L, 24L, 25L, 66L, 65L, 15L, 16L, 121L, 122L,
18L, 17L, 3L, 2L, 13L, 12L, 121L, 122L, 44L, 45L, 20L, 21L, 29L,
30L, 54L, 55L, 22L, 23L, 40L, 39L, 8L, 9L, 16L, 17L, 3L, 2L,
121L, 122L, 122L, 10L, 11L, 57L, 56L, 19L, 18L, 5L, 4L, 15L, 16L, 121L, 3L, 4L, 16L, 17L, 121L, 122L, 6L, 5L, 1L, 2L, 3L, 4L, 8L, 9L, 3L, 4L, 13L, 14L, 59L, 58L, 12L, 13L, 36L, 37L, 122L,
121L, 17L, 16L, 2L, 3L, 6L, 5L, 16L, 17L, 121L, 122L, 1L, 121L,
122L, 121L, 122L, 1L, 1L, 1L, 4L, 5L, 121L, 122L, 1L, 2L, 3L,
2L, 121L, 122L, 121L, 122L, 18L, 19L, 18L, 19L, 121L, 122L),
     ALTPRC = c(9.9375, 9.875, 0.45313, 0.67188, 7.875, 10, 18,
     22, 14.75, 9.75, 0.375, 0.15625, 3.9375, 16, 14.25, 7, 7.125,
     27.25, 23.375, 10.75, 13, 3.125, 3.125, 2.6875, 3.4375, 0.5,
     8.75, 7, 16.875, 12.375, 2.40625, 3.96875, 4, 4.625, 4.5,
     5.125, 26.25, 28.75, 4.5, 5.5, 21.75, 23.25, 15, 14.375,
     16.625, 14, 50.5, 48.75, 31.875, 33.125, 41.5, 46, 21, 22.125,
     30.75, 30.125, 10.375, 5.5, 11.5, 11, 29, 28.875, 27.25,
     26.75, 22.375, 22.25, 33.375, 35, 21, 19.75, 29.875, 28.875,
     22.125, 20.125, 21, 18.875, 24.625, 26.75, 21.75, 22, 22.125,
     21.125, 24.75, 26.75, 42.75, 43.5, 13.375, 29.625, 0.07813,
     25.125, 23.75, 18, 20, 17.5, 18.125, 18.875, 19, 28.875,
     30, 23.875, 23.625, 15.5, 15.625, 17.5, 19.5, 34.75, 30.75,
     2, 2.25, 18.625, 17.5, 21.375, 19.875, 45.25, 20.125, 37.25,
     41.75, 32.25, 32.5, 23.125, 21.875, 35.25, 38.75, 27.875,
     27.375, 35.875, 42.125, 24.25, 24.5, 25.125, 23.875, 2.0625,
     16.75, 16.25, 34.625, 37.75, 40, 31.625, 19.375, 20, 30.875,
     29.375, 0.125, 17.625, 17, 16.625, 17.75, 12.625, 13.25,
     26, 19.75, 15.25, 18.625, 18.125, 18, 16.375, 15.625, 18.5,
     19, 12.875, 14.375, 32.375, 33.375, 16.375, 16.375, 1.625,
     2.8125, 13.875, 14.625, 4.625, 4.5, 18.5, 24.125, 6.375,
     5.875, 10.625, 11.625, 6.625, 7.375, 14.75, 0.8125, 0.6875,
     2.125, 2.375, 20.25, 7.625, 34, 15.25, 15, 2.09375, 2.375,
     19.5, 18.125, 38.5, 30.75, 36, 35.75, 9.375, 11.25, 21.25,
     18.625, 6, 5.25, 1.15625, 1.25), HSICIG = c(492, 492, 494,
     494, 495, 495, 495, 495, 495, 495, 493, 493, 495, 495, 495,
     495, 495, 493, 493, 492, 492, 495, 495, 495, 495, 495, 495,
     495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495, 495,
     495, 493, 493, 494, 494, 492, 492, 492, 492, 493, 493, 492,
     492, 493, 493, 493, 493, 495, 495, 492, 492, 493, 493, 493,
     493, 493, 493, 493, 493, 493, 493, 493, 493, 493, 493, 493,
     493, 493, 493, 492, 492, 493, 493, 492, 492, 493, 493, 492,
     492, 495, 492, 492, 494, 494, 492, 492, 492, 492, 493, 493,
     492, 492, 494, 494, 492, 492, 492, 492, 492, 492, 492, 492,
     493, 493, 493, 493, 492, 492, 493, 493, 493, 493, 492, 492,
     492, 492, 495, 495, 494, 494, 494, 494, 495, 492, 492, 493,
     493, 495, 495, 492, 492, 492, 492, 495, 492, 492, 492, 492,
     492, 492, 492, 492, 494, 494, 492, 492, 492, 492, 492, 492,
     495, 495, 493, 493, 492, 492, 495, 495, 492, 492, 492, 492,
     495, 495, 495, 495, 495, 495, 492, 492, 495, 495, 495, 492,
     492, 495, 495, 495, 492, 492, 494, 494, 492, 492, 495, 495,
     492, 492, 493, 493, 494, 494, 495, 495, 495, 495), Market.Cap.13f = 
c(9937500,
     9875000, 3171910, 4703160, 6.3e+07, 8e+07, 1.44e+08, 1.76e+08,
     88500000, 58500000, 3e+06, 1250000, 15750000, 6.4e+07, 5.7e+07,
     1.26e+08, 135375000, 6.213e+09, 5329500000, 21500000, 2.6e+07,
     9375000, 9375000, 13437500, 17187500, 3500000, 78750000,
     6.3e+07, 101250000, 74250000, 4812500, 7937500, 1.2e+07,
     13875000, 31500000, 35875000, 367500000, 431250000, 13500000,
     16500000, 9330750000, 9974250000, 2.55e+08, 2.3e+08, 33250000,
     2.8e+07, 2171500000, 1998750000, 4048125000, 4173750000,
     3.569e+09, 3.956e+09, 3.15e+08, 331875000, 1568250000, 1536375000,
     72625000, 38500000, 34500000, 3.3e+07, 1.943e+09, 1934625000,
     5749750000, 5644250000, 783125000, 778750000, 467250000,
     4.9e+08, 1.029e+09, 967750000, 358500000, 346500000, 486750000,
     442750000, 6.51e+08, 585125000, 98500000, 1.07e+08, 8.7e+07,
     1.1e+08, 752250000, 718250000, 1.584e+09, 1.712e+09, 2.394e+09,
     2.436e+09, 361125000, 799875000, 3672110, 703500000, 6.65e+08,
     3.6e+07, 4e+07, 1.75e+08, 181250000, 1.51e+08, 1.52e+08,
     375375000, 3.9e+08, 2077125000, 2055375000, 15500000, 15625000,
     52500000, 58500000, 3509750000, 3105750000, 7.6e+07, 8.1e+07,
     912625000, 9.8e+08, 470250000, 437250000, 11086250000, 4970875000,
     1.341e+09, 1461250000, 193500000, 1.95e+08, 508750000, 481250000,
     1057500000, 1162500000, 306625000, 301125000, 5417125000,
     6360875000, 48500000, 4.9e+07, 75375000, 71625000, 8250000,
     6.7e+07, 6.5e+07, 346250000, 377500000, 1.872e+10, 14515875000,
     193750000, 2e+08, 4.94e+08, 4.7e+08, 3375000, 1.41e+08, 1.36e+08,
     315875000, 337250000, 37875000, 39750000, 1.82e+08, 138250000,
     228750000, 279375000, 108750000, 1.08e+08, 98250000, 93750000,
     240500000, 2.47e+08, 772500000, 862500000, 356125000, 367125000,
     163750000, 163750000, 1.3e+07, 22500000, 2122875000, 2.223e+09,
     32375000, 31500000, 3.811e+09, 4969750000, 31875000, 29375000,
     42500000, 46500000, 1629750000, 2205125000, 5.9e+07, 0, 0,
     27625000, 30875000, 141750000, 38125000, 3.4e+08, 106750000,
     1.05e+08, 23031250, 26125000, 3.12e+08, 2.9e+08, 2.31e+08,
     184500000, 3.6e+07, 35750000, 65625000, 78750000, 212500000,
     186250000, 3e+07, 26250000, 11562500, 12500000), IPO.Flag = c(0,
     0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
     1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
     0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0), IPO.Issue.Date = c(NA,
     NA, NA, NA, NA, NA, 19860724, 19860724, NA, NA, 19870127,
     19870127, NA, NA, NA, NA, NA, NA, NA, NA, NA, 19870811, 19870811,
     19870930, 19870930, NA, 19871124, 19871124, 19880225, 19880225,
     19880602, 19880602, NA, NA, 19880802, 19880802, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 19710324, 19710324,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, 19710617, 19710617, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     19831014, 19831014, 19861016, 19861016, NA, NA, 19860502,
     19860502, NA, NA, 19890419, NA, NA, NA, NA, 19900412, 19900514,
     19900518, NA, NA, NA, NA, NA, NA, 19830603, 19830603, NA,
     NA, NA, NA, 19851206, 19851206, 19851211, 19851211, NA, NA
     ), Quarters.Since.IPO.Issue = c(NA, NA, NA, NA, NA, NA, 15L,
     16L, NA, NA, 13L, 14L, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     12L, 11L, 11L, 10L, NA, 11L, 10L, 10L, 9L, 8L, 9L, NA, NA,
     8L, 7L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, 77L, 78L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 77L, 76L,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
     NA, 27L, 26L, 14L, 15L, NA, NA, 16L, 17L, NA, NA, 5L, NA,
     NA, NA, NA, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, 29L, 28L,
     NA, NA, NA, NA, 18L, 19L, 18L, 19L, NA, NA)), .Names = c("PERMNO",
"DATE", "Shares.Owned", "Shares.Outstanding.13f", "Percent.Inst.Owned",
"Latest.Issue.Date.ByPERMNO", "Quarters.Since.19800101", 
"Quarters.Since.Latest.Issue",

"ALTPRC", "HSICIG", "Market.Cap.13f", "IPO.Flag", "IPO.Issue.Date",
"Quarters.Since.IPO.Issue"), row.names = c(79L, 85L, 9902L, 9908L,
15739L, 15758L, 16673L, 16675L, 20159L, 20160L, 32879L, 32889L,
38023L, 39404L, 39409L, 40405L, 40420L, 43114L, 43116L, 47939L,
47953L, 48091L, 48120L, 52828L, 52837L, 54612L, 55002L, 55048L,
56506L, 56508L, 59230L, 59247L, 60454L, 60461L, 60845L, 60852L,
66143L, 66147L, 69439L, 69454L, 72218L, 72232L, 81826L, 81840L,
87882L, 87883L, 105814L, 105832L, 106687L, 106709L, 106867L,
106875L, 110008L, 110081L, 113124L, 113125L, 113448L, 113460L,
114419L, 114431L, 116222L, 116234L, 117215L, 117310L, 119463L,
119477L, 119913L, 119927L, 120787L, 120799L, 121214L, 121215L,
121541L, 121548L, 121670L, 121680L, 122420L, 122421L, 123629L,
123679L, 124479L, 124485L, 125607L, 125608L, 126683L, 126716L,
126911L, 126954L, 126986L, 128941L, 128979L, 132991L, 133048L,
133090L, 133091L, 134227L, 134228L, 137449L, 137465L, 146656L,
146710L, 151717L, 151728L, 162724L, 162738L, 186344L, 186346L,
194239L, 194251L, 195124L, 195125L, 203411L, 203426L, 203486L,
203487L, 206821L, 206863L, 218733L, 218734L, 219083L, 219084L,
232389L, 232401L, 241221L, 241222L, 262518L, 262556L, 263151L,
263154L, 264783L, 264811L, 275743L, 278957L, 278958L, 281230L,
281242L, 281957L, 281962L, 286492L, 286504L, 294444L, 294445L,
297641L, 298974L, 298988L, 304628L, 304669L, 306326L, 306339L,
315987L, 316013L, 316939L, 316940L, 327003L, 327032L, 327976L,
327977L, 328372L, 328386L, 328621L, 328622L, 329277L, 329289L,
329983L, 329984L, 331735L, 331746L, 350849L, 350887L, 357747L,
357750L, 366913L, 366917L, 380680L, 380749L, 385635L, 385642L,
394280L, 394281L, 410203L, 417419L, 417420L, 418842L, 418851L,
423401L, 423687L, 423795L, 494497L, 494498L, 496519L, 496520L,
576735L, 576737L, 590042L, 590057L, 606077L, 606087L, 620736L,
620737L, 704834L, 704837L, 749540L, 749573L, 754161L, 754162L
), class = "data.frame")

     colnames = names(tfdata) #grab the colnames, which we will shove back to the result at the end when we reconvert to data.frame

     quarterends = sort(unique(tfdata$DATE)) # the data are quarterly, all dates are quarter ends

# basic code logic:
# grab each quarter's data, in each quarter get the ipo subset, and
the eligible matching firm subset

# for each ipo from the ipo subset, select a matching firm from the
eligible matching firm subset

# the said selection is done based on industry group (HSICIG), and
market cap (Market.Cap.13f)

# Industry group has to be the same, market cap has to be 'closest
one from above', or if that is not available, then 'closest one from below'.

     for (aquarter in quarterends) {
         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
         tfdata_quarter_fitting_nonissuers = tfdata_quarter[ 
(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & (tfdata_quarter$IPO.Flag == 0), ]

         tfdata_quarter_ipoissuers = tfdata_quarter[ tfdata_quarter$IPO.Flag == 1, ]

         for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
             arow = tfdata_quarter_ipoissuers[i,]
             industrypeers = tfdata_quarter_fitting_nonissuers[ 
tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
             industrypeers = industrypeers[ 
order(industrypeers$Market.Cap.13f), ]
             if ( nrow(industrypeers) > 0 ) {
                 if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= 
arow$Market.Cap.13f, ]) > 0 ) {
                     bestpeer = 
industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
                 }
                 else {
                     bestpeer = industrypeers[nrow(industrypeers),]
                 }
                 bestpeer$Quarters.Since.IPO.Issue = 
arow$Quarters.Since.IPO.Issue
                 bestpeer$Peer.To.PERMNO = arow$PERMNO
                 result = rbind(result, as.matrix(bestpeer))
             }
         }
         print (aquarter)
     }

     result = as.data.frame(result)
     names(result) = c(colnames, 'Peer.To.PERMNO')
     return(result)

}
============== end function code===========

on 06/06/2008 01:35 PM Gabor Grothendieck said the following:

> I think the posting guide may not be clear enough and have suggested that
> it be clarified.  Hopefully this better communicates what is required and why
> in a shorter amount of space:
> 
> https://stat.ethz.ch/pipermail/r-devel/2008-June/049891.html
> 
> 
> On Fri, Jun 6, 2008 at 1:25 PM, Daniel Folkinshteyn <dfolkins_at_gmail.com> wrote:
>> i thought since the function code (which i provided in full) was pretty
>> short, it would be reasonably easy to just read the code and see what it's
>> doing.
>>
>> but ok, so... i am attaching a zip file, with a small sample of the data set
>> (tab delimited), and the function code, in a zip file (posting guidelines
>> claim that "some archive formats" are allowed, i assume zip is one of
>> them...
>>
>> would appreciate your comments! :)
>>
>> on 06/06/2008 12:05 PM Gabor Grothendieck said the following:
>>> Its summarized in the last line to r-help.  Note reproducible and
>>> minimal.
>>>
>>> On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn <dfolkins_at_gmail.com>
>>> wrote:
>>>> i did! what did i miss?
>>>>
>>>> on 06/06/2008 11:45 AM Gabor Grothendieck said the following:
>>>>> Try reading the posting guide before posting.
>>>>>
>>>>> On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn
>>>>> <dfolkins_at_gmail.com>
>>>>> wrote:
>>>>>> Anybody have any thoughts on this? Please? :)
>>>>>>
>>>>>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
>>>>>>> Hi everyone!
>>>>>>>
>>>>>>> I have a question about data processing efficiency.
>>>>>>>
>>>>>>> My data are as follows: I have a data set on quarterly institutional
>>>>>>> ownership of equities; some of them have had recent IPOs, some have
>>>>>>> not
>>>>>>> (I
>>>>>>> have a binary flag set). The total dataset size is 700k+ rows.
>>>>>>>
>>>>>>> My goal is this: For every quarter since issue for each IPO, I need to
>>>>>>> find a "matched" firm in the same industry, and close in market cap.
>>>>>>> So,
>>>>>>> e.g., for firm X, which had an IPO, i need to find a matched
>>>>>>> non-issuing
>>>>>>> firm in quarter 1 since IPO, then a (possibly different) non-issuing
>>>>>>> firm in
>>>>>>> quarter 2 since IPO, etc. Repeat for each issuing firm (there are
>>>>>>> about
>>>>>>> 8300
>>>>>>> of these).
>>>>>>>
>>>>>>> Thus it seems to me that I need to be doing a lot of data selection
>>>>>>> and
>>>>>>> subsetting, and looping (yikes!), but the result appears to be highly
>>>>>>> inefficient and takes ages (well, many hours). What I am doing, in
>>>>>>> pseudocode, is this:
>>>>>>>
>>>>>>> 1. for each quarter of data, getting out all the IPOs and all the
>>>>>>> eligible
>>>>>>> non-issuing firms.
>>>>>>> 2. for each IPO in a quarter, grab all the non-issuers in the same
>>>>>>> industry, sort them by size, and finally grab a matching firm closest
>>>>>>> in
>>>>>>> size (the exact procedure is to grab the closest bigger firm if one
>>>>>>> exists,
>>>>>>> and just the biggest available if all are smaller)
>>>>>>> 3. assign the matched firm-observation the same "quarters since issue"
>>>>>>> as
>>>>>>> the IPO being matched
>>>>>>> 4. rbind them all into the "matching" dataset.
>>>>>>>
>>>>>>> The function I currently have is pasted below, for your reference. Is
>>>>>>> there any way to make it produce the same result but much faster?
>>>>>>> Specifically, I am guessing eliminating some loops would be very good,
>>>>>>> but I
>>>>>>> don't see how, since I need to do some fancy footwork for each IPO in
>>>>>>> each
>>>>>>> quarter to find the matching firm. I'll be doing a few things similar
>>>>>>> to
>>>>>>> this, so it's somewhat important to up the efficiency of this. Maybe
>>>>>>> some of
>>>>>>> you R-fu masters can clue me in? :)
>>>>>>>
>>>>>>> I would appreciate any help, tips, tricks, tweaks, you name it! :)
>>>>>>>
>>>>>>> ========== my function below ===========
>>>>>>>
>>>>>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
>>>>>>> quarters_since_issue=40) {
>>>>>>>
>>>>>>>  result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is
>>>>>>> cheaper, so typecast the result to matrix
>>>>>>>
>>>>>>>  colnames = names(tfdata)
>>>>>>>
>>>>>>>  quarterends = sort(unique(tfdata$DATE))
>>>>>>>
>>>>>>>  for (aquarter in quarterends) {
>>>>>>>      tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
>>>>>>>
>>>>>>>      tfdata_quarter_fitting_nonissuers = tfdata_quarter[
>>>>>>> (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
>>>>>>> (tfdata_quarter$IPO.Flag == 0), ]
>>>>>>>      tfdata_quarter_ipoissuers = tfdata_quarter[
>>>>>>> tfdata_quarter$IPO.Flag
>>>>>>> == 1, ]
>>>>>>>
>>>>>>>      for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
>>>>>>>          arow = tfdata_quarter_ipoissuers[i,]
>>>>>>>          industrypeers = tfdata_quarter_fitting_nonissuers[
>>>>>>> tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
>>>>>>>          industrypeers = industrypeers[
>>>>>>> order(industrypeers$Market.Cap.13f), ]
>>>>>>>          if ( nrow(industrypeers) > 0 ) {
>>>>>>>              if ( nrow(industrypeers[industrypeers$Market.Cap.13f >=
>>>>>>> arow$Market.Cap.13f, ]) > 0 ) {
>>>>>>>                  bestpeer = industrypeers[industrypeers$Market.Cap.13f
>>>>>>>> = arow$Market.Cap.13f, ][1,]
>>>>>>>              }
>>>>>>>              else {
>>>>>>>                  bestpeer = industrypeers[nrow(industrypeers),]
>>>>>>>              }
>>>>>>>              bestpeer$Quarters.Since.IPO.Issue =
>>>>>>> arow$Quarters.Since.IPO.Issue
>>>>>>>
>>>>>>> #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO ==
>>>>>>> bestpeer$PERMNO] = 1
>>>>>>>              result = rbind(result, as.matrix(bestpeer))
>>>>>>>          }
>>>>>>>      }
>>>>>>>      #result = rbind(result, tfdata_quarter)
>>>>>>>      print (aquarter)
>>>>>>>  }
>>>>>>>
>>>>>>>  result = as.data.frame(result)
>>>>>>>  names(result) = colnames
>>>>>>>  return(result)
>>>>>>>
>>>>>>> }
>>>>>>>
>>>>>>> ========= end of my function =============
>>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help_at_r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Received on Fri 06 Jun 2008 - 19:21:30 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 06 Jun 2008 - 19:30:40 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive