Re: [R] merge( , by='row.names') slowness

From: <rex.dwyer_at_syngenta.com>
Date: Wed, 02 Mar 2011 18:12:36 -0500

-----Original Message-----
From: r-help-bounces_at_r-project.org [mailto:r-help-bounces_at_r-project.org] On Behalf Of dms
Sent: Wednesday, March 02, 2011 3:16 PM
To: r-help_at_r-project.org
Subject: [R] merge( , by='row.names') slowness

I noticed that when joining two data.frames in R with the "merge" function, using by='row.names' is substantially slower than joining on a common index column.

Using a dataframe size of ~10,000 rows: it's as slow as 10 minutes in the by='row.names' case versus merely 1 second using an index column. Beyond the 10^6 range, it's unusably slow.

n <- 5
a <- data.frame(id=as.character(1:10^n), x=rnorm(10^n)); rownames(a) <- a$id
b <- data.frame(id=as.character(1:10^n + 10^(n-1)), y=rnorm(10^n)); rownames(b) <- b$id

date()
fast <- merge(a, b, all=T)
date()
slow <- merge(a, b, all=T, by='row.names')
date()

Has anybody else noticed this?


Hi DMS,
Well, first off, they don't give the same answer... in fact, not even the same dimension. Even so, from looking at merge.data.frame, it's not immediately obvious what would make a difference of this magnitude. The answer might be buried in the internal merge.
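
To make the difference in dimension concrete, here is a small sketch. It reuses the definitions of a and b from the original post, but with n = 1 (my choice, only to keep things short). As far as I can tell, the by='row.names' result carries a Row.names column plus both id columns with .x/.y suffixes, which would account for 5 columns instead of 3:

```r
# Same construction as in the original post, shrunk to n = 1 (assumption: small
# size chosen only for illustration).
n <- 1
a <- data.frame(id = as.character(1:10^n), x = rnorm(10^n))
rownames(a) <- a$id
b <- data.frame(id = as.character(1:10^n + 10^(n - 1)), y = rnorm(10^n))
rownames(b) <- b$id

# Default 'by' is the set of shared column names, here just "id".
by_id <- merge(a, b, all = TRUE)

# Joining on row names keeps 'id' from both inputs as ordinary data columns.
by_row <- merge(a, b, all = TRUE, by = "row.names")

print(names(by_id))
print(names(by_row))
print(dim(by_id))
print(dim(by_row))
```

So the two calls are joining on the same values here, but the row-name version returns the key plus both copies of id rather than a single merged id column.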

Here for n=3:
> system.time(print(dim(merge(a,b,all=T))))
[1] 1100 3

   user system elapsed
   0.01 0.00 0.01
> system.time(print(dim(merge(a,b,all=T,by=1))))
[1] 1100 3

   user system elapsed
   0.01 0.00 0.02
> system.time(print(dim(merge(a,b,all=T,by=0))))
[1] 1100 5

   user system elapsed
   3.26 0.00 3.17
> system.time(print(dim(merge(a,b,all=T,by="row.names"))))
[1] 1100 5

   user system elapsed
   3.17 0.00 3.17
>
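
A possible workaround, given that the by-column path is fast in the timings above: copy the row names into an ordinary column and merge on that. This is only a sketch, not a diagnosis of where the time goes. The column name "key" and the names a_key, b_key, via_key and via_row are made up for illustration, and timings will differ by machine and R version.

```r
# Same a and b as in the original post, with n = 3 as in the timings above.
n <- 3
a <- data.frame(id = as.character(1:10^n), x = rnorm(10^n))
rownames(a) <- a$id
b <- data.frame(id = as.character(1:10^n + 10^(n - 1)), y = rnorm(10^n))
rownames(b) <- b$id

# Copy the row names into an explicit column ("key" is a hypothetical name).
a_key <- data.frame(key = rownames(a), a, stringsAsFactors = FALSE)
b_key <- data.frame(key = rownames(b), b, stringsAsFactors = FALSE)

# Merge on the explicit column instead of by = "row.names".
print(system.time(via_key <- merge(a_key, b_key, by = "key", all = TRUE)))
print(system.time(via_row <- merge(a, b, by = "row.names", all = TRUE)))

# Both results should have the same shape and cover the same keys;
# row order is not compared here.
print(dim(via_key))
print(dim(via_row))
print(setequal(via_key$key, as.character(via_row$Row.names)))
```

The key column plays the role of Row.names in the output, so downstream code only needs to account for the different column name.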



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




Received on Wed 02 Mar 2011 - 23:21:34 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 03 Mar 2011 - 00:20:18 GMT.
