Re: [R] Alternatives to merge for large data sets?

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu 07 Sep 2006 - 08:57:58 GMT

Which version of R?

Please try 2.4.0 alpha, as it has a different and more efficient algorithm for the case of 1-1 matches.
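
To answer the version question and to see whether the 1-1 case applies, something along these lines may help. It is only a sketch: the toy data frames stand in for pubbounds and prof, the non-key columns are made up, and keys_unique is a hypothetical helper, not part of R.

```r
# Report the running R version (useful when posting to the list)
print(R.version.string)

# Toy stand-ins for the real data; 'bound' and 'age' are invented columns
pubbounds <- data.frame(user = c("a", "b", "c"), bound = c(1, 2, 3),
                        stringsAsFactors = FALSE)
prof <- data.frame(userid = c("b", "c", "d"), age = c(30, 40, 50),
                   stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when no key value is repeated
keys_unique <- function(k) anyDuplicated(k) == 0L

# The match is 1-1 only if neither side repeats a key
one_to_one <- keys_unique(pubbounds$user) && keys_unique(prof$userid)
print(one_to_one)
```

If one_to_one is FALSE on the real data, the multiple-match case applies and the result can be much larger than either input.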

On Wed, 6 Sep 2006, Adam D. I. Kramer wrote:

> Hello,
>
> I am trying to merge two very large data sets, via
>
> pubbounds.prof <-
> merge(x=pubbounds,y=prof,by.x="user",by.y="userid",all=TRUE,sort=FALSE)
>
> which gives me an error of
>
> Error: cannot allocate vector of size 2962 Kb
>
> I am reasonably sure that this is correct syntax.
>
> The trouble is that pubbounds and prof are large; they are data frames which
> take up 70M and 11M respectively when saved as .Rdata files.
>
> I understand from various archive searches that "merge can't handle that,"
> because merge takes n^2 memory, which I do not have.

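Not really true: merge has been changed since those days. Of course, if the keys have multiple matches, every matching pair of rows must appear in the result, so in that case the result itself can be that large.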
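
To judge how large the result would be before calling merge, one could count the key multiplicities first. A rough sketch, assuming toy data in place of the real frames and no NA keys (table() drops NAs by default, whereas merge treats them as matching each other); the variable names are illustrative only:

```r
# Toy data with repeated keys on both sides
x <- data.frame(user = c("a", "a", "b", "c"), stringsAsFactors = FALSE)
y <- data.frame(userid = c("a", "a", "a", "b", "d"), stringsAsFactors = FALSE)

# How often each key occurs on each side
x_counts <- table(x$user)
y_counts <- table(y$userid)
common <- intersect(names(x_counts), names(y_counts))

# Each shared key contributes (count in x) * (count in y) rows
matched_rows <- sum(as.numeric(x_counts[common]) * as.numeric(y_counts[common]))

# With all = TRUE, unmatched rows from either side are kept as well
unmatched_x <- sum(!(x$user %in% y$userid))
unmatched_y <- sum(!(y$userid %in% x$user))
expected_rows <- matched_rows + unmatched_x + unmatched_y
print(expected_rows)
```

If expected_rows is close to the larger input, the keys are essentially 1-1; if it is far larger, the memory demand comes from the data rather than from merge.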

> My question is whether there is an alternative to merge which would carry
> out the process in a slower, iterative manner...or if I should just bite the
> bullet, write.table, and use a perl script to do the job.
>
> Thankful as always,
> Adam D. I. Kramer
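
As for an alternative to merge: when the keys are unique on both sides, the same full join can be put together by hand with match(). This is a sketch under that assumption, not what merge does internally; the toy frames and the columns 'bound' and 'age' are made up for illustration.

```r
# Toy stand-ins for pubbounds and prof
pubbounds <- data.frame(user = c("a", "b", "c"), bound = c(1, 2, 3),
                        stringsAsFactors = FALSE)
prof <- data.frame(userid = c("b", "c", "d"), age = c(30, 40, 50),
                   stringsAsFactors = FALSE)

# This approach is only valid for 1-1 keys, so check first
stopifnot(anyDuplicated(pubbounds$user) == 0L,
          anyDuplicated(prof$userid) == 0L)

# Every key from either side, as with all = TRUE
all_keys <- union(pubbounds$user, prof$userid)

# Row positions of each key in each frame (NA where the key is absent)
xi <- match(all_keys, pubbounds$user)
yi <- match(all_keys, prof$userid)

# Indexing by NA yields NA-filled rows, which gives the outer-join padding
joined <- data.frame(
  user = all_keys,
  pubbounds[xi, setdiff(names(pubbounds), "user"), drop = FALSE],
  prof[yi, setdiff(names(prof), "userid"), drop = FALSE],
  row.names = NULL, stringsAsFactors = FALSE
)
print(joined)
```

Whether this uses less memory than merge on the real data is not something the sketch demonstrates; it only shows the same result can be reached without merge when the keys are 1-1.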

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu Sep 07 19:05:05 2006
