Re: [R] Alternatives to merge for large data sets?

From: Adam D. I. Kramer <adik_at_ilovebacon.org>
Date: Thu 07 Sep 2006 - 18:46:04 GMT

On Thu, 7 Sep 2006, Prof Brian Ripley wrote:

> Which version of R?

Previously, 2.3.1.

> Please try 2.4.0 alpha, as it has a different and more efficient
> algorithm for the case of 1-1 matches.

I downloaded and installed R-latest, but got the same error message:

Error: cannot allocate vector of size 7301 Kb

...though at least the too-big size was larger this time.

My data set is not exactly 1-1; every item in "prof" may have one or more matches in "pubbounds," though every item in "pubbounds" corresponds to only one "prof."
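
If that many-to-one structure really holds, one possible way around merge is a plain key lookup with match(), which needs only one integer index as long as pubbounds. What follows is a minimal sketch on toy data, not a tested solution for the real data frames: the "bound" and "age" columns are made up for illustration, and only the key names "user" and "userid" come from the merge call quoted below.

```r
# Toy stand-ins for the real data frames. The columns "bound" and "age"
# are hypothetical; only the keys "user" and "userid" come from the thread.
pubbounds <- data.frame(user = c("a", "b", "a", "c"),
                        bound = c(1.5, 2.0, 3.5, 4.0),
                        stringsAsFactors = FALSE)
prof <- data.frame(userid = c("a", "b", "c"),
                   age = c(30, 41, 25),
                   stringsAsFactors = FALSE)

# For each row of pubbounds, find the row of prof with the same key.
# match() returns NA where no userid matches, which yields NA columns below.
idx <- match(pubbounds$user, prof$userid)

# Pull the prof columns (minus the duplicate key) by row lookup and bind
# them on. This assumes each pubbounds row has at most one match in prof.
prof.cols <- prof[idx, setdiff(names(prof), "userid"), drop = FALSE]
rownames(prof.cols) <- NULL
pubbounds.prof <- cbind(pubbounds, prof.cols)
```

Unlike merge(..., all=TRUE), this keeps only rows that appear in pubbounds, so any prof row with no match at all would be dropped; by the description above there should be none, but that is worth checking with something like all(prof$userid %in% pubbounds$user).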

--Adam

>
> On Wed, 6 Sep 2006, Adam D. I. Kramer wrote:
>
>> Hello,
>>
>> I am trying to merge two very large data sets, via
>>
>> pubbounds.prof <-
>> merge(x=pubbounds,y=prof,by.x="user",by.y="userid",all=TRUE,sort=FALSE)
>>
>> which gives me an error of
>>
>> Error: cannot allocate vector of size 2962 Kb
>>
>> I am reasonably sure that this is correct syntax.
>>
>> The trouble is that pubbounds and prof are large; they are data frames which
>> take up 70M and 11M respectively when saved as .Rdata files.
>>
>> I understand from various archive searches that "merge can't handle that,"
>> because merge takes n^2 memory, which I do not have.
>
> Not really true (it has been changed since those days). Of course, if you
> have multiple matches it must do so.
>
>> My question is whether there is an alternative to merge which would carry
>> out the process in a slower, iterative manner...or if I should just bite the
>> bullet, write.table, and use a perl script to do the job.
>>
>> Thankful as always,
>> Adam D. I. Kramer
>
> --
> Brian D. Ripley, ripley@stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272866 (PA)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Fri Sep 08 04:50:32 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 07 Sep 2006 - 19:43:39 GMT.