Re: [Rd] as.data.frame requires a lot of memory (PR#14140)

From: Simon Urbanek <simon.urbanek_at_r-project.org>
Date: Mon, 14 Dec 2009 17:01:19 -0500

On Dec 14, 2009, at 12:45 , rfalke_at_tzi.de wrote:

> Full_Name: Raimar Falke
> Version: R version 2.10.0 (2009-10-26)
> OS: Linux 2.6.27-16-generic #1 SMP Tue Dec 1 19:26:23 UTC 2009
> x86_64 GNU/Linux
> Submission from: (NULL) (134.102.222.56)
>
>
> The construction of a data frame in the way shown below requires
> much more memory than expected. If we assume a cell value takes 8
> bytes, the total amount of data is 128 MB. However, the process takes
> about 920 MB and not the expected 256 MB (two times the data set).
>
> With the real data sets (~35000 observations with ~33000 attributes),
> the conversion to a data frame has to be killed at 60 GB of
> memory usage, while it should only require 17.6 GB (2*8.8 GB).
>
> dfn <- rep(list(rep(0, 4096)), 4096)
> test <- as.data.frame.list(dfn)
>
> I also tried the incremental construction of the
> data-frame: df$colN <- dataForColN. While I currently can't say much
> about the memory usage, it takes a looong time.
>
> After the construction the saved-and-loaded data-frame has the
> expected size.
>
> What is the recommended way to construct larger data-frames?
>

Please use R-help for questions, and not the bug tracking system!

There are a few issues with your example, mainly because it has no row names and no column names, so R will try to create them from the expression, which is inherently slow and memory-consuming. So first, make sure you set the names on the list, e.g.:

names(dfn) <- paste("V",seq.int(length(dfn)),sep='')

That will reduce the overhead due to column names. Then what as.data.frame does is to simply call data.frame on the elements of the list. That ensures that all is right, but if you know for sure that your list is valid (correct lengths, valid names, no need for row names etc.) then you can simply assert that it is a data frame:

class(dfn)<-"data.frame"
row.names(dfn)<-NULL
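
For readers who want to try the idea, here is a minimal sketch on a scaled-down list (the size `n` and the name `payload_bytes` are assumptions for illustration, not part of the original report). One caveat: depending on the R version, `row.names(dfn) <- NULL` may infer zero rows when the list has no row.names attribute yet, so this sketch sets the compact automatic row names explicitly with `.set_row_names()` instead, and then checks the result.

```r
# Scaled-down stand-in for the 4096 x 4096 example (n is an assumption).
n <- 256L
dfn <- rep(list(rep(0, n)), n)

# Explicit column names, so R does not have to derive them.
names(dfn) <- paste("V", seq_along(dfn), sep = "")

# Declare the list to be a data frame without going through data.frame().
# .set_row_names(n) yields the compact form c(NA, -n), which stands for
# the automatic row names 1..n.
attr(dfn, "row.names") <- .set_row_names(n)
class(dfn) <- "data.frame"

# Sanity checks: this only works if every element has length n.
stopifnot(is.data.frame(dfn), all(dim(dfn) == n))

# Raw payload: n * n doubles at 8 bytes each; object.size() reports
# somewhat more because of per-column and attribute overhead.
payload_bytes <- n * n * 8
print(payload_bytes / 2^20)            # payload in MiB
print(object.size(dfn), units = "Mb")
```

Nothing here validates the list, so it is only appropriate when the lengths and names are already known to be correct.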

You'll still need double the memory because the object needs to be copied for the attribute modifications, but that's as low as it gets -- although in your exact example there is an even more efficient way:

dfn <- rep(data.frame(X=rep(0, 4096)), 4096)
dfn <- do.call("cbind", dfn)

it uses only a fraction more memory than the size of the entire object, but that's for entirely different reasons :). No, it's not good in general :P
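
It may be worth checking what that last snippet actually returns before relying on it. The following small sketch (the sizes and the names `cols` and `res` are assumptions for illustration) inspects the intermediate and final objects; in current R versions, `rep()` on a data frame appears to yield a plain list of columns, and `cbind()` on numeric vectors yields a matrix rather than a data frame, which may be part of the "entirely different reasons" mentioned above.

```r
# Tiny stand-in for the 4096 x 4096 case (sizes are assumptions).
n <- 4L
cols <- rep(data.frame(X = rep(0, n)), 3)

# Inspect the intermediate object produced by rep().
print(class(cols))

# Bind the columns and inspect what comes back.
res <- do.call("cbind", cols)
print(is.data.frame(res))
print(is.matrix(res))
print(dim(res))
```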

Cheers,
Simon



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Mon 14 Dec 2009 - 22:04:27 GMT
