Re: [R] Help with R

From: Christoph Lehmann <christoph.lehmann_at_gmx.ch>
Date: Thu 05 May 2005 - 23:05:55 EST

>>I heard that 'R' does not do a very good job at handling large datasets, is
>>this true?
>
Importing huge datasets into a data.frame, e.g. with a subsequent step that converts some columns into factors, may lead to memory trouble (probably due to memory overhead when building the factors). But we recently succeeded in importing 12 million data records stored in a MySQL database, using the RMySQL package. The procedure which led to success was:

0 define a data.frame 'data.total' of the size necessary to hold the whole data set to be imported
in a loop do:

   1 import the data in chunks of e.g. 30000 records per chunk and save each chunk in a temporary data.frame 'data.chunk'

   2 the conversion into factors and other preprocessing steps, such as data aggregation, should be done on each single chunk saved in 'data.chunk' right after import

   3 the now preprocessed chunk is saved into the appropriate part of the data.frame 'data.total' defined at the beginning

4 the whole dataset is imported and the data.frame 'data.total' is ready for further computational steps
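
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the code we actually ran: the database is replaced by a small in-memory data.frame so the example runs on its own, and the names 'raw', 'fetch_chunk' and 'group_levels' as well as the column names are made up for the example. It also assumes the factor levels are known before the import starts. With RMySQL, the chunk would presumably come from fetch(res, n = chunk_size) on a result set opened with dbSendQuery().

```r
# Sketch only: sizes are tiny stand-ins for 12 million records / 30000 per chunk.
set.seed(1)
n_total      <- 100
chunk_size   <- 30
group_levels <- c("a", "b", "c")   # assumption: factor levels are known up front

# Hypothetical stand-in for the MySQL table: raw records with a character column.
raw <- data.frame(id    = seq_len(n_total),
                  group = sample(group_levels, n_total, replace = TRUE),
                  stringsAsFactors = FALSE)

# Hypothetical helper returning the next chunk of rows.
# With RMySQL this role would be played by fetch(res, n = chunk_size).
fetch_chunk <- function(start, n) {
  raw[start:min(start + n - 1, nrow(raw)), , drop = FALSE]
}

# Step 0: define 'data.total' with its final size and final column types.
data.total <- data.frame(
  id    = integer(n_total),
  group = factor(rep(NA_character_, n_total), levels = group_levels)
)

start <- 1
while (start <= n_total) {
  # Step 1: import one chunk into the temporary data.frame.
  data.chunk <- fetch_chunk(start, chunk_size)

  # Step 2: preprocess the chunk only (here: conversion into a factor).
  data.chunk$group <- factor(data.chunk$group, levels = group_levels)

  # Step 3: write the chunk into its slot of the preallocated data.frame.
  rows <- start:(start + nrow(data.chunk) - 1)
  data.total$id[rows]    <- data.chunk$id
  data.total$group[rows] <- data.chunk$group

  start <- start + nrow(data.chunk)
}

# Step 4: 'data.total' now holds the whole preprocessed dataset.
str(data.total)
```

Fixing the levels in advance matters in this sketch: assigning a factor value that is not among the levels of the preallocated column gives NA plus a warning, so every chunk is converted with the same 'levels' argument.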

In a nutshell: preprocessing steps such as conversion into factors can cause memory trouble, even for datasets which per se don't take too much memory, but done separately on smaller chunks of data, they can be handled by R very efficiently. The 'team' of MySQL together with R is VERY powerful.

Cheers
Christoph



Received on Thu May 05 22:14:02 2005