[R] memory problem in handling large dataset

From: Weiwei Shi <helprhelp_at_gmail.com>
Date: Fri 28 Oct 2005 - 02:27:46 EST

Dear Listers:
I have a question about handling a large dataset. I searched R-Search, and I hope to get more information on my specific case.

First, my dataset has 1.7 billion observations and 350 variables, of which 300 are floats and 50 are integers. My system is a Linux box with 8 GB of memory and a 64-bit CPU (we currently don't plan to buy more memory).

> R.version

platform i686-redhat-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
major    2
minor    1.1
year     2005
month    06
day      20
language R

If I want to run an analysis such as randomForest on this dataset, what is the maximum number of observations I can load and still have the machine run smoothly?
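
For a sense of scale, the arithmetic behind this question can be sketched as below. This is only a back-of-the-envelope estimate in Python, not a measurement: the 8-byte and 4-byte storage sizes are assumptions about how the values are held in memory, and COPY_FACTOR is a hypothetical allowance for the working copies an analysis may make, not a known property of randomForest.

```python
# Rough memory estimate; every size below is an assumption, not a measurement.
N_ROWS = 1_700_000_000    # observations, from the post
N_FLOAT = 300             # float columns, from the post
N_INT = 50                # integer columns, from the post
BYTES_FLOAT = 8           # assumed: floats held as 8-byte doubles
BYTES_INT = 4             # assumed: integers held as 4-byte values
RAM_BYTES = 8 * 1024**3   # 8 GB of installed memory, from the post
COPY_FACTOR = 3           # hypothetical headroom for copies made during analysis


def bytes_per_row(n_float=N_FLOAT, n_int=N_INT):
    """Raw storage for one observation, ignoring any per-object overhead."""
    return n_float * BYTES_FLOAT + n_int * BYTES_INT


def max_rows(ram_bytes=RAM_BYTES, copy_factor=COPY_FACTOR):
    """Rows that fit if the data may be duplicated copy_factor times."""
    return ram_bytes // (copy_factor * bytes_per_row())


print(f"bytes per row: {bytes_per_row()}")
print(f"full dataset:  {N_ROWS * bytes_per_row() / 1e12:.2f} TB")
print(f"rows that fit: {max_rows():,}")
```

Under these assumptions a row takes 2,600 bytes, the full table about 4.4 TB, and roughly 1.1 million rows fit in 8 GB with a copy factor of 3. Because R.version reports an i686 (32-bit) build, the memory a single R process can actually use is probably lower than the installed 8 GB, so the real limit may be smaller still.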

Once I have that number, I want to draw a sample first, but I could not find a way to do this with read.table or scan. I guess I could load the data into MySQL and sample with RMySQL, or use Python to subset the data first. My question is: is there a way to subsample directly from the file using only R?
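
The Python route mentioned above could look roughly like the following. It is a minimal sketch of reservoir sampling (Algorithm R), which reads the file once and keeps only k lines in memory. The function names, the file name in the usage note, and the assumption that the file is plain text with one observation per line and a header row are all hypothetical.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return k items drawn uniformly without replacement from an iterable,
    in a single pass and with O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_file(path, k, seed=None):
    """Sample k data lines from a text file, assuming the first line is a header."""
    with open(path, "r") as fh:
        header = fh.readline()
        rows = reservoir_sample(fh, k, seed)
    return header, rows
```

A call such as sample_file("data.txt", 1000000) would return the header and one million sampled lines, which can be written to a smaller file and then read with read.table. The sampled lines are not returned in their original file order.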


Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

Received on Fri Oct 28 04:34:46 2005
