Re: [R] memory problem in handling large dataset

From: Weiwei Shi <helprhelp_at_gmail.com>
Date: Fri 28 Oct 2005 - 03:24:56 EST

Hi, Jim:
Thanks for the calculation. I hope you don't mind that I cc the reply to r-help as well, so I can get more input.

I assume you used 4 bytes per integer and 8 bytes per float, so 300*8 + 50*4 = 2,600 bytes per observation, right?
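For the record, Jim's arithmetic can be checked in R itself; the figures below just restate the numbers from this thread (8-byte doubles, 4-byte integers, 1.7 billion rows, 8 GB of RAM):

```r
bytes_per_obs <- 300 * 8 + 50 * 4        # 300 doubles + 50 ints = 2600 bytes
n_obs         <- 1.7e9                   # 1.7 billion observations
total_bytes   <- n_obs * bytes_per_obs   # ~4.42e12 bytes, i.e. ~4.4 TB

# With 8 GB of RAM and at most 25% in a single object (many algorithms
# make copies), the per-chunk cap works out to roughly 770,000 rows,
# which Jim rounded down to ~700,000:
max_obs <- floor(0.25 * 8e9 / bytes_per_obs)
```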

I wish I could have 500x8 G of memory :) just kidding. Definitely, sampling will be the first step, and some feature selection (mainly filtering) will be applied. Following Berton's suggestion, I will probably use Python to do the sampling, since whenever I hit a "slow" situation like this, Python never fails me. (I am not saying R is bad, though.)
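That said, the sampling can also be done directly from the flat file in R by reading it in chunks, so only one chunk is ever resident in memory. This is just a sketch, not a tested recipe; the chunk size and sampling fraction are made up, and it assumes a whitespace-delimited file with no header:

```r
# Bernoulli-sample rows from a large flat file, one chunk at a time.
sample_file <- function(path, frac = 1e-4, chunk = 1e5) {
  con <- file(path, open = "r")
  on.exit(close(con))
  out <- list()
  repeat {
    # read.table() on an open connection reads the next `chunk` rows;
    # at end-of-file it raises an error, which we treat as "done".
    block <- tryCatch(read.table(con, nrows = chunk),
                      error = function(e) NULL)
    if (is.null(block) || nrow(block) == 0) break
    keep <- runif(nrow(block)) < frac          # keep each row with prob `frac`
    out[[length(out) + 1]] <- block[keep, , drop = FALSE]
    if (nrow(block) < chunk) break             # last, short chunk
  }
  do.call(rbind, out)
}

# sub <- sample_file("big.dat", frac = 1e-4)   # "big.dat" is hypothetical
```

At 2,600 bytes per row this still means one full pass over the file, but peak memory stays near `chunk` rows plus the accumulated sample.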

I understand "you get what you pay for" here. But any further information or experience with R handling large datasets (e.g. using RMySQL) would be appreciated.
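If the data do get loaded into MySQL first, the RMySQL route might look roughly like the sketch below. The database, table, and column names are invented, and the credentials are placeholders; note that `WHERE RAND() < frac` costs one table scan, whereas `ORDER BY RAND() LIMIT n` would additionally sort all 1.7 billion rows:

```r
library(RMySQL)

# Connection details are placeholders for illustration only.
con <- dbConnect(MySQL(), dbname = "mydb",
                 user = "me", password = "secret")

# Pull an approximate 0.01% Bernoulli sample server-side,
# so only the sampled rows ever cross into R.
sub <- dbGetQuery(con, "SELECT * FROM big_table WHERE RAND() < 0.0001")

dbDisconnect(con)
```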

regards,

Weiwei

On 10/27/05, jim holtman <jholtman@gmail.com> wrote:
> Based on the numbers that you gave, if you wanted all the data in memory at
> once, you would need 4.4TB of memory, about 500X what you currently have.
> Each of your observations will require about 2,600 bytes of memory. You
> probably don't want a single object to use more than 25% of memory, since
> many of the algorithms make copies. This would limit you to about 700,000
> observations at a time for processing.
>
> The real question is what you are trying to do with the data. Can you
> partition the data and do the analysis on subsets?
>
>
> On 10/27/05, Weiwei Shi <helprhelp@gmail.com> wrote:
> >
> > Dear Listers:
> > I have a question about handling a large dataset. I searched R-Search,
> > and I hope I can get more information on my specific case.
> >
> > First, my dataset has 1.7 billion observations and 350 variables,
> > among which 300 are floats and 50 are integers.
> > My system is a Linux box with 8 G of memory and a 64-bit CPU.
> > (Currently, we don't plan to buy more memory.)
> >
> > > R.version
> > _
> > platform i686-redhat-linux-gnu
> > arch i686
> > os linux-gnu
> > system i686, linux-gnu
> > status
> > major 2
> > minor 1.1
> > year 2005
> > month 06
> > day 20
> > language R
> >
> >
> > If I want to run an analysis, for example randomForest, on a
> > dataset, what is the maximum number of observations I can load and
> > still have the machine run smoothly?
> >
> > After figuring out that number, I want to do some sampling first, but
> > I did not find that read.table or scan can do this. I guess I could load
> > the data into MySQL and then use RMySQL to do the sampling, or use Python
> > to subset the data first. My question is: is there a way to subsample
> > directly from the file using only R?
> >
> > Thanks,
> > --
> > Weiwei Shi, Ph.D
> >
> > "Did you always know?"
> > "No, I did not. But I believed..."
> > ---Matrix III
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 247 0281
>
> What is the problem you are trying to solve?

--
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Fri Oct 28 03:52:39 2005

This archive was generated by hypermail 2.1.8 : Fri 28 Oct 2005 - 06:17:51 EST