Re: [R] how to use large data set ?

From: bogdan romocea <br44114_at_gmail.com>
Date: Thu 20 Jul 2006 - 22:57:23 EST


By far the cheapest and easiest solution (and the very first to try) is to add more memory. The cost depends on the kind you need, but as an example, here is 2 GB that you can buy for only $150: http://www.newegg.com/Product/Product.asp?Item=N82E16820144157
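
As a rough check on how much memory is actually needed, here is a back-of-the-envelope estimate for the image in the quoted message below. Only the 9 million pixels come from that message; the number of columns is my assumption (x, y and three bands, all stored as doubles), so adjust it to your data. Treat it as a sketch, not a measurement.

```r
# Rough estimate of the RAM needed to hold an image as a data.frame.
n_pixels         <- 9e6  # "some 9 million pixels", from the quoted question
n_columns        <- 5    # assumption: x, y and three bands, all doubles
bytes_per_double <- 8

est_mb <- n_pixels * n_columns * bytes_per_double / 2^20
cat(sprintf("about %.0f MB before any temporary copies\n", est_mb))

# Cross-check the 8-bytes-per-value figure on a small vector.
per_value <- as.numeric(object.size(numeric(1e6))) / 1e6
cat(sprintf("measured %.2f bytes per value\n", per_value))
```

R often makes temporary copies during a conversion, so the real peak can be a multiple of this figure, which is why a 512 MB machine is likely to hit its limit.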

Project constraints?! If they don't want to spend a couple hundred USD on memory, you're working on the wrong project (and/or for the wrong organization). Buying more memory (say, up to a few GB) is orders of magnitude cheaper than the licenses for some proprietary software that can get around memory constraints, and probably (much) cheaper than the loss of productivity caused by the extra training and setup time needed to implement an alternative solution (such as a connection to a DBMS). And even if the extra memory needed for R were as expensive as a proprietary software license, which choice would be more reasonable?
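
For comparison, this is roughly what the alternative route looks like without a DBMS: read the data from a file in chunks and keep only running totals in memory. It is a minimal sketch using base R only. The file, the column name band1 and the chunk size are all hypothetical, and it computes nothing more than a mean.

```r
# Write a small demo file so the sketch is self-contained (hypothetical data).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(band1 = 1:1000), path, row.names = FALSE)

con <- file(path, open = "r")
invisible(readLines(con, n = 1))  # skip the header line

chunk_size <- 300  # assumption: in practice, as large as memory allows
n     <- 0
total <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break   # end of file reached
  x     <- as.numeric(lines)
  n     <- n + length(x)          # rows seen so far
  total <- total + sum(x)         # running sum; the chunk is then discarded
}
close(con)

running_mean <- total / n
```

Anything beyond simple sums needs an algorithm that can be updated chunk by chunk, which is where the extra setup time comes from.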

> -----Original Message-----
> From: r-help-bounces@stat.math.ethz.ch
> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of mahesh r
> Sent: Wednesday, July 19, 2006 4:23 PM
> To: r-help@stat.math.ethz.ch
> Subject: Re: [R] how to use large data set ?
>
> Hi,
> I would like to extend the query posted earlier about using large
> databases. I am trying to use rgdal to mine remote sensing imagery.
> I don't have problems bringing the images into the R environment, but
> when I try to convert the images to a data.frame I receive a warning
> message from R saying "1: Reached total allocation of 510Mb: see
> help(memory.size)" and the process terminates. Due to project
> constraints I have been given a very old 2.4 GHz computer with only
> 512 MB RAM. I think R is trying to store the results in RAM and,
> since the image is very big (some 9 million pixels), it runs out of
> memory.
>
> My questions are:
> 1. Is there any possibility of dumping the temporary variables to a
> temp folder on the hard disk (as many software packages do) instead
> of letting R store them in RAM?
> 2. Could this be done without creating a connection to a back-end
> database like Oracle?
>
> Thanks,
>
> Mahesh
>
>
> On 7/19/06, Greg Snow <Greg.Snow@intermountainmail.org> wrote:
> >
> > You did not say what analysis you want to do, but many common
> > analyses can be done as special cases of regression models, and you
> > can use the biglm package to fit regression models.
> >
> > Here is an example that worked for me to get the mean and standard
> > deviation by day from an Oracle database with over 23 million rows
> > (I had previously set up 'edw' as an ODBC connection to the database
> > under Windows; any of the database connection packages should work
> > for you, though):
> >
> > library(RODBC)
> > library(biglm)
> >
> > # 'pass' must already hold the database password
> > con <- odbcConnect('edw', uid='glsnow', pwd=pass)
> >
> > # send the query; the rows are fetched in chunks below
> > odbcQuery(con, "select ADMSN_WEEKDAY_CD, LOS_DYS from CM.CASEMIX_SMRY")
> >
> > t1 <- Sys.time()
> >
> > # first chunk: clean it and fit the initial model
> > tmp <- sqlGetResults(con, max=100000)
> >
> > names(tmp) <- c("Day","LoS")
> > tmp$Day <- factor(tmp$Day, levels=as.character(1:7))
> > tmp <- na.omit(tmp)
> > tmp <- subset(tmp, LoS > 0)
> >
> > ff <- log(LoS) ~ Day
> >
> > fit <- biglm(ff, tmp)
> >
> > i <- nrow(tmp)
> > # remaining chunks: the loop ends when sqlGetResults() no longer
> > # returns a data frame (nrow() is then NULL)
> > while( !is.null(nrow( tmp <- sqlGetResults(con, max=100000) ) ) ){
> >   names(tmp) <- c("Day","LoS")
> >   tmp$Day <- factor(tmp$Day, levels=as.character(1:7))
> >   tmp <- na.omit(tmp)
> >   tmp <- subset(tmp, LoS > 0)
> >
> >   # update the fit with the new chunk only
> >   fit <- update(fit,tmp)
> >
> >   i <- i + nrow(tmp)
> >   cat(format(i,big.mark=',')," rows processed\n")
> > }
> >
> > summary(fit)
> >
> > t2 <- Sys.time()
> >
> > # elapsed time
> > t2-t1
> >
> > Hope this helps,
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow@intermountainmail.org
> > (801) 408-8111
> >
> >
> > -----Original Message-----
> > From: r-help-bounces@stat.math.ethz.ch
> > [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of
> Yohan CHOUKROUN
> > Sent: Wednesday, July 19, 2006 9:42 AM
> > To: 'r-help@stat.math.ethz.ch'
> > Subject: [R] how to use large data set ?
> >
> > Hello R users,
> >
> > Sorry for my English, I'm French.
> >
> > I want to use a large dataset (3 million rows and 70 variables), but
> > I don't know how to proceed because my computer crashes quickly
> > (P4 2.8 GHz, 1 GB).
> >
> > I also have a dual Xeon with 2 GB, so I want to do the computation on
> > that machine and show the results on mine. Both of them run
> > Windows XP.
> >
> > In short, I have:
> >
> > 1 server with a MySQL database
> > 1 computer
> >
> > and I want to use them with a large dataset.
> >
> > I'm trying to use RDCOM to connect to the database and to install
> > Rpad (but it's hard for me).
> >
> > Are there other solutions?
> >
> > Thanks in advance
> >
> > Yohan C.
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.



Received on Fri Jul 21 00:19:35 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Fri 21 Jul 2006 - 02:20:13 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.