[R] RE : how to use large data set ?

From: Yohan CHOUKROUN <YCH_at_softcomputing.com>
Date: Thu 20 Jul 2006 - 18:10:09 EST


Thank you for your answers.
So, I have to derive, reclassify, bin, aggregate, and merge variables (data management). After that I have to fit a logistic regression for scoring.
I tried a MySQL database with RODBC, but I made a mistake: the table has 33 million rows, not 3 million, and the computer crashed. So I will try your program.
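
For the logistic regression step, one possible route is bigglm() from the same biglm package, which accepts a function that hands over the data chunk by chunk. The sketch below is only an illustration under assumptions: the data are simulated, and the column names (age, y), the helper make_reader, the chunk size, and the bin breaks are all made up for the example. With a real database, the reader function would fetch rows from the connection instead of slicing a data frame.

```r
library(biglm)

# Simulated stand-in for the real table (hypothetical columns).
set.seed(123)
n <- 20000
dat <- data.frame(age = runif(n, 18, 80))
dat$y <- rbinom(n, 1, plogis(-3 + 0.04 * dat$age))

# bigglm() accepts a function of `reset`: called with reset = TRUE it
# rewinds to the start; otherwise it returns the next chunk as a data
# frame, or NULL when no data are left.
make_reader <- function(data, chunk_size = 2000) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0
      return(NULL)
    }
    if (pos >= nrow(data)) return(NULL)
    idx <- (pos + 1):min(pos + chunk_size, nrow(data))
    pos <<- max(idx)
    chunk <- data[idx, , drop = FALSE]
    # Derived variable (binning) built per chunk with fixed breaks,
    # so every chunk carries the same factor levels.
    chunk$age_band <- cut(chunk$age, breaks = c(18, 30, 45, 60, 80),
                          include.lowest = TRUE)
    chunk
  }
}

# Logistic regression fitted over the chunks (several passes over the data).
fit <- bigglm(y ~ age_band, data = make_reader(dat), family = binomial())
coef(fit)
```

Because the fit is iterative, the reader is rewound and re-read on each pass, so whatever source it wraps must be able to restart from the beginning.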

Thank you

Yohan  

-----Original Message-----

From: Greg Snow [mailto:Greg.Snow@intermountainmail.org]
Sent: Wednesday, July 19, 2006 21:58
To: Yohan CHOUKROUN; r-help@stat.math.ethz.ch
Subject: RE: [R] how to use large data set ?

You did not say what analysis you want to do, but many common analyses can be done as special cases of regression models and you can use the biglm package to do regression models.
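
As a small illustration of the "special case of regression" remark, group means can be read off a regression on a factor. The sketch below uses base R only and simulated data; the names day and los are made up for the example and the values have nothing to do with any real table.

```r
# Simulated data: names and values are illustrative only.
set.seed(1)
day <- factor(sample(1:7, 500, replace = TRUE), levels = 1:7)
los <- rexp(500, rate = 0.3) + 0.1   # strictly positive "length of stay"

# Regression of log(los) on the factor, default treatment contrasts.
fit <- lm(log(los) ~ day)
b <- coef(fit)

# The intercept is the mean of level 1; the other coefficients are
# differences from that level, so adding them back gives each group mean.
means_from_fit <- unname(c(b[1], b[1] + b[-1]))

# Group means computed directly, for comparison.
means_direct <- as.vector(tapply(log(los), day, mean))

all.equal(means_from_fit, means_direct)
```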

Here is an example that worked for me to get the mean and standard deviation by day from an Oracle database with over 23 million rows (I had previously set up 'edw' as an ODBC connection to the database under Windows; any of the database connection packages should work for you, though):

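library(RODBC)
library(biglm)

# 'edw' is an ODBC data source set up beforehand; 'pass' holds the password
con <- odbcConnect('edw',uid='glsnow',pwd=pass)

# send the query once; the rows are then fetched in chunks below
odbcQuery(con, "select ADMSN_WEEKDAY_CD, LOS_DYS from CM.CASEMIX_SMRY")

t1 <- Sys.time()

# the first chunk of up to 100,000 rows initialises the fit
tmp <- sqlGetResults(con, max=100000)

names(tmp) <- c("Day","LoS")
# fix the factor levels so that every chunk gives the same model columns
tmp$Day <- factor(tmp$Day, levels=as.character(1:7))
tmp <- na.omit(tmp)
tmp <- subset(tmp, LoS > 0)

ff <- log(LoS) ~ Day

fit <- biglm(ff, tmp)

i <- nrow(tmp)
# keep fetching; the loop ends once sqlGetResults() no longer returns
# a data frame, because nrow() of anything else is NULL
while( !is.null(nrow( tmp <- sqlGetResults(con, max=100000) ) ) ){

	names(tmp) <- c("Day","LoS")
	tmp$Day <- factor(tmp$Day, levels=as.character(1:7))
	tmp <- na.omit(tmp)
	tmp <- subset(tmp, LoS > 0)

	# fold the new chunk into the existing fit
	fit <- update(fit,tmp)
	
	i <- i + nrow(tmp)
	cat(format(i,big.mark=',')," rows processed\n")
}

summary(fit)

t2 <- Sys.time()

# elapsed time for the whole run
t2-t1  

# release the database connection
odbcClose(con)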
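
To try the chunked update pattern without a database, a rough sketch along these lines may help. It is not the code that was run against the real data: the data frame is simulated, the chunk size of 1,000 is arbitrary, and the column names are simply reused from the example for readability. It compares the incrementally updated biglm fit with an ordinary lm() fit on all rows at once.

```r
library(biglm)

# Simulated stand-in for the database table (illustrative values only).
set.seed(42)
n <- 10000
dat <- data.frame(
  Day = factor(sample(1:7, n, replace = TRUE), levels = as.character(1:7)),
  LoS = rexp(n, rate = 0.3) + 0.1
)

ff <- log(LoS) ~ Day

# Split the row indices into chunks of 1,000, mimicking repeated fetches.
chunks <- split(seq_len(n), ceiling(seq_len(n) / 1000))

# The first chunk initialises the fit; the rest are folded in with update().
fit <- biglm(ff, dat[chunks[[1]], ])
for (idx in chunks[-1]) {
  fit <- update(fit, dat[idx, ])
}

# The incremental fit should agree with lm() on the full data
# up to numerical error.
full <- lm(ff, data = dat)
all.equal(unname(coef(fit)), unname(coef(full)), tolerance = 1e-6)
```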

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow@intermountainmail.org
(801) 408-8111
 


-----Original Message-----
From: r-help-bounces@stat.math.ethz.ch [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of Yohan CHOUKROUN
Sent: Wednesday, July 19, 2006 9:42 AM
To: 'r-help@stat.math.ethz.ch'
Subject: [R] how to use large data set ?

Hello R users,

Sorry for my English, I'm French.

I want to use a large dataset (3 million rows and 70 variables), but I don't know how to proceed because my computer crashes quickly (P4 2.8 GHz, 1 GB). I also have a dual Xeon with 2 GB, so I want to do the computation on that machine and show the results on mine. Both of them run Windows XP.

In short, I have one server with a MySQL database and one computer, and I want to use them with a large dataset. I'm trying to use RDCOM to connect to the database and to install Rpad (but it's hard for me). Are there other solutions?

Thanks in advance,

Yohan C.
----------------------------------------------------------------------
This message is confidential. Its content does not constitute a commitment by Soft Computing Group except where provided for in a written agreement between you and Soft Computing Group. Any unauthorised disclosure, use or dissemination, either whole or partial, is prohibited. If you are not the intended recipient of this message, please notify the sender immediately.
----------------------------------------------------------------------
[[alternative HTML version deleted]]

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu Jul 20 18:20:21 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 20 Jul 2006 - 20:17:45 EST.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.