Re: [R] handling big data set in R

From: ONKELINX, Thierry <>
Date: Mon, 03 Mar 2008 09:28:36 +0100

Dear Shu,

Why not store your dataset in a database? Then you can start each loop iteration by reading only the submatrix you need for the analysis. This will require much less memory. Functions from the apply family will also work better than the for loop.
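A minimal sketch of that idea, not a tested solution for your data. It assumes the DBI and RSQLite packages (any database backend with a DBI driver would do), a table called "datamatrix" with columns X, Y and Z, and a small simulated data set standing in for your file. The helper name slope_for_group is made up for the example. Only the regression of Y on Z is shown, since the inputs of your first regression are not in the post.

```r
library(DBI)

# Simulated stand-in for the real data (assumption): X is the group id.
set.seed(1)
simulated <- data.frame(X = rep(1:5, each = 20),
                        Y = rnorm(100),
                        Z = rnorm(100))

# Store the data in a database. An in-memory SQLite database is used here;
# for a large data set a file-backed database would be the realistic choice.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "datamatrix", simulated)

# The group ids, read from the database.
ids <- dbGetQuery(con, "SELECT DISTINCT X FROM datamatrix ORDER BY X")$X

# Hypothetical helper: read one group's rows and return the slope of Y on Z.
slope_for_group <- function(id) {
  sub <- dbGetQuery(con,
                    "SELECT Y, Z FROM datamatrix WHERE X = ?",
                    params = list(id))
  unname(coef(lm(Y ~ Z, data = sub))[2])
}

# One slope per group, using vapply instead of an explicit for loop.
s2 <- vapply(ids, slope_for_group, numeric(1))

dbDisconnect(con)
```

Only one group's rows are held in memory at a time, so the memory needed depends on the largest group rather than on the whole table.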

HTH, Thierry

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
tel. + 32 54/436 185

Do not put your faith in what statistics say until you have carefully considered what they do not say. ~William W. Watt
A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. ~M.J.Moroney

-----Original message-----
From: [] On behalf of shu zhang
Sent: Monday 3 March 2008 6:35
Subject: [R] handling big data set in R

Hello R users,

I'm wondering whether it is possible to manage a big data set in R? I have a data set with 3 million rows and 3 columns (X,Y,Z), where X is the group id. For each X, I need to run 2 regressions on the submatrix. I used the function "split":

datamatrix<-read.csv("datas.csv", header=F, sep=",")
dim(datamatrix)
# [1] 2980523 3


subX<-split(X, X)

n<-length(subdata) ### number of groups
s1<-s2<-rep(NA, n) ### vectors to store the regression slopes

for (i in 1:n){
  fit1<-lm(table.y~table.x) ##### find the slope of the histogram of y

  fit2<-lm(subY[[i]]~subZ[[i]]) ####### regress y on z
  s2[i]<-fit2$coefficients[2]
}

But my R died before completing the loop... (I've thought about doing it in SAS, but I don't know how to write a loop combined with a PROC REG...) One thing that might help is that my data set has already been sorted by X. I don't know whether this is of any help for managing the dataset.

Any suggestion would be appreciated!

-Shu

mailing list
PLEASE do read the posting guide and provide commented, minimal, self-contained, reproducible code.

Received on Mon 03 Mar 2008 - 08:30:42 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 03 Mar 2008 - 09:30:18 GMT.
