RE: [R] Another big data size problem

From: Henrik Bengtsson <hb_at_maths.lth.se>
Date: Thu 29 Jul 2004 - 01:04:57 EST


> -----Original Message-----
> From: r-help-bounces@stat.math.ethz.ch
> [mailto:r-help-bounces@stat.math.ethz.ch] On Behalf Of
> Federico Gherardini
> Sent: Wednesday, July 28, 2004 5:26 PM
> To: r-help@stat.math.ethz.ch
> Subject: Re: [R] Another big data size problem
>
>
> On Wed, 28 Jul 2004 13:28:20 +0100
> Ernesto Jardim <ernesto@ipimar.pt> wrote:
>
>
> > Hi,
> >
> > When you're writing a table to MySQL you have to be careful if the
> > table is created by RMySQL. The field definitions may not be the most
> > adequate and there will be no indexes in your table, which makes the
> > queries _very_ slow.
> >
> So, if I understood correctly, if you want to use SQL you'll
> have to upload the table in SQL, directly from MySQL without
> using R at all, and then use RMySQL to read the elements in R?
>
> Uwe Ligges <ligges@statistik.uni-dortmund.de> wrote:
>
> >Note that it is better to initialize the object to full size before
> >inserting -- rather than using rbind() and friends which is indeed slow
> >since it needs to re-allocate much memory for each step.
>
> Do you mean something like this?
>
> tab <- matrix(rep(0, 1227 * 20000), 1227, 20000, byrow = TRUE)
>
> for(i in 0:num.lines)
> tab[i + 1,] <- scan(file=fh, nlines=1, what="PS", skip = i)

It is better to open a file connection, keep it open during the loop and then close it afterwards. Something like

  tab <- matrix(rep(0, 1227 * 20000), 1227, 20000, byrow = TRUE)
  fh <- file(filename, open="r");
  for(i in 0:num.lines)
    tab[i + 1,] <- scan(file=fh, nlines=1);
  close(fh);

As you have done it, the file is opened once in each iteration of the loop: scan() starts reading from the beginning, parses all lines in order to skip 'i' of them, and then reads one line. This is done num.lines+1 times!
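
A self-contained sketch of the open-once pattern, using a tiny temporary file in place of the real data. The file contents, the 3 x 4 size and the names n.rows and n.cols are made up for illustration only:

```r
# Illustrative data only: three lines with four numbers each.
filename <- tempfile()
writeLines(c("1 2 3 4", "5 6 7 8", "9 10 11 12"), filename)

n.rows <- 3
n.cols <- 4

# Allocate the full matrix once, before reading anything.
tab <- matrix(0, nrow = n.rows, ncol = n.cols)

# Open the connection once; each scan() call then continues where
# the previous one stopped instead of re-reading from the top.
fh <- file(filename, open = "r")
for (i in seq_len(n.rows)) {
  tab[i, ] <- scan(file = fh, nlines = 1, quiet = TRUE)
}
close(fh)
unlink(filename)
```

Because the connection stays open, no 'skip' argument is needed, and each line of the file is parsed exactly once.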

Anyway, I think you also should read the help for scan(). What do you want with the argument 'what="PS"'? scan() uses only the type of 'what', so "PS" simply makes it read every field as a character string; 'what' does not specify the name of a field/column to be read.
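
For reference, a small sketch of how scan() treats 'what': it looks only at the type of the value it is given. The input string is made up, and the sketch assumes an R version whose scan() has the 'text' argument:

```r
# The type of 'what' decides the type of the result; its value is ignored.
x.num <- scan(text = "1 2 3", what = double(0), quiet = TRUE)  # numeric vector
x.chr <- scan(text = "1 2 3", what = "PS", quiet = TRUE)       # character vector
```

Assigning a character result into a numeric matrix would silently turn the whole matrix into character, which is probably not what you want here.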

> The above doesn't get very far either... it seems that, once
> it has created the table, it becomes so slow that it's
> unusable. I'll have to try this with more RAM by the way.

My suggestion is that you try read.table() and specify the data types of the columns using the vector argument 'colClasses'. This way you can help R by specifying that, say, column 3 is an integer (which takes half the memory of a double), and that columns 6-10 are doubles. Unfortunately you cannot tell read.table() to skip some of the columns that you are not interested in, which in your case would help you out a lot. To do this, you have to use scan(), which read.table() uses internally. In scan() 'what' works similarly to 'colClasses' *and* if you specify 'what' as a 'list' you can tell scan() to skip some columns by setting their 'what' value to NULL, e.g. what=list(integer(0), integer(0), NULL, double(0), character(0)), where the type of each element (not a type name in quotes) tells scan() what to read. I think you can get pretty far doing this!
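
A minimal sketch of both calls on a made-up five-column file; the file contents and the names df and cols are illustrative assumptions, not your data:

```r
# Illustrative data only: five whitespace-separated columns.
filename <- tempfile()
writeLines(c("1 10 skipme 0.5 a",
             "2 20 skipme 1.5 b",
             "3 30 skipme 2.5 c"), filename)

# read.table(): declare the type of every column up front.
df <- read.table(filename,
                 colClasses = c("integer", "integer", "character",
                                "numeric", "character"))

# scan(): a list-valued 'what' gives one template value per column;
# the type of each element is used, and a NULL entry skips that column.
cols <- scan(filename,
             what = list(integer(0), integer(0), NULL, double(0), character(0)),
             quiet = TRUE)
unlink(filename)
```

Here scan() returns a list with one vector per column that was kept, so the skipped third column never takes up memory as data.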

> Cheers,
>
> fede

Good luck!

Henrik Bengtsson



R-help@stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Thu Jul 29 01:30:55 2004
