Re: [R] Suggestion for big files [was: Re: A comment about R:]

From: Kort, Eric <Eric.Kort_at_vai.org>
Date: Fri 06 Jan 2006 - 03:09:47 EST


> -----Original Message-----
>
> [ronggui]
>
> >R is weak when handling large data files. I have a data file: 807 vars,
> >118519 obs., in CSV format. Stata can read it in 2 minutes, but on
> >my PC, R can hardly handle it. My PC's CPU is 1.7 GHz; RAM is 512 MB.
>
> Just (another) thought. I used to use SPSS, many, many years ago, on
> CDC machines, where the CPU had limited memory and no kind of paging
> architecture. Files did not need to be very large to be too large.
>
> SPSS had a feature that was useful then: the ability to sample a big
> dataset directly at file read time, well before processing starts.
> Maybe something similar could help in R (that is, instead of reading
> the whole data set into memory, _then_ sampling it).
>
> One can read records from a file, up to a preset amount of them. If the
> file happens to contain more records than that preset number (the number
> of records in the whole file is not known beforehand), already read
> records may be dropped at random and replaced by other records coming
> from the file being read. If the random selection algorithm is properly
> chosen, it can be made so that all records in the original file have
> equal probability of being kept in the final subset.
>
> If such a sampling facility were built right into the usual R reading
> routines (triggered by an extra argument, say), it could offer
> a compromise for processing large files, and also sometimes accelerate
> computations for big problems, even when memory is not at stake.
>
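For illustration: the read-time sampling scheme quoted above is commonly known as reservoir sampling. Below is a minimal sketch in Python (not R, and not an existing R facility), assuming the records arrive as any iterable, such as the lines of an open file. The function name `reservoir_sample` and its parameters are hypothetical, chosen only for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform random sample of at most k records in a single pass.

    The total number of records does not need to be known beforehand,
    and memory use is bounded by k records.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Record number i+1 is kept with probability k / (i + 1),
            # replacing a uniformly chosen record already held.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir
```

With a hypothetical file `big.csv`, one could read the header line first and then pass the open file object as `records`, so that only `k` data lines are ever held in memory at once.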

Since I often work with images and other large data sets, I have been thinking about a "BLOb" (binary large object--though it wouldn't necessarily have to be binary) package for R--one that would handle I/O for such creatures and only bring as much data into the R space as was actually needed.
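
As a rough sketch of the "only bring in what is needed" idea, here is one way it might look in Python, assuming a headerless file in which every row occupies the same number of bytes. The layout, the function name `read_rows`, and its parameters are all assumptions made for illustration; no such R package is implied.

```python
def read_rows(path: str, row_bytes: int, first_row: int, n_rows: int) -> bytes:
    """Read n_rows fixed-width rows starting at first_row (0-based).

    Only the requested byte range is read from disk; the rest of the
    file is never loaded into memory.
    """
    with open(path, "rb") as f:
        # Jump directly to the first requested row.
        f.seek(first_row * row_bytes)
        # Read at most the requested rows (fewer if the file ends early).
        return f.read(n_rows * row_bytes)
```

The same seek-and-read pattern would apply to a band of scan lines from a raw image, as long as the row width in bytes is known.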

So I see 3 possibilities:

  1. The sort of functionality you describe is implemented in the R internals (by people other than me).
  2. Some individuals (perhaps myself included) write such a package.
  3. This thread fizzles out and we do nothing.

I guess I will wait to see what discussion, if any, ensues from this point before deciding which of these three options seems worth pursuing.

> --
> François Pinard http://pinard.progiciels-bpi.ca
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html



Received on Fri Jan 06 03:14:37 2006

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:41:51 EST