Re: [R] Suggestion for big files [was: Re: A comment about R:]

From: Wensui Liu <liuwensui_at_gmail.com>
Date: Sat 07 Jan 2006 - 02:23:58 EST

RG,

Actually, SQLite can import a *.csv file directly into a database.

Just for your consideration.

On 1/5/06, ronggui <ronggui.huang@gmail.com> wrote:
>
> 2006/1/6, jim holtman <jholtman@gmail.com>:
> > If what you are reading in is numeric data, then it would require (807 *
> > 118519 * 8) roughly 765 MB just to store a single copy of the object --
> > more memory than you have on your computer. If you were reading it in,
> > then the problem is the paging that was occurring.
> In fact, if I read it in 3 pieces, each is about 170 MB.
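
For reference, the arithmetic behind that estimate can be checked in R. This is only a sketch that assumes every one of the 807 columns is stored as a double (8 bytes per value); the variable names are mine, not from the thread.

```r
# Assumption: all 807 columns are doubles, 8 bytes per value.
n_vars <- 807
n_obs  <- 118519
bytes  <- n_vars * n_obs * 8

bytes            # 765158664 bytes
bytes / 1e6      # about 765 decimal megabytes
bytes / 1024^2   # about 730 MiB
```

Either way, one copy of the object is larger than the 512 MB of RAM mentioned, and R often needs more than one copy while reading.
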
>
> >
> > You have to look at storing this in a database and working on a subset of
> > the data. Do you really need to have all 807 variables in memory at the
> > same time?
>
> Yes, I don't need all the variables, but I don't know how to get just the
> necessary ones into R.
>
> In the end I read the data in pieces, used the RSQLite package to write
> them to a database, and then did the analysis. If I were familiar with
> database software, using a database (with R) would be the best choice, but
> converting the file into a database format is not an easy job for me. I
> asked for help on the SQLite list, but the solution was not satisfying
> because it required knowledge of a third scripting language. After
> searching the internet, I found this solution:
>
> #begin
> library(RSQLite)
>
> # Forward slashes avoid invalid escapes such as "\w" in R strings.
> f <- file("D:/wvsevs_sb_v4.csv", "r")
> con <- dbConnect(SQLite(), dbname = "c:/sqlite/database.db3")
> tim1 <- Sys.time()
> chunk <- 2500   # data lines read per pass
>
> # Read the header line once so every chunk gets the same column names.
> hdr <- scan(text = readLines(f, 1), what = "", sep = ",", quote = "",
>             quiet = TRUE)
>
> repeat {
>   tt <- readLines(f, chunk)
>   if (length(tt) == 0) break   # end of file reached
>   dat <- read.table(text = tt, header = FALSE, sep = ",", quote = "",
>                     col.names = hdr)
>   # Create the table on the first chunk, append to it afterwards.
>   dbWriteTable(con, "wvs", dat, append = dbExistsTable(con, "wvs"))
> }
> close(f)
> dbDisconnect(con)
> print(Sys.time() - tim1)   # elapsed time
> #end
> It's not the best solution, but it works.
>
>
>
> > If you use 'scan', you could specify that you do not want some of the
> > variables read in, so it might make a more reasonably sized object.
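
A minimal sketch of that idea, using a made-up four-column file rather than the real data (the file contents and column names are assumptions for illustration only). In scan(), a NULL entry in the 'what' list skips that field; read.csv() offers the same thing through colClasses = "NULL".

```r
# Hypothetical 4-column CSV written to a temporary file for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,age,income,region",
             "1,34,52000,north",
             "2,51,61000,south",
             "3,27,38000,east"), tmp)

# scan(): a NULL component in 'what' means "skip this field".
cols <- scan(tmp, what = list(id = 0, age = NULL, income = 0, region = NULL),
             sep = ",", skip = 1, quiet = TRUE)
dat_scan <- as.data.frame(cols[c("id", "income")])

# read.csv(): the string "NULL" in colClasses drops that column.
dat_csv <- read.csv(tmp, colClasses = c("numeric", "NULL", "numeric", "NULL"))

dat_scan
dat_csv
```

With 807 variables one would presumably build the 'what' list or the colClasses vector programmatically from the header line instead of typing it out.
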
> >
> >
> > On 1/5/06, François Pinard <pinard@iro.umontreal.ca> wrote:
> > > [ronggui]
> > >
> > > >R is weak when handling large data files. I have a data file with 807
> > > >vars and 118519 obs., in CSV format. Stata can read it in 2 minutes,
> > > >but on my PC R can hardly handle it. My PC has a 1.7 GHz CPU and
> > > >512 MB of RAM.
> > >
> > > Just (another) thought. I used to use SPSS, many, many years ago, on
> > > CDC machines, where the CPU had limited memory and no kind of paging
> > > architecture. Files did not need to be very large for being too large.
> > >
> > > SPSS had a feature that was then useful, about the capability of
> > > sampling a big dataset directly at file read time, quite before
> > > processing starts. Maybe something similar could help in R (that is,
> > > instead of reading the whole data in memory, _then_ sampling it.)
> > >
> > > One can read records from a file, up to a preset amount of them. If the
> > > file happens to contain more records than that preset number (the number
> > > of records in the whole file is not known beforehand), already read
> > > records may be dropped at random and replaced by other records coming
> > > from the file being read. If the random selection algorithm is properly
> > > chosen, it can be made so that all records in the original file have
> > > equal probability of being kept in the final subset.
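
What is described here is usually called reservoir sampling. The sketch below is one simple variant (often called Algorithm R), not anything built into R's reading routines; 'reservoir_sample_lines' is a hypothetical helper name, and the small temporary file is made-up data standing in for the real one.

```r
# Reservoir sampling (Algorithm R) over the lines of an open text connection.
# 'reservoir_sample_lines' is a hypothetical helper, not part of R.
reservoir_sample_lines <- function(con, k, header = TRUE) {
  hdr  <- if (header) readLines(con, n = 1) else character(0)
  keep <- character(0)
  seen <- 0
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break       # end of file
    seen <- seen + 1
    if (seen <= k) {
      keep[seen] <- line               # fill the reservoir first
    } else {
      j <- sample.int(seen, 1)         # uniform on 1..seen
      if (j <= k) keep[j] <- line      # keep new record with prob. k/seen
    }
  }
  list(header = hdr, lines = keep, n_seen = seen)
}

# Illustration on a small temporary file (made-up data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,x", paste(1:1000, (1:1000) * 2, sep = ",")), tmp)

set.seed(1)
f <- file(tmp, "r")
res <- reservoir_sample_lines(f, k = 50)
close(f)

samp <- read.table(text = c(res$header, res$lines), header = TRUE, sep = ",")
nrow(samp)   # number of sampled rows; res$n_seen is the total read
```

Only k lines are held in memory at any time, and the file length need not be known in advance. Reading one line per call is slow in R, so a practical version would probably read in blocks and apply the same replacement rule to each line of the block.
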
> > >
> > > If such a sampling facility was built right within usual R reading
> > > routines (triggered by an extra argument, say), it could offer
> > > a compromise for processing large files, and also sometimes accelerate
> > > computations for big problems, even when memory is not at stake.
> > >
> > > --
> > > François Pinard http://pinard.progiciels-bpi.ca
> > >
> > > ______________________________________________
> > > R-help@stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
> > >
> >
> >
> >
> > --
> > Jim Holtman
> > Cincinnati, OH
> > +1 513 247 0281
> >
> > What is the problem you are trying to solve?
>
>
> --
> 黄荣贵
> Department of Sociology
> Fudan University
>

--
WenSui Liu
(http://statcompute.blogspot.com)
Senior Decision Support Analyst
Health Policy and Clinical Effectiveness
Cincinnati Children Hospital Medical Center




Received on Sat Jan 07 02:36:37 2006

This archive was generated by hypermail 2.1.8 : Sat 07 Jan 2006 - 06:07:00 EST