Re: [R] Suggestion for big files [was: Re: A comment about R:]

From: Wensui Liu <liuwensui_at_gmail.com>
Date: Sat 07 Jan 2006 - 14:09:36 EST

RG,

I think the .import command in SQLite should work. Also, SQLite Browser (http://sqlitebrowser.sourceforge.net) might do the job as well.
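
For reference, a minimal sketch of the .import route in the sqlite3
command-line shell (table and file names follow the thread; the column
list is abbreviated, this assumes simple, unquoted CSV, and .import
will also load the header line as a data row unless it is stripped
from the file first):

sqlite> CREATE TABLE wvs (v1, v2, v3 /* ...one name per CSV column... */);
sqlite> .separator ","
sqlite> .import wvsevs_sb_v4.csv wvs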

On 1/6/06, ronggui <ronggui.huang@gmail.com> wrote:
>
> Can you give me some hints, or let me know how to do it?
>
> Thank you!
>
> 2006/1/6, Wensui Liu <liuwensui@gmail.com>:
> > RG,
> >
> > Actually, SQLite provides a way to read a *.csv file directly into the
> > database.
> >
> > Just for your consideration.
> >
> >
> > On 1/5/06, ronggui <ronggui.huang@gmail.com> wrote:
> > > 2006/1/6, jim holtman <jholtman@gmail.com>:
> > > > If what you are reading in is numeric data, then it would require
> > > > (807 * 118519 * 8 bytes) about 760MB just to store a single copy
> > > > of the object -- more memory than you have on your computer. If
> > > > you were reading it in, then the problem is the paging that was
> > > > occurring.
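
A quick check of that estimate (a one-liner, not from the thread):

807 * 118519 * 8   # one numeric copy: 765158664 bytes, ~730 MiB (~765 MB)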
> > > In fact, if I read it in 3 pieces, each is about 170MB.
> > >
> > > >
> > > > You have to look at storing this in a database and working on a
> > > > subset of the data. Do you really need to have all 807 variables
> > > > in memory at the same time?
> > >
> > > Yes, I don't need all the variables, but I don't know how to get
> > > the necessary variables into R.
> > >
> > > In the end I read the data in pieces and used the RSQLite package
> > > to write it to a database, and then did the analysis. If I were
> > > familiar with database software, using a database (and R) would be
> > > the best choice, but converting the file into database format is
> > > not an easy job for me. I asked for help on the SQLite list, but
> > > the solution was not satisfying, as it required knowledge of a
> > > third scripting language. After searching the internet, I came up
> > > with this solution:
> > >
> > > #begin
> > > rm(list = ls())
> > > library(RSQLite)
> > >
> > > # Use forward slashes (or doubled backslashes) in Windows paths.
> > > f <- file("D:/wvsevs_sb_v4.csv", "r")
> > > con <- dbConnect(dbDriver("SQLite"), "c:/sqlite/database.db3")
> > >
> > > i <- 0
> > > done <- FALSE
> > > tim1 <- Sys.time()   # start time, to see how long the load takes
> > >
> > > while (!done) {
> > >     i <- i + 1
> > >     tt <- readLines(f, 2500)             # read 2500 lines per chunk
> > >     if (length(tt) < 2500) done <- TRUE  # a short read means EOF
> > >     if (length(tt) == 0) break           # file size was a multiple of 2500
> > >     tt <- textConnection(tt)
> > >     if (i == 1) {
> > >         dat <- read.table(tt, header = TRUE, sep = ",", quote = "")
> > >         nms <- names(dat)    # remember the header for later chunks
> > >     } else {
> > >         dat <- read.table(tt, header = FALSE, sep = ",", quote = "")
> > >         names(dat) <- nms    # so appended chunks match the table columns
> > >     }
> > >     close(tt)
> > >     if (dbExistsTable(con, "wvs")) {
> > >         dbWriteTable(con, "wvs", dat, append = TRUE)
> > >     } else {
> > >         dbWriteTable(con, "wvs", dat)
> > >     }
> > > }
> > > close(f)
> > > #end
> > > It's not the best solution, but it works.
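
Once the table is built, only the variables actually needed can be
pulled back into R with a query. A sketch (the column names v9, v24
and v105 are placeholders, not the real variable names):

# Fetch just the wanted columns; the other 800-odd stay on disk.
sub <- dbGetQuery(con, "SELECT v9, v24, v105 FROM wvs")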
> > >
> > >
> > >
> > > > If you use 'scan', you could specify that you do not want some of
> > > > the variables read in, so it might make a more reasonably sized
> > > > object.
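
A sketch of that 'scan' approach (assuming numeric fields; the kept
positions 1, 5 and 12 are arbitrary examples): in scan()'s 'what'
list, a NULL component makes the corresponding field be skipped, so
the unwanted columns never take up memory.

# Skip all 807 fields except numbers 1, 5 and 12.
spec <- rep(list(NULL), 807)
spec[c(1, 5, 12)] <- list(0)   # 0 marks a numeric field to keep
dat <- scan("D:/wvsevs_sb_v4.csv", what = spec, sep = ",", skip = 1)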
> > > >
> > > >
> > > > On 1/5/06, François Pinard <pinard@iro.umontreal.ca> wrote:
> > > > > [ronggui]
> > > > >
> > > > > >R is weak when handling large data files. I have a data file:
> > > > > >807 vars, 118519 obs, in CSV format. Stata can read it in in 2
> > > > > >minutes, but on my PC R almost cannot handle it. My PC has a
> > > > > >1.7GHz CPU and 512MB RAM.
> > > > >
> > > > > Just (another) thought. I used to use SPSS, many, many years
> > > > > ago, on CDC machines, where the CPU had limited memory and no
> > > > > kind of paging architecture. Files did not need to be very
> > > > > large to be too large.
> > > > >
> > > > > SPSS had a feature that was useful then: the capability of
> > > > > sampling a big dataset directly at file read time, before
> > > > > processing starts. Maybe something similar could help in R
> > > > > (that is, instead of reading the whole data into memory,
> > > > > _then_ sampling it).
> > > > >
> > > > > One can read records from a file, up to a preset number of
> > > > > them. If the file happens to contain more records than that
> > > > > preset number (the number of records in the whole file is not
> > > > > known beforehand), already-read records may be dropped at
> > > > > random and replaced by other records coming from the file
> > > > > being read. If the random selection algorithm is properly
> > > > > chosen, it can be made so that all records in the original
> > > > > file have equal probability of being kept in the final subset.
> > > > >
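
What this paragraph describes is the classic "reservoir sampling"
algorithm. A minimal sketch in R -- not from the thread, and the
function name is made up:

# Keep a uniform random sample of n lines from a file whose total
# number of lines is not known in advance.
reservoir_lines <- function(filename, n) {
    con <- file(filename, "r")
    on.exit(close(con))
    keep <- readLines(con, n)          # fill the reservoir first
    seen <- length(keep)
    repeat {
        line <- readLines(con, 1)
        if (length(line) == 0) break   # end of file
        seen <- seen + 1
        j <- sample(seen, 1)           # slot drawn uniformly from 1..seen
        if (j <= n) keep[j] <- line    # replace with probability n/seen
    }
    keep
}

Every record ends up in the result with probability n divided by the
total record count, which is exactly the property described above. A
header line would have to be read off separately before sampling.
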
> > > > > If such a sampling facility were built right into the usual R
> > > > > reading routines (triggered by an extra argument, say), it
> > > > > could offer a compromise for processing large files, and also
> > > > > sometimes accelerate computations for big problems, even when
> > > > > memory is not at stake.
> > > > >
> > > > > --
> > > > > François Pinard http://pinard.progiciels-bpi.ca
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jim Holtman
> > > > Cincinnati, OH
> > > > +1 513 247 0281
> > > >
> > > > What is the problem you are trying to solve?
> > >
> > >
> > > --
> > > 黄荣贵
> > > Department of Sociology
> > > Fudan University
> > >
> >
> >
> >
> > --
> > WenSui Liu
> > (http://statcompute.blogspot.com)
> > Senior Decision Support Analyst
> > Health Policy and Clinical Effectiveness
> > Cincinnati Children's Hospital Medical Center
> >
>
>
> --
> 黄荣贵
> Department of Sociology
> Fudan University
>

--
WenSui Liu
(http://statcompute.blogspot.com)
Senior Decision Support Analyst
Health Policy and Clinical Effectiveness
Cincinnati Children's Hospital Medical Center




R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Received on Sat Jan 07 14:17:08 2006

This archive was generated by hypermail 2.1.8 : Sat 07 Jan 2006 - 18:07:56 EST