From: Gabor Grothendieck <ggrothendieck_at_gmail.com>

Date: Wed 07 Dec 2005 - 04:55:12 EST

R-help@stat.math.ethz.ch mailing list

https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Wed Dec 07 05:37:37 2005

Could you time these and see how each of them does:

# 1

ta.split <- strsplit(ta, split = ",")

ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))

# 2

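ta0 <- sub("^[^,]*,[^,]*,", "", ta)  # drop the first two comma-separated fields

# scan(text = ) parses the string itself; a bare string would be taken as a file name
ta.num <- lapply(ta0, function(x) scan(text = x, sep = ",", quiet = TRUE))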

# 3 - loop version of #1

n <- length(ta)

ta.split <- strsplit(ta, split = ",")

ta.num <- vector("list", n)  # preallocate n slots; list(length = n) would create a 1-element list

for(i in seq_len(n)) ta.num[[i]] <- as.numeric(ta.split[[i]][-(1:2)])

# 4 - loop version of #2

n <- length(ta)

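ta0 <- sub("^[^,]*,[^,]*,", "", ta)  # drop the first two comma-separated fields

ta.num <- vector("list", n)  # preallocate n slots; list(length = n) would create a 1-element list

for(i in seq_len(n)) ta.num[[i]] <- scan(text = ta0[i], sep = ",", quiet = TRUE)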
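
For anyone who wants to run the comparison, here is a rough timing sketch. It is only an illustration under assumptions: the data are synthetic lines shaped like the records in foo.csv (header already dropped), the helper names make_line and f1 to f4 are made up for this example, and scan() is called with its text argument so that each string is parsed directly. Timings on a real 1GB file may rank the approaches differently.

```r
# Synthetic stand-in for the lines of the file, header line already removed.
# The layout (name, start month, variable-length data) follows foo.csv;
# the values themselves are random and purely illustrative.
set.seed(1)
make_line <- function(i) {
  k <- sample(5:15, 1)  # number of data values in this record
  paste(c(paste0("Name", i), i %% 12 + 1, round(rnorm(k), 4)), collapse = ",")
}
ta <- vapply(seq_len(2000), make_line, character(1))

# 1: strsplit, then lapply over the pieces
f1 <- function(ta) {
  ta.split <- strsplit(ta, split = ",")
  lapply(ta.split, function(x) as.numeric(x[-(1:2)]))
}

# 2: strip the first two fields, then scan each remainder
f2 <- function(ta) {
  ta0 <- sub("^[^,]*,[^,]*,", "", ta)
  lapply(ta0, function(x) scan(text = x, sep = ",", quiet = TRUE))
}

# 3: loop version of 1, filling a preallocated list
f3 <- function(ta) {
  n <- length(ta)
  ta.split <- strsplit(ta, split = ",")
  ta.num <- vector("list", n)
  for (i in seq_len(n)) ta.num[[i]] <- as.numeric(ta.split[[i]][-(1:2)])
  ta.num
}

# 4: loop version of 2, filling a preallocated list
f4 <- function(ta) {
  n <- length(ta)
  ta0 <- sub("^[^,]*,[^,]*,", "", ta)
  ta.num <- vector("list", n)
  for (i in seq_len(n)) ta.num[[i]] <- scan(text = ta0[i], sep = ",", quiet = TRUE)
  ta.num
}

# Sanity check: the strsplit and scan routes should produce the same numbers.
stopifnot(isTRUE(all.equal(f1(ta), f2(ta))))

# Elapsed seconds for each approach on the same input.
timings <- sapply(list(f1 = f1, f2 = f2, f3 = f3, f4 = f4),
                  function(f) system.time(f(ta))[["elapsed"]])
print(timings)
```

To time the real data, replace the synthetic ta with something like readLines("foo.csv")[-1], which drops the header line before parsing.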

On 12/6/05, John McHenry <john_d_mchenry@yahoo.com> wrote:

> I should have mentioned that I already tried the readLines() approach:

>
> ta<-readLines("foo.csv")
> ptm<-proc.time()
> f<-character(length(ta))
> for (k in 2:length(ta)) { f[k-1]<-(strsplit(ta[k],",")[[1]])[3] }# <- PARSING EACH LINE AT THIS LEVEL IS WHERE THE REAL INEFFICIENCY IS
> (proc.time()-ptm)[3]
> [1] 102.75
>
> on a 62M file, so I'm guessing that on my 1GB files this will be about
>
> > (102.75*(1000/61))/60
> [1] 28.07377
>
> minutes...which is way, way too long.
>
> I'm new to R but I'm kind of surprised that this problem isn't well known (couldn't find anything after a long hunt).
>
> As I mentioned, MATLAB does it using textread which makes a call to its dll dataread. The data are read using something like:
>
> [name, startMonth, data]=textread(fileName,'%s%n%[^\n]', 'delimiter',',', 'bufsize', 1000000, 'headerlines',1);
>
> which is kind of fscanf-like. data in the above is then a cell array with each cell being the variable-length data.
>
> "Liaw, Andy" <andy_liaw@merck.com> wrote:
> Use file() connection in conjunction with readLines() and strsplit() should
> do it. I would try to count the number of lines in the file first, and
> create a list with that many components, then fill it in. I believe the
> "array of cells" in Matlab is sort of equivalent to a list in R, but that's
> beyond my knowledge of Matlab...
>
> Andy
>
> From: John McHenry
> >
> > I have very large csv files (up to 1GB each of ASCII text).
> > I'd like to be able to read them directly in to R. The
> > problem I am having is with the variable length of the data
> > in each record.
> >
> > Here's a (simplified) example:
> >
> > $ cat foo.csv
> > Name,Start Month,Data
> > Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
> > Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
> >
> > The records consist of rows with some set comma-separated
> > fields (e.g. the "Name" & "Start Month" fields in the above)
> > and then the data follow as a variable-length list of
> > comma-separated values until a new line is encountered.
> >
> > Now I can use e.g.
> >
> > fileName="foo.csv"
> > ta<-read.csv(fileName, header=F, skip=1, sep=",", dec=".", fill=T)
> >
> > which does the job nicely:
> >
> > V1 V2 V3 V4 V5 V6 V7 V8 V9
> > V10 V11 V12 V13 V14 V15 V16 V17
> > 1 Foo 10 -0.5615 2.3065 0.1589 -0.3649 1.5955 NA NA
> > NA NA NA NA NA NA NA NA
> > 2 Bar 21 0.0880 0.5733 0.0081 2.0253 -0.7602 0.7765 0.281
> > 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
> >
> >
> > but the problem is with files on the order of 1GB this
> > either crunches for ever or runs out of memory trying ...
> > plus having all those NAs isn't too pretty to look at.
> >
> > (I have a MATLAB version that can read this stuff into an
> > array of cells in about 3 minutes).
> >
> > I really want a fast way to read the data part into a list;
> > that way I can access data in the array of lists containing
> > the records by doing something ta[[i]]$data.
> >
> > Ideas?
> >
> > Thanks,
> >
> > Jack.


This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:41:30 EST