Re: [R] reading in data with variable length

From: John McHenry <john_d_mchenry_at_yahoo.com>
Date: Wed 07 Dec 2005 - 07:04:50 EST


  With #1 and #3 everything has slowed down by about 50%. I can't run #2 and #4:

  > ta.num <- lapply(ta0, scan, sep = ",")
  Error in file(file, "r") : unable to open connection

  scan seems to want a file or a connection ...
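
  A minimal sketch of one possible workaround, assuming the error comes from scan() treating each string as a file name: wrap each string in a textConnection() so that scan() reads from a connection instead. The sample strings and the helper name read_values are made up for illustration.

```r
# Hypothetical sample input: lines with the first two fields already removed.
ta0 <- c("-0.5615,2.3065,0.1589", "0.0880,0.5733")

# Helper (name is an assumption): scan one comma-separated string
# through an in-memory text connection.
read_values <- function(x) {
  con <- textConnection(x)
  on.exit(close(con))            # always release the connection
  scan(con, sep = ",", quiet = TRUE)
}

ta.num <- lapply(ta0, read_values)
str(ta.num)
```

  Opening a connection per line has its own overhead, so this may well be slower than the strsplit() versions; it only shows how scan() can be pointed at a string. Recent versions of R also accept scan(text = x, sep = ",").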

Gabor Grothendieck <ggrothendieck@gmail.com> wrote:

Could you time these and see how each of them does:

# 1
ta.split <- strsplit(ta, split = ",")
ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))

# 2
ta0 <- sub("^[^,]*,[^,]*,", "", ta)
ta.num <- lapply(ta0, scan, sep = ",")

# 3 - loop version of #1
n <- length(ta)
ta.split <- strsplit(ta, split = ",")
ta.num <- vector("list", n)  # preallocate a list of length n
for(i in 1:n) ta.num[[i]] <- as.numeric(ta.split[[i]][-(1:2)])

# 4 - loop version of #2
n <- length(ta)
ta0 <- sub("^[^,]*,[^,]*,", "", ta)
ta.num <- vector("list", n)  # preallocate a list of length n
for(i in 1:n) ta.num[[i]] <- scan(ta0[[i]], sep = ",")
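
A minimal sketch of how such a comparison could be timed with system.time(). The synthetic lines, the helper names make_line and scan_string, and the use of textConnection() so that scan() accepts a string are all assumptions for illustration, not part of the code as originally posted.

```r
# Synthetic stand-in for the lines read from the CSV (assumed data).
set.seed(1)
make_line <- function(i) {
  n <- sample(3:20, 1)  # variable number of data values per record
  paste(c(paste0("Name", i), i %% 12 + 1, round(rnorm(n), 4)), collapse = ",")
}
ta <- vapply(1:2000, make_line, character(1))

# Approach in the style of #1: split every line once, drop the two leading fields.
t1 <- system.time({
  ta.split <- strsplit(ta, split = ",", fixed = TRUE)
  num1 <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))
})

# Approach in the style of #2: strip the two leading fields, then scan each
# remaining string through a text connection.
scan_string <- function(x) {
  con <- textConnection(x)
  on.exit(close(con))
  scan(con, sep = ",", quiet = TRUE)
}
t2 <- system.time({
  ta0 <- sub("^[^,]*,[^,]*,", "", ta)
  num2 <- lapply(ta0, scan_string)
})

print(rbind(strsplit = t1[1:3], scan = t2[1:3]))
```

Timings on a small synthetic sample will not necessarily carry over to a 1GB file, so this is only a harness for trying the variants side by side on real data.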

On 12/6/05, John McHenry wrote:
> I should have mentioned that I already tried the readLines() approach:
>
> ta<-readLines("foo.csv")
> ptm<-proc.time()
> f<-character(length(ta))
> for (k in 2:length(ta)) { f[k-1]<-(strsplit(ta[k],",")[[1]])[3] }# <- PARSING EACH LINE AT THIS LEVEL IS WHERE THE REAL INEFFICIENCY IS
> (proc.time()-ptm)[3]
> [1] 102.75
>
> on a 62M file, so I'm guessing that on my 1GB files this will be about
>
> > (102.75*(1000/61))/60
> [1] 28.07377
>
> minutes...which is way, way too long.
>
> I'm new to R but I'm kind of surprised that this problem isn't well known (couldn't find anything after a long hunt).
>
> As I mentioned, MATLAB does it using textread which makes a call to its dll dataread. The data are read using something like:
>
> [name, startMonth, data]=textread(fileName,'%s%n%[^\n]', 'delimiter',',', 'bufsize', 1000000, 'headerlines',1);
>
> which is kind of fscanf-like. data in the above is then a cell array with each cell being the variable-length data.
>
> "Liaw, Andy" wrote:
> Using a file() connection in conjunction with readLines() and strsplit()
> should do it. I would try to count the number of lines in the file first, and
> create a list with that many components, then fill it in. I believe the
> "array of cells" in Matlab is sort of equivalent to a list in R, but that's
> beyond my knowledge of Matlab...
>
> Andy
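
The suggestion above can be sketched roughly as follows, under several assumptions: the sample file is a shortened copy of the foo.csv example written to a temporary path, the whole file is read in one readLines() call rather than counted first, and the field names name, startMonth and data are borrowed from the MATLAB call for illustration.

```r
# Write a small sample file so the sketch is self-contained (assumed data).
path <- tempfile(fileext = ".csv")
writeLines(c("Name,Start Month,Data",
             "Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955",
             "Bar,21,0.0880,0.5733,0.0081"), path)

# Read all lines through a file() connection and drop the header line.
con <- file(path, "r")
lines <- readLines(con)
close(con)
lines <- lines[-1]

# Preallocate one list component per record, then fill it in.
records <- vector("list", length(lines))
for (i in seq_along(lines)) {
  fields <- strsplit(lines[i], ",", fixed = TRUE)[[1]]
  records[[i]] <- list(name       = fields[1],
                       startMonth = as.integer(fields[2]),
                       data       = as.numeric(fields[-(1:2)]))
}

records[[2]]$data  # variable-length data for the second record
```

Each record is then an ordinary R list, so records[[i]]$data gives the variable-length part without any NA padding.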
>
> From: John McHenry
> >
> > I have very large csv files (up to 1GB each of ASCII text).
> > I'd like to be able to read them directly in to R. The
> > problem I am having is with the variable length of the data
> > in each record.
> >
> > Here's a (simplified) example:
> >
> > $ cat foo.csv
> > Name,Start Month,Data
> > Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
> > Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
> >
> > The records consist of rows with some set comma-separated
> > fields (e.g. the "Name" & "Start Month" fields in the above)
> > and then the data follow as a variable-length list of
> > comma-separated values until a new line is encountered.
> >
> > Now I can use e.g.
> >
> > fileName="foo.csv"
> > ta<-read.csv(fileName, header=F, skip=1, sep=",", dec=".", fill=T)
> >
> > which does the job nicely:
> >
> >    V1 V2      V3     V4     V5      V6      V7     V8    V9
> > 1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA
> > 2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281
> >      V10    V11    V12    V13     V14     V15    V16     V17
> > 1     NA     NA     NA     NA      NA      NA     NA      NA
> > 2 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
> >
> >
> > but the problem is with files on the order of 1GB this
> > either crunches for ever or runs out of memory trying ...
> > plus having all those NAs isn't too pretty to look at.
> >
> > (I have a MATLAB version that can read this stuff into an
> > array of cells in about 3 minutes).
> >
> > I really want a fast way to read the data part into a list;
> > that way I can access data in the array of lists containing
> > the records by doing something like ta[[i]]$data.
> >
> > Ideas?
> >
> > Thanks,
> >
> > Jack.
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html

Received on Wed Dec 07 07:37:25 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:41:30 EST