Re: [R] reading in data with variable length

From: John McHenry <john_d_mchenry_at_yahoo.com>
Date: Wed 07 Dec 2005 - 02:34:23 EST


I should have mentioned that I already tried the readLines() approach:

ta <- readLines("foo.csv")
ptm <- proc.time()
f <- character(length(ta))
for (k in 2:length(ta)) { f[k-1] <- (strsplit(ta[k], ",")[[1]])[3] }  # <- parsing each line at this level is where the real inefficiency is
(proc.time() - ptm)[3]
[1] 102.75

on a 62MB file, so I'm guessing that on my 1GB files this will take about

> (102.75*(1000/61))/60
[1] 28.07377

minutes ... which is way, way too long.
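
A vectorized rewrite of the loop may help, since strsplit() accepts a whole character vector in one call (a sketch of the idea, untimed, assuming the third field is still what's wanted):

# Vectorized sketch: split every record in a single strsplit() call;
# ta[-1] drops the header line, matching the loop's 2:length(ta).
f <- sapply(strsplit(ta[-1], ",", fixed = TRUE), function(x) x[3])

Using fixed = TRUE avoids regular-expression overhead in the split; whether that is enough of a saving on a 1GB file is an open question.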

I'm new to R, but I'm kind of surprised that this problem isn't well known (I couldn't find anything after a long hunt).

As I mentioned, MATLAB does it using textread, which calls its dataread DLL. The data are read using something like:

[name, startMonth, data] = textread(fileName, '%s%n%[^\n]', 'delimiter', ',', 'bufsize', 1000000, 'headerlines', 1);

which is kind of fscanf-like. data in the above is then a cell array, with each cell holding one record's variable-length data.
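
A rough R analogue of that textread call might look like the following (a sketch only, assuming the foo.csv layout quoted below; R has no direct equivalent of MATLAB's format strings, so each line is split once and the pieces regrouped):

# Sketch: emulate textread's three outputs with one strsplit() pass.
ta    <- readLines("foo.csv")
parts <- strsplit(ta[-1], ",", fixed = TRUE)   # skip the header line
name       <- sapply(parts, function(x) x[1])
startMonth <- as.numeric(sapply(parts, function(x) x[2]))
data       <- lapply(parts, function(x) as.numeric(x[-(1:2)]))  # variable-length tails

Here data is a list, playing the role of the MATLAB cell array.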

"Liaw, Andy" <andy_liaw@merck.com> wrote:   Use file() connection in conjunction with readLines() and strsplit() should do it. I would try to count the number of lines in the file first, and create a list with that many components, then fill it in. I believe the "array of cells" in Matlab is sort of equivalent to a list in R, but that's beyond my knowledge of Matlab...

Andy
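
In code, Andy's suggestion might look something like this (a minimal sketch: it reads every line in one go, whereas for 1GB files one would probably read and parse in chunks from the connection):

# Sketch of the pre-allocated-list approach.
con <- file("foo.csv", open = "r")
readLines(con, n = 1)                 # read and discard the header line
lines <- readLines(con)               # for huge files, loop over readLines(con, n = 10000)
close(con)
ta <- vector("list", length(lines))   # pre-allocate one component per record
for (i in seq_along(lines)) {
  x <- strsplit(lines[i], ",", fixed = TRUE)[[1]]
  ta[[i]] <- list(name       = x[1],
                  startMonth = as.numeric(x[2]),
                  data       = as.numeric(x[-(1:2)]))
}
ta[[2]]$data                          # the variable-length data for record 2

This gives exactly the ta[[i]]$data access pattern asked for in the original message below.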

From: John McHenry
>
> I have very large csv files (up to 1GB each of ASCII text).
> I'd like to be able to read them directly in to R. The
> problem I am having is with the variable length of the data
> in each record.
>
> Here's a (simplified) example:
>
> $ cat foo.csv
> Name,Start Month,Data
> Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
> Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114
>
> The records consist of rows with some set comma-separated
> fields (e.g. the "Name" & "Start Month" fields in the above)
> and then the data follow as a variable-length list of
> comma-separated values until a new line is encountered.
>
> Now I can use e.g.
>
> fileName="foo.csv"
> ta<-read.csv(fileName, header=F, skip=1, sep=",", dec=".", fill=T)
>
> which does the job nicely:
>
>    V1 V2      V3     V4     V5      V6      V7     V8    V9    V10
> 1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA     NA
> 2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281 1.8546
>      V11    V12    V13     V14     V15    V16     V17
> 1     NA     NA     NA      NA      NA     NA      NA
> 2 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114
>
>
> but the problem is with files on the order of 1GB this
> either crunches for ever or runs out of memory trying ...
> plus having all those NAs isn't too pretty to look at.
>
> (I have a MATLAB version that can read this stuff into an
> array of cells in about 3 minutes).
>
> I really want a fast way to read the data part into a list;
> that way I can access data in the array of lists containing
> the records by doing something like ta[[i]]$data.
>
> Ideas?
>
> Thanks,
>
> Jack.
