[R] reading in data with variable length

From: John McHenry <john_d_mchenry_at_yahoo.com>
Date: Wed 07 Dec 2005 - 01:04:10 EST

I have very large CSV files (up to 1GB each of ASCII text) that I'd like to read directly into R. The problem I am having is with the variable length of the data in each record.

  Here's a (simplified) example:    

  $ cat foo.csv
Name,Start Month,Data
Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114

  The records consist of rows with a fixed set of leading comma-separated fields (e.g. the "Name" and "Start Month" fields above), followed by the data as a variable-length list of comma-separated values that runs until a newline is encountered.
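
  To make that layout concrete, here is a minimal sketch of splitting one such record, assuming exactly two fixed fields come before the data. The name parse_record is hypothetical (it is not a base R function), and the element names name, start_month and data are just labels chosen for illustration.

```r
# Hypothetical helper: split one record line into its two fixed leading
# fields and its variable-length numeric tail.
parse_record <- function(line) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  list(
    name        = fields[1],
    start_month = as.integer(fields[2]),
    # everything after the first two fields is data; this is numeric(0)
    # when a record carries no data values at all
    data        = as.numeric(fields[-(1:2)])
  )
}

rec <- parse_record("Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955")
str(rec)
```

  Each record keeps only as many values as it actually has, so no NA padding is needed.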

  Now I can use e.g.    

ta <- read.csv(fileName, header=FALSE, skip=1, sep=",", dec=".", fill=TRUE)

  which does the job nicely:    

   V1 V2      V3     V4     V5      V6      V7     V8    V9    V10    V11    V12    V13     V14     V15    V16     V17
1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA     NA     NA     NA     NA      NA      NA     NA      NA
2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114

  but with files on the order of 1GB this either grinds away forever or runs out of memory trying. Having all those NAs also isn't pretty to look at.

  (I have a MATLAB version that can read this stuff into an array of cells in about 3 minutes).    

  What I really want is a fast way to read the data part into a list, so that I can access the data in the list of records by doing something like ta[[i]]$data.
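
  The kind of result I have in mind could be sketched roughly as below. This is only an illustration under stated assumptions, not a tested solution for 1GB files: read_records is a hypothetical name, chunk_size is an arbitrary tuning value, and the file is assumed to have one header line followed by records with two fixed fields each. It reads the file through a connection in chunks with readLines and splits each line with strsplit, rather than building a padded data frame.

```r
# Sketch only: read_records() and its chunk_size argument are hypothetical.
read_records <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1L)                 # skip the header line
  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break       # end of file
    lines <- lines[nzchar(lines)]        # drop empty lines
    parts <- strsplit(lines, ",", fixed = TRUE)
    # one list per record: two fixed fields, then the numeric data
    chunks[[length(chunks) + 1L]] <- lapply(parts, function(f) {
      list(name        = f[1],
           start_month = as.integer(f[2]),
           data        = as.numeric(f[-(1:2)]))
    })
  }
  if (length(chunks) == 0L) return(list())
  unlist(chunks, recursive = FALSE)      # flatten chunks into one list of records
}

# Usage on a small temporary copy of the example file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Start Month,Data",
  "Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955",
  "Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114"
), tmp)
ta <- read_records(tmp)
ta[[2]]$data
```

  Whether this is fast enough at the 1GB scale is something I have not measured; scan() on a connection might be worth comparing against.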





R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Wed Dec 07 01:21:05 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:41:29 EST