Re: [R] Seeking a more efficient way to read in a file

From: Charilaos Skiadas <cskiadas_at_gmail.com>
Date: Wed, 2 Jan 2008 20:42:21 -0500

On Jan 2, 2008, at 6:05 PM, Talbot Katz wrote:

> Hi.
>
> I have a matrix stored in a large, tab-delimited flat file. The
> first row contains column names. Because the matrix is symmetric,
> the file has lower triangular format, so the second row contains
> one number, the third row two numbers, etc. In general, row k+1
> contains k numbers; the matrix has 3000 rows, so the file has 3001
> rows. The file has variable length records, so each row ends with
> its last piece of data. I read in the file and produced the full
> symmetric matrix as follows:
>
>> mana01 <- scan( file = "C:/mat.dat", sep = "\t", nlines = 1, what = "character" )
> Read 3000 items
>> nco <- length( mana01 )
>> malt <- matrix(0, nrow = nco, ncol = nco )
>> colnames( malt ) <- mana01
>> rownames( malt ) <- mana01
>> for ( i in 1:3000 ) { malt[ i, (1:i) ] <- scan( file="C:/mat.dat", skip = i, n = i, quiet = TRUE ) }
>> mat <- malt + t( malt ) - diag( diag( malt ) )
>
> The for loop took a couple of hours to complete. I suspect there's
> a much faster way to do this. Any suggestions? Thanks!

I saw Jim's reply just after I had finished writing a solution of my own, so here is my take on it. The key thing, as Jim mentioned, is not to call scan once per row (each call with skip = i has to read past the first i lines again), but to read the whole file in once and then process it. I read the lines, used strsplit to turn each line into a vector of its fields, and then used sapply after padding each row with the right number of zeros.

Not sure which of the two is faster.

nms <- scan("~/Desktop/testing.txt", sep="\t", nlines=1, what=character(0))
# read the remaining lines as a character vector, one element per line
x <- scan("~/Desktop/testing.txt", sep="\n", skip=1, what=character(0))
splt <- strsplit(x, "\t") # split at the tabs
nr <- length(nms)
# extend each row by the right number of zeros
splt <- sapply(splt, function(x) c(as.numeric(x), rep(0, nr - length(x))))
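
The code above stops once the rows have been padded. What follows is only a sketch of how the remaining step might look, not part of the original reply. It assumes a small made-up 3x3 file written to a temporary path (the names path, upper and mat are hypothetical). Because sapply binds each padded row as a column, its result is the transpose of the lower triangle, so the full matrix can be rebuilt with the same symmetrization used in the question.

```r
# Hypothetical 3x3 test file in the same lower-triangular layout
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1", "2\t3", "4\t5\t6"), path)

nms <- scan(path, sep = "\t", nlines = 1, what = character(0), quiet = TRUE)
x <- scan(path, sep = "\n", skip = 1, what = character(0), quiet = TRUE)
nr <- length(nms)
splt <- strsplit(x, "\t")

# Each padded row becomes one column, so this matrix is upper triangular
upper <- sapply(splt, function(row) c(as.numeric(row), rep(0, nr - length(row))))

# Add the transpose and subtract the diagonal, which would otherwise be counted twice
mat <- upper + t(upper) - diag(diag(upper))
dimnames(mat) <- list(nms, nms)
```

On the real file, the same steps would apply with path pointing at the data file instead of the temporary one.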

Haris Skiadas
Department of Mathematics and Computer Science
Hanover College



R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 03 Jan 2008 - 01:44:38 GMT

