Re: [R] data input strategy - lots of csv files

From: Liaw, Andy <andy_liaw_at_merck.com>
Date: Thu 11 May 2006 - 20:49:41 EST


This is what I would try:

csvlist <- list.files(pattern="csv$")
bigblob <- lapply(csvlist, read.csv, ...)
## Get all dates that appear in any one of them.
all.dates <- unique(unlist(lapply(bigblob, "[[", 1)))
bigmatrix <- matrix(NA, length(all.dates), length(bigblob))
dimnames(bigmatrix) <- list(all.dates, whatevercolnamesyouwant)
## loop through bigblob and populate corresponding columns
## of bigmatrix with the matching dates.
for (i in seq(along=bigblob)) {
    bigmatrix[as.character(bigblob[[i]][, 1]), i] <-
        bigblob[[i]][, columnwithdata]
}

This is obviously untested, so hope it's of some help.
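
For what it's worth, the same idea can be written out as a small self-contained sketch. Everything about the demo data is an assumption made for illustration: the temporary directory, the file names a.csv, b.csv and c.csv, the values, and the guess that the dates are day/month/year. It also takes the data from column 2 of every file, so the second number in a file that has two is simply ignored here.

```r
## Hypothetical demo data: three tiny CSV files in a temporary directory.
## File names and values are made up; c.csv has two numbers per date.
demo.dir <- tempfile("csvdemo")
dir.create(demo.dir)
writeLines(c("01/06/05,10", "02/06/05,11"), file.path(demo.dir, "a.csv"))
writeLines(c("02/06/05,20", "03/06/05,21"), file.path(demo.dir, "b.csv"))
writeLines(c("01/06/05,30,99", "03/06/05,31,98"), file.path(demo.dir, "c.csv"))

## Read every file into one list instead of 63 separate objects.
csvlist <- list.files(demo.dir, pattern = "\\.csv$", full.names = TRUE)
bigblob <- lapply(csvlist, read.csv, header = FALSE, stringsAsFactors = FALSE)

## All dates that appear in any file, in order of first appearance.
all.dates <- unique(unlist(lapply(bigblob, "[[", 1)))

## One row per date, one column per file, NA where a file has no value.
bigmatrix <- matrix(NA_real_, length(all.dates), length(bigblob),
                    dimnames = list(all.dates,
                                    sub("\\.csv$", "", basename(csvlist))))

## Fill each column by matching the file's dates against the row names.
for (i in seq_along(bigblob)) {
    bigmatrix[bigblob[[i]][, 1], i] <- bigblob[[i]][, 2]
}

## Optional: a data frame with the date as its first column.
## The "%d/%m/%y" format is an assumption about the date layout.
result <- data.frame(date = as.Date(all.dates, format = "%d/%m/%y"),
                     bigmatrix)
rownames(result) <- NULL
print(result)
```

The rows come out in order of first appearance rather than sorted by date, so something like result[order(result$date), ] may be wanted afterwards.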

Andy

From: Sean O'Riordain
>
> Good morning,
> I have currently 63 .csv files most of which have lines which
> look like
> 01/06/05,23445
> Though some files have two numbers beside each date. There
> are missing values, and currently the longest file has 318 rows.
>
> (merge() is losing the head and doing runaway memory
> allocation - but that's another question - I'm still trying to
> pin that issue down and make a small repeatable example)
>
> Currently I'm reading in these files with lines like
> a1 <- read.csv("daft_file_name_1.csv",header=F)
> ...
> a63 <- read.csv("another_silly_filename_63.csv",header=F)
>
> and then I'm naming the columns in these like...
> names(a1)[2] <- "silly column name"
> ...
> names(a63)[2] <- "daft column name"
>
> then trying to merge()...
> atot <- merge(a1, a2, all=T)
> and then using language manipulation to loop
> atot <- merge(atot, a3, all=T)
> ...
> atot <- merge(atot, a63, all=T)
> etc...
>
> followed by more language manipulation
> for() {
> rm(a1)
> } etc...
>
> i.e.
> for (i in 2:63) {
>     atot <- merge(atot, eval(parse(text=paste("a", i, sep=""))), all=T)
>     # eval(parse(text=paste("a",i,"[1] <- NULL",sep="")))
>
>     cat("i is ", i, gc(), "\n")
>
>     # now delete these 63 temporary objects...
>     # e.g. should look like rm(a33)
>     eval(parse(text=paste("rm(a",i,")", sep="")))
> }
>
> eventually getting a dataframe with the first column being
> the date, and the subsequent 63 columns being the data...
> with missing values coded as NA...
>
> so my question is... is there a better strategy for reading
> in lots of small files (only a few kbytes each) like that
> which are timeseries with missing data... which doesn't go
> through the above awkwardness (and language manipulation) but
> still ends up with a nice data.frame with NA values correctly
> coded etc.
>
> Many thanks,
> Sean O'Riordain
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
>
>



Received on Thu May 11 20:55:14 2006

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 11 May 2006 - 22:10:05 EST.
