Re: [R] Combining 4th column from 30 files

From: Henrik Bengtsson <hb_at_stat.berkeley.edu>
Date: Wed, 23 Jul 2008 11:25:03 -0700

A few things that will help you, if not now then in the future:

  1. Preallocate the result object. This allows you to avoid cbind()/rbind(), which create a new, larger copy of the result in each iteration. That will eventually bite you if you have a lot of data. In your case you know the number of files, but maybe not the number of rows; that can be inferred in the first iteration.
  2. Read only the columns you need. This will save memory and speed up the reading, especially for large data files. In read.table() you can specify 'colClasses' and set it to "NULL" for unwanted columns. If you know the number of columns in each file, say it is 23, then do: colClasses <- rep("NULL", 23); colClasses[4] <- "double" (if it is doubles you are reading).
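
To illustrate point 2, here is a minimal sketch. The temporary file, its three columns and the values in it are made up for the example; only read.table() and its 'colClasses' argument come from the advice above.

```r
# Hypothetical example file with 3 columns; we only want column 2.
pathname <- tempfile(fileext=".txt");
write.table(data.frame(a=1:3, b=c(0.5, 1.5, 2.5), c=letters[1:3]),
            file=pathname, row.names=FALSE, col.names=FALSE);

# "NULL" (the string, not the NULL object) tells read.table() to skip a column.
colClasses <- rep("NULL", 3);
colClasses[2] <- "double";

df <- read.table(pathname, colClasses=colClasses);

# Only the wanted column is returned, so it is column 1 of the result.
values <- df[,1];
```

Note that the skipped columns are dropped from the result entirely, which is why the wanted column is picked out with index 1 and not with its original position.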

This is how I would do it. It works for small and rather large data sets.

pathnames <- dir(pattern="data");
nbrOfFiles <- length(pathnames);
# Number of columns in each file; 23 is the example value used above.
nbrOfColumns <- 23;
colClasses <- rep("NULL", nbrOfColumns);
colClasses[4] <- "double";
res <- NULL;
for (kk in seq(length=nbrOfFiles)) {
  pathname <- pathnames[kk];
  # Only column 4 is read, so it is the first (and only) column returned.
  values <- read.table(pathname, colClasses=colClasses)[,1];
  if (is.null(res)) {
     # Allocate a matrix of the same data type as the data read.
     res <- matrix(values[1], nrow=length(values), ncol=nbrOfFiles);
  }
  res[,kk] <- values;
  rm(values);
}
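
For anyone who wants to try the preallocation approach without having the data files at hand, here is a self-contained sketch. The helper name readColumnFromFiles(), its arguments, and the three small test files are all made up for the illustration; it also assumes every file has the same number of rows.

```r
# Hypothetical helper (not from any package): read column 'column' from each
# file into a preallocated matrix, one matrix column per file.
readColumnFromFiles <- function(pathnames, column, nbrOfColumns) {
  colClasses <- rep("NULL", nbrOfColumns);
  colClasses[column] <- "double";
  res <- NULL;
  for (kk in seq_along(pathnames)) {
    values <- read.table(pathnames[kk], colClasses=colClasses)[,1];
    if (is.null(res)) {
      # Allocate once; the number of rows is inferred from the first file.
      res <- matrix(NA_real_, nrow=length(values), ncol=length(pathnames));
    }
    res[,kk] <- values;
  }
  res;
}

# Made-up test data: 3 files, each with 5 rows and 6 numeric columns.
dirname <- tempfile();
dir.create(dirname);
for (ii in 1:3) {
  x <- matrix(ii*100 + 1:30, nrow=5, ncol=6);
  write.table(x, file=file.path(dirname, sprintf("data%d.txt", ii)),
              row.names=FALSE, col.names=FALSE);
}

pathnames <- dir(dirname, pattern="data", full.names=TRUE);
res <- readColumnFromFiles(pathnames, column=4, nbrOfColumns=6);
```

If the files differ in their number of rows, the assignment res[,kk] <- values fails with an error, so mismatched files are reported instead of being silently recycled.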

My $.02

/Henrik

On Wed, Jul 23, 2008 at 4:24 AM, Henrique Dallazuanna <wwwhsd_at_gmail.com> wrote:
> Maybe:
>
> sapply(lapply(dir(pattern="data"), read.table), '[[', 4)
>
> On Wed, Jul 23, 2008 at 5:21 AM, Daren Tan <daren76_at_hotmail.com> wrote:
>>
>> Better approach than this brute force ?
>>
>> mm <- NULL
>> for (i in dir(pattern="data")) { m <- readTable(i); mm <- cbind(mm, m[,4]) }
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Henrique Dallazuanna
> Curitiba-Paraná-Brasil
> 25° 25' 40" S 49° 16' 22" O
>



Received on Wed 23 Jul 2008 - 18:30:00 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 23 Jul 2008 - 18:32:21 GMT.