Re: [R] Mean/SD of Each Position in Table

From: Dennis Murphy <djmuser_at_gmail.com>
Date: Sun, 01 May 2011 11:53:12 -0700

Hi:

I would do something like the following:

(1) Create a vector of the file names.
(2) Use lapply() to read the files into a list.
(3) Use the reshape or reshape2 package to melt the individual files
into 'long' form.
(4) rbind together the resulting data frames.
(5) Use a summarization function to generate the means and standard deviations.

I created three data frames that have the structure you provided below and wrote them out to csv files. The following code creates a vector of file names, then uses lapply() to read the data files one after another and store each as a component of a list. Next, I create a small utility function that uses the reshape2 package to melt a data frame into 'long' form. The ldply() function from package plyr is then called to apply that function to each component of the list and bind the results together into a single data frame. Finally, the ddply() function in plyr is used to get the mean and standard deviation for each time/substance combination.
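To make 'long' form concrete, here is a minimal sketch on a made-up two-row table with the same layout as the test files. The data frame wide and its values are only for illustration, and the argument names assume a current version of reshape2.

```r
library(reshape2)

# Hypothetical two-row table laid out like the test files
wide <- data.frame(Substance1 = c(10, 9),
                   Substance2 = c(0, 5),
                   Time = c('Time1', 'Time2'))

# One row per Time/Substance pair; the cell values land in column 'y'
long <- melt(wide, id.vars = 'Time',
             variable.name = 'Substance', value.name = 'y')
long
```

Each of the four cells in the wide table becomes its own row, which is what lets the later grouping by Time and Substance work.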

#### Code to create test files for the example
# File creation for test files:

ds_create <- function(name) {
   # Ten time points by five substances, filled with Poisson counts
   times <- paste('Time', 1:10, sep = '')
   cnames <- paste('Substance', 1:5, sep = '')
   m <- matrix(rpois(50, 7), nrow = 10)
   colnames(m) <- cnames
   m <- as.data.frame(m)
   m$Time <- times
   # Write the data frame to <name>.csv in the working directory
   write.csv(m, file = paste(name, '.csv', sep = ''),
                  quote = FALSE, row.names = FALSE)
  }
nms <- paste('m', 1:3, sep = '')
sapply(nms, ds_create)
####

# Vector of file names

files <- paste('m', 1:3, '.csv', sep = '')
# Read the data frames into a list, where each data frame is a
# separate component
filelst <- lapply(files, read.csv, header = TRUE)

library(plyr)
library(reshape2)
# Function to melt a generic data frame
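f <- function(df) {
     # melt() dispatches to the data frame method; current reshape2 spells
     # the naming arguments variable.name and value.name
     melt(df, id.vars = 'Time', variable.name = 'Substance', value.name = 'y')
   }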
# Apply the function to each component of the list and rbind the
# results together
bigdf <- ldply(filelst, f)
# Obtain the mean and sd for each Time/Substance combination
bigsumm <- ddply(bigdf, .(Time, Substance), summarise, mean = mean(y), sd = sd(y))

# ----

Caveat: If you have the reshape package loaded, then at present the value-name assignment in the melt call will not go through and the last variable will be named 'value'. In that event, you can either rename 'value' to 'y' with
names(bigdf)[3] <- 'y'
or change 'y' to 'value' in the ddply() call. Check bigdf with
head(bigdf)
to verify that the names are 'Time', 'Substance' and 'y' before running the last command.
# ----
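If you would rather not depend on the column position, a small sketch like the following renames by matching the name instead. The data frame demo is a made-up stand-in for bigdf, not output from the code above.

```r
# Hypothetical stand-in for the melted data when the value column kept
# its default name
demo <- data.frame(Time = c('Time1', 'Time2'),
                   Substance = c('Substance1', 'Substance1'),
                   value = c(10, 9))

# Rename by matching the column name; nothing changes if 'y' already exists
names(demo)[names(demo) == 'value'] <- 'y'

# Stop early if the columns are not the ones ddply() will refer to
stopifnot(all(c('Time', 'Substance', 'y') %in% names(demo)))
```

Running the same two lines on bigdf should be harmless whichever package did the melting.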

The result I get is
> dim(bigsumm)
[1] 50 4
> head(bigsumm)
    Time  Substance      mean       sd
1  Time1 Substance1 10.333333 2.516611
2  Time1 Substance2 10.666667 1.154701
3  Time1 Substance3  6.000000 2.645751
4  Time1 Substance4  6.333333 1.154701
5  Time1 Substance5  5.333333 1.527525
6 Time10 Substance1  4.666667 3.055050
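
As a cross-check, the same cell-by-cell summaries can be had in base R by stacking the tables into a three-dimensional array. This is only a sketch of an alternative route, not the approach used above; the list tabs holds made-up 2 x 2 matrices in place of the real files.

```r
# Three hypothetical 2 x 2 tables standing in for the files read from disk
tabs <- list(
  matrix(c(1, 2, 3, 4), nrow = 2),
  matrix(c(3, 2, 5, 4), nrow = 2),
  matrix(c(5, 2, 7, 4), nrow = 2)
)

# Stack into a 2 x 2 x 3 array: rows, columns, file index
arr <- simplify2array(tabs)

# Summarise over the third dimension, keeping one value per cell
cell_mean <- apply(arr, c(1, 2), mean)
cell_sd   <- apply(arr, c(1, 2), sd)
```

With data frames read by read.csv(), the numeric columns would first have to be converted with as.matrix() so that every list component is a matrix of the same shape.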

The structure is what matters. You should be able to extend this template to your 100 data frames.
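
For 100+ files you will probably not want to build the file names by hand. A sketch, assuming all the csv files sit in one directory: the directory csv_demo and the three files written into it are made up so that the example runs on its own.

```r
# Hypothetical setup: write three small csv files into a temporary directory
data_dir <- file.path(tempdir(), 'csv_demo')
dir.create(data_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(Time = 'Time1', Substance1 = i),
            file = file.path(data_dir, paste('m', i, '.csv', sep = '')),
            row.names = FALSE)
}

# Collect every csv file in the directory instead of typing the names
files <- list.files(data_dir, pattern = '\\.csv$', full.names = TRUE)

# Read them into a list, one data frame per file
filelst <- lapply(files, read.csv, header = TRUE)
length(filelst)
```

Pointing data_dir at your own directory gives a filelst that can go straight into the ldply() step.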

HTH,
Dennis

On Sun, May 1, 2011 at 8:48 AM, Nemergut, Edward *HS <EN3X_at_hscmail.mcc.virginia.edu> wrote:

> I have 100+ .csv files which have the basic format:
>
> > test
>         X Substance1 Substance2 Substance3 Substance4 Substance5
> 1   Time1         10          0          0          0          0
> 2   Time2          9          5          0          0          0
> 3   Time3          8         10          1          0          0
> 4   Time4          7         20          2          1          0
> 5   Time5          6         25          3          2          1
> 6   Time6          5         30          4          2          2
> 7   Time7          4         25          5          3          3
> 8   Time8          3         20          6          3          4
> 9   Time9          2         15          5          3          5
> 10 Time10          1         10          4          4          6
>
> Each table is of exactly the same dimensions.  After reading each of the
> 100+ .csv files into R, I want to determine the mean and SD of each and every
> cell.  That is to say, I want to calculate the mean and SD for (Time1,Substance1)
> and every other cell from each of the 100+ .csv files.
>
> I imagine this is a fairly basic question, but my search has been
> unsuccessful.
>
> Thanks in advance,
> ECN
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Received on Thu 05 May 2011 - 06:25:07 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 05 May 2011 - 07:00:06 GMT.
