[R] Is aggregate() what I need here?

From: Tim Cutts <tjrc_at_sanger.ac.uk>
Date: Fri 04 Mar 2005 - 18:46:13 EST


I'm pretty new to R, and I've been given a script by a user who wants some help with it. I know enough about the way R works to know that this is a very inefficient way to do what the user wants (the LSB_JOBINDEX stuff is added by me so that this can work on many hundreds of input data files as LSF jobs - it's the nested loops I'm really interested in):

gtpfile<-paste("genotype_chunk.", sep="", Sys.getenv("LSB_JOBINDEX")) snps<-read.table(gtpfile,header=TRUE)
exp<-read.table("data.TXT",header=TRUE)
# the above two files have columns for individuals and snps (or genes)
for rows
# so the next two lines simply transpose these data matrices
tsnps<-t(snps)
texp<-t(exp)
sink(paste("output.", sep="", Sys.getenv("LSB_JOBINDEX")))
#loops below are hardwired for 5 gene-expression levels (some genes
have two
#probes, and those are treated as separate genes for now) and 100 SNPs.
for (iexp in 1:5){

    for (isnp in 1:100){
    genotype<-factor(tsnps[,isnp])
# make sure there is more than one genotypic class before doing
ANOVA
    if (length(unique(genotype))>1) {

       expression<-texp[,iexp]
       stuff<-anova(lm(expression ~ genotype))
       qq=c(iexp,isnp)
       print (qq)
       print(stuff)

# plot(factor(tsnps[,isnp]),texp[,iexp], xlab="Genotype",
ylab="Expression
  level")
    }
}
}
sink()

Eventually, on the full data set, the size of the two data sets being related to each other will be much larger - several hundred thousand rows in snps, and several hundred rows in exp.

Clearly, the genotype<-factor line is being calculated repeatedly, and in the wrong place; it should be done en masse up front once the data has been read, and I'm sure both loops could actually be expressed some other way.

It seems to me that I ought to be able to calculate all the factors in a statement before the iexp loop, presumably by using apply() or aggregate(), but apply() seems to coerce the result so the results aren't factors any more, and I'm clearly missing something with regard to the way aggregate() works; I really don't understand what I'd need to set the 'by' parameter to.

I apologise if this is a hopelessly naive question, but I've got both the "Introduction to R" and Peter's introductory statistics with R book here, and I'm having some difficulty finding the information I'm after. "Introduction to R" doesn't mention the word "aggregate" once... :-)

Thanks in advance,

Tim

-- 
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5  860B 3CDD 3F56 E313 4233

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Fri Mar 04 19:04:12 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:30:40 EST