**From:** *Arne.Muller@aventis.com*

**Date:** Sat 15 May 2004 - 09:44:55 EST


Message-id: <C80ECAFA2ACC1B45BE45D133ED660ADE010BF1E0@crbsmxsusr04.pharma.aventis.com>

Hello,

I have a problem with a self-written routine that takes a lot of memory (>1.2 GB). Maybe you can suggest some enhancements; I'm pretty sure that my implementation is not optimal.

I'm creating many linear models and storing the coefficients, ANOVA p-values and everything else I need in separate lists, which are finally returned together in one list (a list of lists).

The input is a matrix with 84 columns and >100,000 rows. The routine probeDf below creates a data frame that assigns the 84 columns to the different factors, not just for one row but for all the rows that emat[which(rows == g),] returns, and a new factor ('probe') is generated to tell those rows apart. For 16 rows this results in a 1344 by 6 data frame (84 * 16 = 1344).

Example data frame returned by probeDf:

Value batch time dose array probe

1 2.317804 NEW 24h 000mM 1 1

2 2.495390 NEW 24h 000mM 2 1

3 2.412247 NEW 24h 000mM 3 1

...

144 8.851469 OLD 04h 100mM 60 2

145 8.801430 PRG 24h 000mM 61 2

146 8.308224 PRG 24h 000mM 62 2

...

This data frame is not the problem, since it gets generated on the fly per gene and is discarded afterwards (it just takes some time to generate).

Here comes the problematic routine:

### emat: matrix, facts: list of factors, model: formula for lm, contr: optional contrasts

probe.fit <- function(emat, facts, model, contr=NULL)

{

rows <- rownames(emat)

genes <- unique(rows)

l <- length(genes)

### generate proper labels (names) for the anova p-values

difflabels <- attr(terms(model),"term.labels")

aov <- list() # anova p-values for factors + interactions

coef <- list() # lm coefficients

coefp <- list() # p-values for coefficients

rsq <- list() # R-squared of fit

fitted <- list() # fitted values

value <- list() # orig. values (used with fitted to get residuals)

for ( g in genes ) { # loop over >12,000 genes

### g is the name that identifies 14 to 16 rows in emat

### d is the data frame for the lm

d <- probeDf(emat[which(rows == g),], facts)

fit <- lm(model, data = d, contrasts=contr)

fit.sum <- summary(fit)

aov[[g]] <- as.vector(na.omit(anova(fit)$'Pr(>F)'))

names(aov[[g]]) <- difflabels

coef[[g]] <- coef(fit)[-1]

coefp[[g]] <- coef(fit.sum)[-1,'Pr(>|t|)']

rsq[[g]] <- fit.sum$'r.squared'

value[[g]] <- d$Value

fitted[[g]] <- fitted(fit)

}

list(aov=aov, coefs=coef, coefp=coefp, rsq=rsq,

fitted=fitted, values=value)

}

### create a data frame from a matrix (usually 16 rows and 84 columns)

### and a list of factors. Basically this repeats the factors 16 times

### (for each row in the matrix). This results in a data frame with 84*16

### rows and as many columns as there are factors + 2 (probe factor + value

### to be modeled later)

probeDf <- function(emat, facts) {

df <- NULL

n <- 1

nsamp <- ncol(emat)

for ( i in 1:nrow(emat) ) {

values <- c(t(emat[i,]))

df.new <- data.frame(Value = values, facts, probe = rep(n, nsamp))

n <- n + 1

if ( !is.null(df) ) {

df <- rbind(df, df.new)

} else {

df <- df.new

}

}

df$probe <- as.factor(df$probe)

df

}
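The repeated rbind in probeDf is probably what makes it slow, since every iteration copies the data frame built so far. A minimal sketch of building the same data frame in one step follows. It assumes that facts is a data frame with one row per column of emat; the name probeDf2 is hypothetical and the function has not been tried on the real data.

```r
# Hypothetical vectorized variant of probeDf (name and approach are assumptions).
# emat:  numeric matrix, one row per probe, one column per sample
# facts: data frame of factors, assumed to have ncol(emat) rows
probeDf2 <- function(emat, facts) {
  emat   <- as.matrix(emat)
  nprobe <- nrow(emat)
  nsamp  <- ncol(emat)
  data.frame(
    # row-wise unrolling: all samples of probe 1, then probe 2, ...
    Value = c(t(emat)),
    # repeat the block of factor rows once per probe
    facts[rep(seq_len(nsamp), times = nprobe), , drop = FALSE],
    probe = factor(rep(seq_len(nprobe), each = nsamp)),
    row.names = NULL
  )
}
```

The column order (Value, the factors, probe) matches the example data frame shown earlier, so the model formula should not need to change.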

If I remove coef, coefp, value and fitted from the loop in probe.fit, the memory usage is moderate.

The problem is that each of the 12,000 genes contributes 148 coefficients (the model contains quite a few factors) and as many p-values, and the fitted and value vectors are each >1300 elements long. I couldn't find a more compact form of storage that is still easy to explore afterwards.

Suggestions on how to get this done more efficiently (in terms of memory) are gratefully received.
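One layout that might be more compact is a preallocated numeric matrix per quantity (one row per gene, one column per coefficient) instead of a list of named vectors, so that the coefficient names are stored once rather than once per gene. The sketch below uses made-up data and a made-up model. The gene names, the factors dose and time, and the random values are all assumptions for illustration, and the memory saving has not been measured on the real data.

```r
# Sketch with made-up data: one row per gene, one column per coefficient.
set.seed(1)
genes <- paste0("gene", 1:5)                  # hypothetical gene names
d     <- data.frame(dose = gl(3, 4), time = gl(2, 2, 12))
model <- Value ~ dose + time

# Fit once on dummy values to learn the coefficient names, then preallocate.
d$Value <- rnorm(nrow(d))
cn    <- names(coef(lm(model, data = d)))[-1]
coefs <- matrix(NA_real_, nrow = length(genes), ncol = length(cn),
                dimnames = list(genes, cn))
coefp <- coefs

for (g in genes) {
  d$Value <- rnorm(nrow(d))                   # stands in for probeDf(...)
  est <- coef(summary(lm(model, data = d)))[-1, , drop = FALSE]
  # index by name so coefficients missing from the summary stay NA
  coefs[g, rownames(est)] <- est[, "Estimate"]
  coefp[g, rownames(est)] <- est[, "Pr(>|t|)"]
}
```

A single gene can still be looked up by name, for example coefs["gene3", ], and a single coefficient across all genes is one column. The fitted and original values would need NA padding in such a matrix, because genes with 14 to 16 rows give vectors of different lengths.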

kind regards,

Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com
