RE: [R] R versus SAS: lm performance

About this list Date view Thread view Subject view Author view Attachment view

From: Liaw, Andy (andy_liaw@merck.com)
Date: Tue 11 May 2004 - 22:19:40 EST


Message-id: <3A822319EB35174CA3714066D590DCD504AF7D6D@usrymx25.merck.com>

I tried the following on an Opteron 248, R-1.9.0 w/Goto's BLAS:

> y <- matrix(rnorm(14000*1344), 1344)
> x <- matrix(runif(1344*503),1344)
> system.time(fit <- lm(y~x))
[1] 106.00 55.60 265.32 0.00 0.00

The resulting fit object is over 600MB. (The coefficient compoent is a 504
x 14000 matrix.)

If I'm not mistaken, SAS sweeps on the extended cross product matrix to fit
regression models. That, I believe, in usually faster than doing QR
decomposition on the model matrix itself, but there are trade-offs. You
could try what Prof. Bates suggested.

Andy

> From: Arne.Muller@aventis.com
>
> Hello,
>
> thanks for your reply. I've now done the profiling, and I
> interpret that the most time is spend in the fortran routine(s):
>
> Each sample represents 0.02 seconds.
> Total run time: 920.219999999453 seconds.
>
> Total seconds: time spent in function and callees.
> Self seconds: time spent in function alone.
>
> % total % self
> total seconds self seconds name
> 100.00 920.22 0.02 0.16 "lm"
> 99.96 919.88 0.10 0.88 "lm.fit"
> 99.74 917.84 99.74 917.84 ".Fortran"
> 0.07 0.66 0.02 0.14 "storage.mode<-"
> 0.06 0.52 0.00 0.00 "eval"
> 0.06 0.52 0.04 0.34 "as.double"
> 0.02 0.22 0.02 0.22 "colnames<-"
> 0.02 0.20 0.02 0.20 "structure"
> 0.02 0.18 0.02 0.18 "model.matrix.default"
> 0.02 0.18 0.02 0.18 "as.double.default"
> 0.02 0.18 0.00 0.00 "model.matrix"
> 0.01 0.08 0.01 0.08 "list"
>
> % self % total
> self seconds total seconds name
> 99.74 917.84 99.74 917.84 ".Fortran"
> 0.10 0.88 99.96 919.88 "lm.fit"
> 0.04 0.34 0.06 0.52 "as.double"
> 0.02 0.22 0.02 0.22 "colnames<-"
> 0.02 0.20 0.02 0.20 "structure"
> 0.02 0.18 0.02 0.18 "as.double.default"
> 0.02 0.18 0.02 0.18 "model.matrix.default"
> 0.02 0.16 100.00 920.22 "lm"
> 0.02 0.14 0.07 0.66 "storage.mode<-"
> 0.01 0.08 0.01 0.08 "list"
>
> I guess this actually means I cannot do anything about it ...
> other than maybe splitting the problem into different
> (independaent parts - which I actually may be able to).
>
> Regarding the usage of lm.fit instead of lm, this might be a
> good idea, since I am using the same model.matrix for all
> fits! However, I'd need to recreate an lm object from the
> output, because I'd like to run the anova function on this.
> I'll first do some profiling on lm versus lm.fit for the
> 12,000 models ...
>
> kind regards + thanks again for your help,
>
> Arne
>
> --
> Arne Muller, Ph.D.
> Toxicogenomics, Aventis Pharma
> arne dot muller domain=aventis com
>
> > -----Original Message-----
> > From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]
> > Sent: 11 May 2004 09:08
> > To: Muller, Arne PH/FR
> > Cc: r-help@stat.math.ethz.ch
> > Subject: Re: [R] R versus SAS: lm performance
> >
> >
> > The way to time things in R is system.time().
> >
> > Without knowing much more about your problem we can only
> > guess where R is
> > spending the time. But you can find out by profiling -- see
> > `Writing R
> > Extensions'.
> >
> > If you want multiple fits with the same design matrix (do you?) you
> > could look at the code of lm and call lm.fit repeatedly yourself.
> >
> > On Mon, 10 May 2004 Arne.Muller@aventis.com wrote:
> >
> > > Hello,
> > >
> > > A collegue of mine has compared the runtime of a linear
> > model + anova in SAS and S+. He got the same results, but SAS
> > took a bit more than a minute whereas S+ took 17 minutes.
> > I've tried it in R (1.9.0) and it took 15 min. Neither
> > machine run out of memory, and I assume that all machines
> > have similar hardware, but the S+ and SAS machines are on
> > windows whereas the R machine is Redhat Linux 7.2.
> > >
> > > My question is if I'm doing something wrong (technically)
> > calling the lm routine, or (if not), how I can optimize the
> > call to lm or even using an alternative to lm. I'd like to
> > run about 12,000 of these models in R (for a gene expression
> > experiment - one model per gene, which would take far too long).
> > >
> > > I've run the follwong code in R (and S+):
> > >
> > > > options(contrasts=c('contr.helmert', 'contr.poly'))
> > >
> > > The 1st colum is the value to be modeled, and the others
> > are factors.
> > >
> > > > names(df.gene1data) <- c("Va", "Ba", "Ti", "Do", "Ar", "Pr")
> > > > df[c(1:2,1343:1344),]
> > > Va Do Ti Ba Ar Pr
> > > 1 2.317804 000mM 24h NEW 1 1
> > > 2 2.495390 000mM 24h NEW 2 1
> > > 8315 2.979641 025mM 04h PRG 83 16
> > > 8415 4.505787 000mM 04h PRG 84 16
> > >
> > > this is a dataframe with 1344 rows.
> > >
> > > x <- Sys.time();
> > > wlm <- lm(Va ~
> > >
> > Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+Ba:Ti:Do+Ba:Ti
> > :Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+(Ba:Ti:Do)/Ar, data=df,
> singular=T);
> > > difftime(Sys.time(), x)
> > >
> > > Time difference of 15.33333 mins
> > >
> > > > anova(wlm)
> > > Analysis of Variance Table
> > >
> > > Response: Va
> > > Df Sum Sq Mean Sq F value Pr(>F)
> > > Ba 2 0.1 0.1 0.4262 0.653133
> > > Ti 1 2.6 2.6 16.5055 5.306e-05 ***
> > > Do 4 6.8 1.7 10.5468 2.431e-08 ***
> > > Pr 15 5007.4 333.8 2081.8439 < 2.2e-16 ***
> > > Ba:Ti 2 3.2 1.6 9.8510 5.904e-05 ***
> > > Ba:Do 7 2.8 0.4 2.5054 0.014943 *
> > > Ba:Pr 30 80.6 2.7 16.7585 < 2.2e-16 ***
> > > Ti:Do 4 8.7 2.2 13.5982 9.537e-11 ***
> > > Ti:Pr 15 2.4 0.2 1.0017 0.450876
> > > Do:Pr 60 10.2 0.2 1.0594 0.358551
> > > Ba:Ti:Do 7 1.4 0.2 1.2064 0.296415
> > > Ba:Ti:Pr 30 5.6 0.2 1.1563 0.259184
> > > Ba:Do:Pr 105 14.2 0.1 0.8445 0.862262
> > > Ti:Do:Pr 60 14.8 0.2 1.5367 0.006713 **
> > > Ba:Ti:Do:Pr 105 15.8 0.2 0.9382 0.653134
> > > Ba:Ti:Do:Ar 56 26.4 0.5 2.9434 2.904e-11 ***
> > > Residuals 840 134.7 0.2
> > >
> > > The corresponding SAS program from my collegue is:
> > >
> > > proc glm data = "the name of the data set";
> > >
> > > class B T D A P;
> > >
> > > model V = B T D P B*T B*D B*P T*D T*P D*P B*T*D B*T*P B*D*P
> > T*D*P B*T*D*P A(B*T*D);
> > >
> > > run;
> > >
> > > Note, V = Va, B = Ba, T = Ti, D = Do, P = Pr, A = Ar of the
> > R-example
> >
> > --
> > Brian D. Ripley, ripley@stats.ox.ac.uk
> > Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> > University of Oxford, Tel: +44 1865 272861 (self)
> > 1 South Parks Road, +44 1865 272866 (PA)
> > Oxford OX1 3TG, UK Fax: +44 1865 272595
> >
> >
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
>
>

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


About this list Date view Thread view Subject view Author view Attachment view

This archive was generated by hypermail 2.1.3 : Mon 31 May 2004 - 23:05:09 EST