RE: [R] R versus SAS: lm performance

About this list Date view Thread view Subject view Author view Attachment view

From: Prof Brian Ripley (ripley@stats.ox.ac.uk)
Date: Tue 11 May 2004 - 22:15:42 EST


Message-id: <Pine.LNX.4.44.0405111304580.23940-100000@gannet.stats>

BTW, I forgot to mention that using multiple lhs's (but not all 12,000 at
once) in a call to lm will help a lot: you may well be able to do 1000
fits in the same time as one, and that may be sufficient speed-up for you.

Can you tell us how big the model matrix X is, and how much singularity
there is (since you set singular = TRUE)?

My guess is that you have a 1344 x p matrix X for large p, and it is the
QR decomposition of X which is taking the time. Both R and S-PLUS use
LINPACK, and it is possibly much faster to use LAPACK with a tuned BLAS.
Since R contains a LAPACK-based QR decomposition, we can probably write an
alternative to lm.fit which would be much faster. However, the exact way
that singularities are handled in lm.fit would be hard to reproduce, which
is why we still use LINPACK there.

On Tue, 11 May 2004 Arne.Muller@aventis.com wrote:

> Hello,
>
> thanks for your reply. I've now done the profiling, and I interpret that the most time is spend in the fortran routine(s):
>
> Each sample represents 0.02 seconds.
> Total run time: 920.219999999453 seconds.
>
> Total seconds: time spent in function and callees.
> Self seconds: time spent in function alone.
>
> % total % self
> total seconds self seconds name
> 100.00 920.22 0.02 0.16 "lm"
> 99.96 919.88 0.10 0.88 "lm.fit"
> 99.74 917.84 99.74 917.84 ".Fortran"
> 0.07 0.66 0.02 0.14 "storage.mode<-"
> 0.06 0.52 0.00 0.00 "eval"
> 0.06 0.52 0.04 0.34 "as.double"
> 0.02 0.22 0.02 0.22 "colnames<-"
> 0.02 0.20 0.02 0.20 "structure"
> 0.02 0.18 0.02 0.18 "model.matrix.default"
> 0.02 0.18 0.02 0.18 "as.double.default"
> 0.02 0.18 0.00 0.00 "model.matrix"
> 0.01 0.08 0.01 0.08 "list"
>
> % self % total
> self seconds total seconds name
> 99.74 917.84 99.74 917.84 ".Fortran"
> 0.10 0.88 99.96 919.88 "lm.fit"
> 0.04 0.34 0.06 0.52 "as.double"
> 0.02 0.22 0.02 0.22 "colnames<-"
> 0.02 0.20 0.02 0.20 "structure"
> 0.02 0.18 0.02 0.18 "as.double.default"
> 0.02 0.18 0.02 0.18 "model.matrix.default"
> 0.02 0.16 100.00 920.22 "lm"
> 0.02 0.14 0.07 0.66 "storage.mode<-"
> 0.01 0.08 0.01 0.08 "list"
>
> I guess this actually means I cannot do anything about it ... other than maybe splitting the problem into different (independaent parts - which I actually may be able to).
>
> Regarding the usage of lm.fit instead of lm, this might be a good idea, since I am using the same model.matrix for all fits! However, I'd need to recreate an lm object from the output, because I'd like to run the anova function on this. I'll first do some profiling on lm versus lm.fit for the 12,000 models ...
>
> kind regards + thanks again for your help,
>
> Arne
>
> --
> Arne Muller, Ph.D.
> Toxicogenomics, Aventis Pharma
> arne dot muller domain=aventis com
>
> > -----Original Message-----
> > From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk]
> > Sent: 11 May 2004 09:08
> > To: Muller, Arne PH/FR
> > Cc: r-help@stat.math.ethz.ch
> > Subject: Re: [R] R versus SAS: lm performance
> >
> >
> > The way to time things in R is system.time().
> >
> > Without knowing much more about your problem we can only
> > guess where R is
> > spending the time. But you can find out by profiling -- see
> > `Writing R
> > Extensions'.
> >
> > If you want multiple fits with the same design matrix (do you?) you
> > could look at the code of lm and call lm.fit repeatedly yourself.
> >
> > On Mon, 10 May 2004 Arne.Muller@aventis.com wrote:
> >
> > > Hello,
> > >
> > > A collegue of mine has compared the runtime of a linear
> > model + anova in SAS and S+. He got the same results, but SAS
> > took a bit more than a minute whereas S+ took 17 minutes.
> > I've tried it in R (1.9.0) and it took 15 min. Neither
> > machine run out of memory, and I assume that all machines
> > have similar hardware, but the S+ and SAS machines are on
> > windows whereas the R machine is Redhat Linux 7.2.
> > >
> > > My question is if I'm doing something wrong (technically)
> > calling the lm routine, or (if not), how I can optimize the
> > call to lm or even using an alternative to lm. I'd like to
> > run about 12,000 of these models in R (for a gene expression
> > experiment - one model per gene, which would take far too long).
> > >
> > > I've run the follwong code in R (and S+):
> > >
> > > > options(contrasts=c('contr.helmert', 'contr.poly'))
> > >
> > > The 1st colum is the value to be modeled, and the others
> > are factors.
> > >
> > > > names(df.gene1data) <- c("Va", "Ba", "Ti", "Do", "Ar", "Pr")
> > > > df[c(1:2,1343:1344),]
> > > Va Do Ti Ba Ar Pr
> > > 1 2.317804 000mM 24h NEW 1 1
> > > 2 2.495390 000mM 24h NEW 2 1
> > > 8315 2.979641 025mM 04h PRG 83 16
> > > 8415 4.505787 000mM 04h PRG 84 16
> > >
> > > this is a dataframe with 1344 rows.
> > >
> > > x <- Sys.time();
> > > wlm <- lm(Va ~
> > >
> > Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+Ba:Ti:Do+Ba:Ti
> > :Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+(Ba:Ti:Do)/Ar, data=df, singular=T);
> > > difftime(Sys.time(), x)
> > >
> > > Time difference of 15.33333 mins
> > >
> > > > anova(wlm)
> > > Analysis of Variance Table
> > >
> > > Response: Va
> > > Df Sum Sq Mean Sq F value Pr(>F)
> > > Ba 2 0.1 0.1 0.4262 0.653133
> > > Ti 1 2.6 2.6 16.5055 5.306e-05 ***
> > > Do 4 6.8 1.7 10.5468 2.431e-08 ***
> > > Pr 15 5007.4 333.8 2081.8439 < 2.2e-16 ***
> > > Ba:Ti 2 3.2 1.6 9.8510 5.904e-05 ***
> > > Ba:Do 7 2.8 0.4 2.5054 0.014943 *
> > > Ba:Pr 30 80.6 2.7 16.7585 < 2.2e-16 ***
> > > Ti:Do 4 8.7 2.2 13.5982 9.537e-11 ***
> > > Ti:Pr 15 2.4 0.2 1.0017 0.450876
> > > Do:Pr 60 10.2 0.2 1.0594 0.358551
> > > Ba:Ti:Do 7 1.4 0.2 1.2064 0.296415
> > > Ba:Ti:Pr 30 5.6 0.2 1.1563 0.259184
> > > Ba:Do:Pr 105 14.2 0.1 0.8445 0.862262
> > > Ti:Do:Pr 60 14.8 0.2 1.5367 0.006713 **
> > > Ba:Ti:Do:Pr 105 15.8 0.2 0.9382 0.653134
> > > Ba:Ti:Do:Ar 56 26.4 0.5 2.9434 2.904e-11 ***
> > > Residuals 840 134.7 0.2
> > >
> > > The corresponding SAS program from my collegue is:
> > >
> > > proc glm data = "the name of the data set";
> > >
> > > class B T D A P;
> > >
> > > model V = B T D P B*T B*D B*P T*D T*P D*P B*T*D B*T*P B*D*P
> > T*D*P B*T*D*P A(B*T*D);
> > >
> > > run;
> > >
> > > Note, V = Va, B = Ba, T = Ti, D = Do, P = Pr, A = Ar of the
> > R-example
> >
> > --
> > Brian D. Ripley, ripley@stats.ox.ac.uk
> > Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> > University of Oxford, Tel: +44 1865 272861 (self)
> > 1 South Parks Road, +44 1865 272866 (PA)
> > Oxford OX1 3TG, UK Fax: +44 1865 272595
> >
> >
>
>

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________ R-help@stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


About this list Date view Thread view Subject view Author view Attachment view

This archive was generated by hypermail 2.1.3 : Mon 31 May 2004 - 23:05:09 EST