RE: [R] R versus SAS: lm performance

About this list Date view Thread view Subject view Author view Attachment view

From: Arne.Muller@aventis.com
Date: Wed 12 May 2004 - 03:50:12 EST


Message-id: <C80ECAFA2ACC1B45BE45D133ED660ADE010BF1D7@crbsmxsusr04.pharma.aventis.com>

Thanks All, for your help. There seems to be a lot I can try to speed up the fits. However, I'd like to go for a much simpler model which I think is justified by the experiment itself, e.g; I may think about removing the nestinh "(Ba:Ti:Do)/Ar".

The model matrix has 1344 rows and 2970 columns, and the rank of the matrix is 504. Therefore I think I should reformulate the model.

I was just stroke my the massive difference in performance when my collegue told me about the difference between SAS and S+.

        kind regards,

        Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com

> -----Original Message----- > From: Liaw, Andy [mailto:andy_liaw@merck.com] > Sent: 11 May 2004 14:20 > To: Muller, Arne PH/FR; ripley@stats.ox.ac.uk > Cc: r-help@stat.math.ethz.ch > Subject: RE: [R] R versus SAS: lm performance > > > I tried the following on an Opteron 248, R-1.9.0 w/Goto's BLAS: > > > y <- matrix(rnorm(14000*1344), 1344) > > x <- matrix(runif(1344*503),1344) > > system.time(fit <- lm(y~x)) > [1] 106.00 55.60 265.32 0.00 0.00 > > The resulting fit object is over 600MB. (The coefficient > compoent is a 504 > x 14000 matrix.) > > If I'm not mistaken, SAS sweeps on the extended cross product > matrix to fit > regression models. That, I believe, in usually faster than doing QR > decomposition on the model matrix itself, but there are > trade-offs. You > could try what Prof. Bates suggested. > > Andy > > > From: Arne.Muller@aventis.com > > > > Hello, > > > > thanks for your reply. I've now done the profiling, and I > > interpret that the most time is spend in the fortran routine(s): > > > > Each sample represents 0.02 seconds. > > Total run time: 920.219999999453 seconds. > > > > Total seconds: time spent in function and callees. > > Self seconds: time spent in function alone. > > > > % total % self > > total seconds self seconds name > > 100.00 920.22 0.02 0.16 "lm" > > 99.96 919.88 0.10 0.88 "lm.fit" > > 99.74 917.84 99.74 917.84 ".Fortran" > > 0.07 0.66 0.02 0.14 "storage.mode<-" > > 0.06 0.52 0.00 0.00 "eval" > > 0.06 0.52 0.04 0.34 "as.double" > > 0.02 0.22 0.02 0.22 "colnames<-" > > 0.02 0.20 0.02 0.20 "structure" > > 0.02 0.18 0.02 0.18 "model.matrix.default" > > 0.02 0.18 0.02 0.18 "as.double.default" > > 0.02 0.18 0.00 0.00 "model.matrix" > > 0.01 0.08 0.01 0.08 "list" > > > > % self % total > > self seconds total seconds name > > 99.74 917.84 99.74 917.84 ".Fortran" > > 0.10 0.88 99.96 919.88 "lm.fit" > > 0.04 0.34 0.06 0.52 "as.double" > > 0.02 0.22 0.02 0.22 "colnames<-" > > 0.02 0.20 0.02 0.20 "structure" > > 0.02 0.18 0.02 0.18 "as.double.default" > > 0.02 0.18 0.02 0.18 "model.matrix.default" > > 0.02 0.16 100.00 920.22 "lm" > > 0.02 0.14 0.07 0.66 "storage.mode<-" > > 0.01 0.08 0.01 0.08 "list" > > > > I guess this actually means I cannot do anything about it ... > > other than maybe splitting the problem into different > > (independaent parts - which I actually may be able to). > > > > Regarding the usage of lm.fit instead of lm, this might be a > > good idea, since I am using the same model.matrix for all > > fits! However, I'd need to recreate an lm object from the > > output, because I'd like to run the anova function on this. > > I'll first do some profiling on lm versus lm.fit for the > > 12,000 models ... > > > > kind regards + thanks again for your help, > > > > Arne > > > > -- > > Arne Muller, Ph.D. > > Toxicogenomics, Aventis Pharma > > arne dot muller domain=aventis com > > > > > -----Original Message----- > > > From: Prof Brian Ripley [mailto:ripley@stats.ox.ac.uk] > > > Sent: 11 May 2004 09:08 > > > To: Muller, Arne PH/FR > > > Cc: r-help@stat.math.ethz.ch > > > Subject: Re: [R] R versus SAS: lm performance > > > > > > > > > The way to time things in R is system.time(). > > > > > > Without knowing much more about your problem we can only > > > guess where R is > > > spending the time. But you can find out by profiling -- see > > > `Writing R > > > Extensions'. > > > > > > If you want multiple fits with the same design matrix (do > you?) you > > > could look at the code of lm and call lm.fit repeatedly yourself. > > > > > > On Mon, 10 May 2004 Arne.Muller@aventis.com wrote: > > > > > > > Hello, > > > > > > > > A collegue of mine has compared the runtime of a linear > > > model + anova in SAS and S+. He got the same results, but SAS > > > took a bit more than a minute whereas S+ took 17 minutes. > > > I've tried it in R (1.9.0) and it took 15 min. Neither > > > machine run out of memory, and I assume that all machines > > > have similar hardware, but the S+ and SAS machines are on > > > windows whereas the R machine is Redhat Linux 7.2. > > > > > > > > My question is if I'm doing something wrong (technically) > > > calling the lm routine, or (if not), how I can optimize the > > > call to lm or even using an alternative to lm. I'd like to > > > run about 12,000 of these models in R (for a gene expression > > > experiment - one model per gene, which would take far too long). > > > > > > > > I've run the follwong code in R (and S+): > > > > > > > > > options(contrasts=c('contr.helmert', 'contr.poly')) > > > > > > > > The 1st colum is the value to be modeled, and the others > > > are factors. > > > > > > > > > names(df.gene1data) <- c("Va", "Ba", "Ti", "Do", "Ar", "Pr") > > > > > df[c(1:2,1343:1344),] > > > > Va Do Ti Ba Ar Pr > > > > 1 2.317804 000mM 24h NEW 1 1 > > > > 2 2.495390 000mM 24h NEW 2 1 > > > > 8315 2.979641 025mM 04h PRG 83 16 > > > > 8415 4.505787 000mM 04h PRG 84 16 > > > > > > > > this is a dataframe with 1344 rows. > > > > > > > > x <- Sys.time(); > > > > wlm <- lm(Va ~ > > > > > > > Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+Ba:Ti:Do+Ba:Ti > > > :Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+(Ba:Ti:Do)/Ar, data=df, > > singular=T); > > > > difftime(Sys.time(), x) > > > > > > > > Time difference of 15.33333 mins > > > > > > > > > anova(wlm) > > > > Analysis of Variance Table > > > > > > > > Response: Va > > > > Df Sum Sq Mean Sq F value Pr(>F) > > > > Ba 2 0.1 0.1 0.4262 0.653133 > > > > Ti 1 2.6 2.6 16.5055 5.306e-05 *** > > > > Do 4 6.8 1.7 10.5468 2.431e-08 *** > > > > Pr 15 5007.4 333.8 2081.8439 < 2.2e-16 *** > > > > Ba:Ti 2 3.2 1.6 9.8510 5.904e-05 *** > > > > Ba:Do 7 2.8 0.4 2.5054 0.014943 * > > > > Ba:Pr 30 80.6 2.7 16.7585 < 2.2e-16 *** > > > > Ti:Do 4 8.7 2.2 13.5982 9.537e-11 *** > > > > Ti:Pr 15 2.4 0.2 1.0017 0.450876 > > > > Do:Pr 60 10.2 0.2 1.0594 0.358551 > > > > Ba:Ti:Do 7 1.4 0.2 1.2064 0.296415 > > > > Ba:Ti:Pr 30 5.6 0.2 1.1563 0.259184 > > > > Ba:Do:Pr 105 14.2 0.1 0.8445 0.862262 > > > > Ti:Do:Pr 60 14.8 0.2 1.5367 0.006713 ** > > > > Ba:Ti:Do:Pr 105 15.8 0.2 0.9382 0.653134 > > > > Ba:Ti:Do:Ar 56 26.4 0.5 2.9434 2.904e-11 *** > > > > Residuals 840 134.7 0.2 > > > > > > > > The corresponding SAS program from my collegue is: > > > > > > > > proc glm data = "the name of the data set"; > > > > > > > > class B T D A P; > > > > > > > > model V = B T D P B*T B*D B*P T*D T*P D*P B*T*D B*T*P B*D*P > > > T*D*P B*T*D*P A(B*T*D); > > > > > > > > run; > > > > > > > > Note, V = Va, B = Ba, T = Ti, D = Do, P = Pr, A = Ar of the > > > R-example > > > > > > -- > > > Brian D. Ripley, ripley@stats.ox.ac.uk > > > Professor of Applied Statistics, > http://www.stats.ox.ac.uk/~ripley/ > > > University of Oxford, Tel: +44 1865 272861 (self) > > > 1 South Parks Road, +44 1865 272866 (PA) > > > Oxford OX1 3TG, UK Fax: +44 1865 272595 > > > > > > > > > > ______________________________________________ > > R-help@stat.math.ethz.ch mailing list > > https://www.stat.math.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide! > > http://www.R-project.org/posting-guide.html > > > > > > > -------------------------------------------------------------- > ---------------- > Notice: This e-mail message, together with any attachments, > contains information of Merck & Co., Inc. (One Merck Drive, > Whitehouse Station, New Jersey, USA 08889), and/or its > affiliates (which may be known outside the United States as > Merck Frosst, Merck Sharp & Dohme or MSD and in Japan, as > Banyu) that may be confidential, proprietary copyrighted > and/or legally privileged. It is intended solely for the use > of the individual or entity named on this message. If you > are not the intended recipient, and have received this > message in error, please notify us immediately by reply > e-mail and then delete it from your system. > -------------------------------------------------------------- > ---------------- >

______________________________________________ R-help@stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


About this list Date view Thread view Subject view Author view Attachment view

This archive was generated by hypermail 2.1.3 : Mon 31 May 2004 - 23:05:09 EST