Re: [R] efficient code - yet another question

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Thu, 01 May 2008 08:52:50 +0100 (BST)

Some comments

  1. 'Writing R Extensions' has a chapter about looking into bottlenecks ('profiling'). Had you given a working example, I might have shown you some results.

One thing which is clearly suboptimal is to grow components (in your case by cbind) rather than allocate them initially and assign the column needed.

2) You are using matrix algebra extensively, and for that you want to use an optimized BLAS. We provide such on CRAN for Windows' users, ready for download. This can speed up crossprod() and tcrossprod() severalfold (amd more on a platform where threaded BLAS and multi-core CPUS are available).

3) I've referred to 'S Programming' already this morning. In its chapter on efficiency, it shows that for some problems writing C or Fortran is the way to get real efficiency. Yours would appear to be one in which writing Fortran to call LAPACK routines to do this would take only a few minutes.

On Wed, 30 Apr 2008, steven wilson wrote:

> Dear list members;
>
> The code given below corresponds to the PCA-NIPALS (principal
> component analysis) algorithm adapted from the nipals function in the
> package chemometrics. The reason for using NIPALS instead of SVD is
> the ability of this algorithm to handle missing values, but that's a
> different story. I've been trying to find a way to improve (if
> possible) the efficiency of the code, especially when the number of
> components to calculate is higher than 100. I've been working with a

> data set of 500 rows x 700 variables. The code gets really slow when
> the number of PC to calculate is for example 600 (why that number of
> components?....because I need them to detect outliers using another
> algorithm). In my old laptop running Win XP and R 2.7.0 (1GB RAM) it
> takes around 6 or 7 minutes. That shouldn't be a problem for one
> analysis, but when cross-validation is added the time increases
> severely.....Although there are several examples on the R help list
> giving some with 'hints' to improve effciency the truth is that I
> don't know (or I can't see it) the part of the code that can be
> improved (but it is clear that the bottle neck is the for and while
> loops). I would really appreciate any ideas/comments/etc...
>
> Thanks
>
> #################################################################
>
> pca.nipals <- function(X, ncomp, iter = 50, toler = sqrt(.Machine$double.eps))
> # X...data matrix, ncomp...number of components,
> # iter...maximal number of iterations per component,
> # toler...precision tolerance for calculation of components
> {
>
> Xh <- scale(X, center = TRUE, scale = FALSE)
> nr <- 0
> T <- NULL; P <- NULL # matrix of scores and loadings
> for (h in 1:ncomp)
> {
> th <- Xh[, 1]
> ende <- FALSE
> while (!ende)
> {
> nr <- nr + 1
> ph <- t(crossprod(th, Xh) * as.vector(1 /
> crossprod(th)))
> ph <- ph * as.vector(1/sqrt(crossprod(ph)))
> thnew <- t(tcrossprod(t(ph), Xh) *
> as.vector(1/(crossprod(ph))))
> prec <- crossprod(th-thnew)
> th <- thnew
> if (prec <= (tol^2)) ende <- TRUE
> if (it <= nr) ende <- TRUE # didn't converge
> }
>
> Xh <- Xh - tcrossprod(th)
> T <- cbind(T, th); P <- cbind(P, ph)
> nr <- 0
> }
> list(T = T, P = P)
> }
>
> ______________________________________________
> R-help_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Received on Thu 01 May 2008 - 08:02:54 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 01 May 2008 - 08:30:34 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-help. Please read the posting guide before posting to the list.

list of date sections of archive