Re: [R] loop over large dataset

From: Liaw, Andy <andy_liaw_at_merck.com>
Date: Fri 01 Jul 2005 - 22:12:12 EST


My suggestion is that you try to vectorize the computation as much as you can.

>From what you've shown, `new' and `ped' need to have the same number of
rows, right?

Your `off' function seems to be randomly choosing between columns 1 and 2 from its two input matrices (one row each?). You may want to do the sampling all at once instead of looping over the rows. E.g.,

> (m <- matrix(1:10, ncol=2))

     [,1] [,2]

[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

> (colSample <- sample(1:2, nrow(m), replace=TRUE))
[1] 1 1 2 1 1
> (x <- m[cbind(1:nrow(m), colSample)])
[1] 1 2 8 4 5

So you might want to do something like (obviously untested):

todo <- ped[,3] * ped[,5] != 0 ## indicator of which rows to work on n.todo <- sum(todo) ## how many are there? sire <- new[ped[todo, 3], ]
dam <- new[ped[todo, 5], ]
s.gam <- sire[1:nrow(sire), sample(1:2, nrow(sire), replace=TRUE)] d.gam <- dam[1:nrow(dam), sample(1:2, nrow(dam), replace=TRUE)] new[todo, 1:2] <- cbind(s.gam, d.gam)

Andy

> From: Federico Calboli
>
> Hi All,
>
> I'd like to ask for a few clarifications. I am doing some calculations
> over some biggish datasets. One has ~ 23000 rows, and 6 columns, the
> other has ~620000 rows and 6 columns.
>
> I am using these datasets to perform a simulation of of haplotype
> coalescence over a pedigree (the datestes themselves are pedigree
> information). I created a new dataset (same number of rows as the
> pedigree dataset, 2 colums) and I use a looping functions to assign
> haplotypes according to a standrd biological reprodictive
> process (i.e.
> meiosis, sexual reproduction).
>
> My code is someting like:
>
> off = function(sire, dam){ # simulation of reproduction, two inds
> sch.toll = round(runif(1, min = 1, max = 2))
> dch.toll = round(runif(1, min = 1, max = 2))
> s.gam = sire[,sch.toll]
> d.gam = dam[,dch.toll]
> offspring = cbind(s.gam,d.gam)
> # offspring
> }
>
> for (i in 1:dim(new)[1]){
> if(ped[i,3] != 0 & ped[i,5] != 0){
> zz = off(as.matrix(t(new[ped[i,3],])),as.matrix(t(new[ped[i,5],])))
> new[i,1] = zz[1]
> new[i,2] = zz[2]
> }
> }
>
> I am also attribution a generation index to each row with a trivial
> calulation:
>
> for(i in atres){
> genz[i] = (genz[ped[i,3]] + genz[ped[i,5]])/2 + 1
> #print(genz[i])
> }
>
> My question then. On the 23000 rows dataset the calculations
> take about
> 5 minutes. On the 620000 rows one I kill the process after ~24 hours,
> and the the job is not finished. Why such immense discrepancy in
> execution times (the code is the same, the datasets are stored in two
> separate .RData files)?
>
> Any light would be appreciated.
>
> Federico
>
> PS I am running R 2.1.0 on Debian Sarge, on a Dual 3 GHz Xeon machine
> with 2 gig RAM. The R process uses 99% of the CPU, but hardly any RAM
> for what I gather from top.
>
>
>
> --
> Federico C. F. Calboli
> Department of Epidemiology and Public Health
> Imperial College, St Mary's Campus
> Norfolk Place, London W2 1PG
>
> Tel +44 (0)20 7594 1602 Fax (+44) 020 7594 3193
>
> f.calboli [.a.t] imperial.ac.uk
> f.calboli [.a.t] gmail.com
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
>
>
>



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Fri Jul 01 22:17:48 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:33:08 EST