[R] loop over large dataset

From: Federico Calboli <f.calboli_at_imperial.ac.uk>
Date: Fri 01 Jul 2005 - 21:31:48 EST

Hi All,

I'd like to ask for a few clarifications. I am doing some calculations over some biggish datasets. One has ~ 23000 rows, and 6 columns, the other has ~620000 rows and 6 columns.

I am using these datasets to perform a simulation of of haplotype coalescence over a pedigree (the datestes themselves are pedigree information). I created a new dataset (same number of rows as the pedigree dataset, 2 colums) and I use a looping functions to assign haplotypes according to a standrd biological reprodictive process (i.e. meiosis, sexual reproduction).

My code is someting like:

  off = function(sire, dam){ # simulation of reproduction, two inds   sch.toll = round(runif(1, min = 1, max = 2))   dch.toll = round(runif(1, min = 1, max = 2))   s.gam = sire[,sch.toll]
  d.gam = dam[,dch.toll]
  offspring = cbind(s.gam,d.gam)
# offspring

for (i in 1:dim(new)[1]){
if(ped[i,3] != 0 & ped[i,5] != 0){
zz = off(as.matrix(t(new[ped[i,3],])),as.matrix(t(new[ped[i,5],]))) new[i,1] = zz[1]
new[i,2] = zz[2]

I am also attribution a generation index to each row with a trivial calulation:

for(i in atres){
  genz[i] = (genz[ped[i,3]] + genz[ped[i,5]])/2 + 1   #print(genz[i])

My question then. On the 23000 rows dataset the calculations take about 5 minutes. On the 620000 rows one I kill the process after ~24 hours, and the the job is not finished. Why such immense discrepancy in execution times (the code is the same, the datasets are stored in two separate .RData files)?

Any light would be appreciated.


PS I am running R 2.1.0 on Debian Sarge, on a Dual 3 GHz Xeon machine with 2 gig RAM. The R process uses 99% of the CPU, but hardly any RAM for what I gather from top.

Federico C. F. Calboli
Department of Epidemiology and Public Health
Imperial College, St Mary's Campus
Norfolk Place, London W2 1PG

Tel  +44 (0)20 7594 1602     Fax (+44) 020 7594 3193

f.calboli [.a.t] imperial.ac.uk
f.calboli [.a.t] gmail.com

R-help@stat.math.ethz.ch mailing list
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Received on Fri Jul 01 21:41:54 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:33:08 EST