Re: [R] speeding up loop and dealing with memory problems

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon, 28 Jul 2008 15:42:45 +0100 (BST)

We were not told this was a matrix, rather a 'dataset'.

If it is a matrix, logical indexing via is.na(x) is pretty good, although it will create a logical index with as many elements as the dataset.

If 'dataset' means a data frame, you will use less memory using a for() loop over columns, e.g.

for(i in seq_along(x)) x[[i]][is.na(x[[i]])] <- 0

and I suspect this approach would also use less memory for a matrix.
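As a minimal sketch of the column-wise idea, the following applies it to both a data frame and a matrix. The objects df and m are small hypothetical examples, not the poster's data, and the matrix loop is only one way the same approach might be written.

```r
# Hypothetical toy data for illustration only.
df <- data.frame(a = c(1, NA, 3), b = c(NA, 2, NA), c = c(7, 8, 9))

# Data frame: replace NAs one column at a time, as in the one-liner above.
for (i in seq_along(df)) df[[i]][is.na(df[[i]])] <- 0

# Matrix: index one column at a time, so each logical index built by
# is.na() has nrow(m) elements rather than length(m).
m <- matrix(c(1, NA, 3, NA, 2, NA), nrow = 3)
for (j in seq_len(ncol(m))) m[is.na(m[, j]), j] <- 0
```

Either loop runs once per column rather than once per cell, so the interpreted loop overhead stays small even for hundreds of thousands of rows.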

A quick check with 40 columns of 0.5m rows showed that Jim's approach needed 860Mb, whereas mine needed 460Mb (and the object is 150Mb).
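The object size quoted above can be roughly reproduced by arithmetic, assuming the columns are stored as doubles (an assumption; the check did not state the column type). The variable names below are illustrative only. The sketch also compares the size of a whole-object logical index with a single-column one.

```r
bytes_per_double  <- 8   # storage for one numeric (double) value in R
bytes_per_logical <- 4   # R stores each logical value in 4 bytes

n_rows <- 5e5            # 0.5m rows
n_cols <- 40

# Size of the data itself, in Mb (2^20 bytes): about 153
object_mb <- n_rows * n_cols * bytes_per_double / 2^20

# Logical index from is.na() over the whole object: about 76
full_index_mb <- n_rows * n_cols * bytes_per_logical / 2^20

# Logical index for a single column: about 1.9
column_index_mb <- n_rows * bytes_per_logical / 2^20
```

This accounts only for the data and the index; the measured totals also include temporary copies made during the assignment, which the arithmetic does not try to model.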

On Mon, 28 Jul 2008, jim holtman wrote:

> If your matrix is 835353x86 and numeric, it will take about 550MB for
> a single copy. You should therefore have at least 2GB
> (so you can have a couple of copies as part of some processing) of
> real memory on your system. If you want to replace NAs with zero,
> then this is how you might do it with 'vectorization':
>
>> x
>      [,1] [,2] [,3] [,4] [,5] [,6]
> [1,]    1   NA   NA    2    1    2
> [2,]    2    2    2   NA    2    2
> [3,]    2    2   NA   NA    1    2
> [4,]   NA    1    2    1    2    1
> [5,]    1    1   NA    2   NA   NA
> [6,]   NA    1   NA    1    2   NA
>> x[is.na(x)] <- 0
>> x
>      [,1] [,2] [,3] [,4] [,5] [,6]
> [1,]    1    0    0    2    1    2
> [2,]    2    2    2    0    2    2
> [3,]    2    2    0    0    1    2
> [4,]    0    1    2    1    2    1
> [5,]    1    1    0    2    0    0
> [6,]    0    1    0    1    2    0
>
> Maybe you should read the Intro To R to understand how vectorization works.
>
> Same way with your last loop:
>
> x[is.na(x[,4]), 4] <- 0
>
>
>
>
> On Mon, Jul 28, 2008 at 9:15 AM, Denise Xifara
> <dionysia-kiara.xifaras_at_st-hildas.ox.ac.uk> wrote:
>> Dear All and Mark,
>>
>> Given a dataset that I have called dat, I was hoping to speed up the
>> following loop:
>>
>> for(i in 1:835353){
>> for(j in 1:86){
>> if (is.na(dat[i,j])==TRUE){dat[i,j]<-0 }}}
>> Actually I am also having a memory problem. I get the following:
>>
>> Error: cannot allocate vector of size 3.2 Mb
>> In addition: Warning messages:
>> 1: In dat[i, j] <- 0 :
>> Reached total allocation of 1535Mb: see help(memory.size)
>> 2: In dat[i, j] <- 0 :
>> Reached total allocation of 1535Mb: see help(memory.size)
>> 3: In dat[i, j] <- 0 :
>> Reached total allocation of 1535Mb: see help(memory.size)
>> 4: In dat[i, j] <- 0 :
>> Reached total allocation of 1535Mb: see help(memory.size)
>>
>> If I try and apply the loop just to a particular column, rather than the
>> whole dataset, so that I don't have the memory problem, i.e.
>>
>> for(i in 1:835353){
>> if (is.na(dat[i,4])==TRUE){dat[i,4]<-0 }}
>>
>> it takes ridiculously long to process, so I was hoping that there would be a
>> quicker way to do this.
>>
>> Thank you all very much for the help,
>> Denise
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?

-- 
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Received on Mon 28 Jul 2008 - 15:02:25 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 28 Jul 2008 - 15:32:42 GMT.
