Re: [R] Memory leak with character arrays?

From: jim holtman <jholtman_at_gmail.com>
Date: Thu 18 Jan 2007 - 01:52:51 GMT

What does the FASTA header look like? You are using 'gene' to index into the vectors, and if (for example) 'gene' is a 10-character string, then for each of the vectors indexed this way (I count about 4-5) you are going to use at least 550 * 6000 * 5 * 10 more bytes (165MB) just to store the names of the elements.
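To get a feel for what the names cost, you could compare a named and an unnamed vector with object.size(). This is only a sketch: the gene names below are made up (10 characters each), and the exact byte counts will depend on the R version and platform.

```r
# Hypothetical data: 6000 made-up gene names, each 10 characters long.
n.genes  <- 6000
gene.ids <- sprintf("GENE%06d", seq_len(n.genes))

plain <- integer(n.genes)      # no names attribute
named <- plain
names(named) <- gene.ids       # same data plus a names attribute

# object.size() reports an estimate of the memory used by each object.
size.plain <- as.numeric(object.size(plain))
size.named <- as.numeric(object.size(named))
cat("unnamed:", size.plain, "bytes; named:", size.named, "bytes\n")
```

The difference between the two numbers is roughly what each name-indexed vector pays for carrying the names around.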

You are also dynamically increasing the size of the vectors, which means a lot of copying of the objects; that uses a lot of memory and is probably fragmenting it.
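A small sketch of the difference between growing a vector one name at a time and allocating it once up front. The gene names and the "ACGT" placeholder are hypothetical; only the allocation pattern matters here.

```r
n   <- 6000
ids <- sprintf("GENE%06d", seq_len(n))   # hypothetical gene names

# Growing: each assignment to a new name extends the vector by one element.
grown <- character()
for (id in ids) grown[id] <- "ACGT"

# Preallocating: the vector is created once at its final size, then filled
# by integer position.
prealloc <- character(n)
names(prealloc) <- ids
for (i in seq_len(n)) prealloc[i] <- "ACGT"

# Both approaches end up with the same contents.
same <- identical(grown, prealloc)
```

Wrapping each loop in system.time() should show how much work the growing version does compared with the preallocated one.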

So if you look at all these vectors, how many of them will contain data? What you might want to do is to preprocess the data (pass 1) to find out how many 'gene's there are and then create a factor from this. You can then statically allocate the vectors and use the numeric value of the factor to index into the vector.
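Roughly, the factor idea looks like this. The gene labels are invented for illustration, and gene.start is just one of the vectors from your script.

```r
# Hypothetical gene labels as they might be collected in a first pass.
seen <- c("YAL001C", "YAL002W", "YAL001C", "YAL003W")

gene.f  <- factor(seen)          # levels are the unique gene names
n.genes <- nlevels(gene.f)

gene.start <- integer(n.genes)   # allocated once, no names attribute
idx <- as.integer(gene.f)        # integer code for each label

gene.start[idx[2]] <- 1500L            # store a value for the second label
name.back <- levels(gene.f)[idx[2]]    # map the position back to its name
```

The names are stored once, in the levels of the factor, instead of once per vector.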

So you might have fragmentation (that seems to be what your 'top' output is showing). So it looks like a two-pass process: 1) determine how many genes you have and statically allocate, 2) go through the data and use the 'factor' to index into the vectors.
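Put together, the two passes might look something like the following. Since I have not seen your real headers, this assumes (hypothetically) that a header line starts with ">" and that the gene name is the first whitespace-delimited token; the input lines are made up.

```r
# Hypothetical input standing in for readLines( gzfile( seqs.fname ) ).
lines <- c(">YAL001C desc", "ACGT", "ttga", "", ">YAL002W desc", "GGCC")
lines <- lines[lines != ""]

# Pass 1: locate headers, extract gene names, and allocate the result once.
is.header <- substr(lines, 1, 1) == ">"
genes     <- factor(sub("^>([^[:space:]]+).*$", "\\1", lines[is.header]))
upstream  <- character(nlevels(genes))

# Pass 2: assign each sequence line to the header above it, join the lines
# of each record, and fill the preallocated vector by integer position.
record   <- cumsum(is.header)[!is.header]
seqs     <- tapply(lines[!is.header], record, paste, collapse = "")
rec.id   <- as.integer(names(seqs))            # which header each record is
seq.text <- toupper(as.character(seqs))
upstream[as.integer(genes)[rec.id]] <- seq.text
```

The gene name for position k is levels(genes)[k], so nothing has to be stored as names on upstream itself.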

On 1/17/07, Peter Waltman <waltman@cs.nyu.edu> wrote:
>
> Hi -
>
> When I'm trying to read in a text file into a labeled character array,
> the memory stamp/footprint of R will exceed 4 gigs or more. I've seen
> this behavior on Mac OS X, Linux for AMD_64 and X86_64, and the R
> versions are 2.4, 2.4 and 2.2, respectively. So, it would seem that
> this is platform and R version independent.
>
> The file that I'm reading contains the upstream regions of the yeast
> genome, with each upstream region labeled using a FASTA header, i.e.:
>
> FASTA header for gene 1
> upstream region.....
> .....
> ....
> FASTA header for gene 2
> upstream....
> ....
>
> The script I use - code below - opens the file, parses for a FASTA
> header, and then parses the header for the gene name. Once this is
> done, it reads the following lines which contain the upstream region,
> and then adds it as an item to the character array, using the gene name
> as the name of the item it adds. And then continues on to the following
> genes.
>
> Each upstream region (the text to be added) is 550 bases (characters)
> long. With ~6000 genes in the file I'm reading in, this would be 550 *
> 6000 * 8 (if we're using ascii chars) ~= 25 Megs.
>
> I realize that the character arrays/vectors will have a higher memory
> stamp b/c they are a named array and most likely aren't storing the text
> as ascii, but 4 gigs and up seems a bit excessive. Or is it?
>
> For an example, this is the output of top, at the point which R has
> processed around 5000 genes:
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4969 waltman 18 0 *6746m 3.4g* 920 D 2.7 88.2 19:09.19 R
>
> Is this expected behavior? Can anyone recommend a less memory intensive
> way to store this data? The relevant code that reads in the file follows:
>
> ....code....
> lines <- readLines( gzfile( seqs.fname ) )
>
> n.seqs <- 0
>
> upstream <- gene.names <- character()
> syn <- character( 0 )
> gene.start <- gene.end <- integer()
> gene <- seq <- ""
>
>
> for ( i in 1:length( lines ) ) {
>   line <- lines[ i ]
>   if ( line == "" ) next
>   if ( substr( line, 1, 1 ) == ">" ) {
>
>     if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
>     splitted <- strsplit( line, "\t" )[[ 1 ]]
>     splitted <- strsplit( splitted[ 1 ], ";\\ " )[[ 1 ]]
>     gene <- toupper( substr( splitted[ 1 ], 2, nchar( splitted[ 1 ] ) ) )
>     syn <- splitted[ 2 ]
>     if ( ! is.null( syn ) &&
>          length( grep( valid.gene.regexp, gene, perl=T ) ) == 0 &&
>          length( grep( valid.gene.regexp, syn, perl=T ) ) == 1 ) gene <- syn
>     else if ( length( grep( valid.gene.regexp, gene, perl=T,
>                             ignore.case=T ) ) == 0 &&
>               length( grep( valid.gene.regexp, syn, perl=T,
>                             ignore.case=T ) ) == 0 ) next
>     gene.start[ gene ] <- as.integer( splitted[ 9 ] )
>     gene.end[ gene ] <- as.integer( splitted[ 10 ] )
>     if ( n.seqs %% 100 == 0 ) cat.new( n.seqs, gene, "|", syn,
>                                        "| length=", nchar( seq ),
>                                        gene.end[gene]-gene.start[gene]+1, "\n" )
>     if ( ! is.na( syn ) && syn != "" ) gene.names[ gene ] <- syn
>     else gene.names[ gene ] <- toupper( gene )
>     n.seqs <- n.seqs + 1
>     seq <- ""
>   } else {
>     seq <- paste( seq, line, sep="" )
>   }
> }
> if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
>
> ....code....
>
> Thanks,
>
> Peter Waltman
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

Received on Thu Jan 18 12:56:12 2007

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.1.8, at Thu 18 Jan 2007 - 08:30:24 GMT.
