[R] Memory leak with character arrays?

From: Peter Waltman <waltman_at_cs.nyu.edu>
Date: Wed 17 Jan 2007 - 21:54:31 GMT


Hi -

When I try to read a text file into a labeled character array, the memory footprint of R grows to 4 gigs or more. I've seen this behavior on Mac OS X and on Linux for AMD_64 and X86_64, with R versions 2.4, 2.4 and 2.2, respectively. So it would seem that this is platform and R version independent.

The file that I'm reading contains the upstream regions of the yeast genome, with each upstream region labeled using a FASTA header, i.e.:

    FASTA header for gene 1
    upstream region.....
    .....
    ....

    FASTA header for gene 2
    upstream....
    ....

The script I use (code below) opens the file, looks for a FASTA header, and parses the header for the gene name. It then reads the following lines, which contain the upstream region, and adds that sequence as an item to the character array, using the gene name as the name of the item. It then continues on to the next gene.

Each upstream region (the text to be added) is 550 bases (characters) long. With ~6000 genes in the file I'm reading in, this would be 550 * 6000 * 8 bits ~= 26 megabits, i.e. about 3.3 megabytes (if we're using ascii chars).
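As a rough check on that estimate, here is a minimal sketch that builds a synthetic stand-in for the data (random bases and made-up gene names, not the actual yeast file) and compares the raw character count with what R reports for a named character vector of the same shape. The names `fake.upstream`, `n.genes` and `seq.len` are only for illustration.

```r
# Synthetic stand-in for the real data: 6000 sequences of 550 bases each.
set.seed(1)
n.genes <- 6000
seq.len <- 550
fake.upstream <- vapply(
  seq_len(n.genes),
  function(i) paste(sample(c("A", "C", "G", "T"), seq.len, replace = TRUE),
                    collapse = ""),
  character(1)
)
names(fake.upstream) <- sprintf("GENE%04d", seq_len(n.genes))

# One byte per ASCII character, ignoring any per-string overhead.
raw.bytes <- n.genes * seq.len
raw.mb <- raw.bytes / 1e6

# Size of the named character vector as R itself accounts for it.
obj.mb <- as.numeric(object.size(fake.upstream)) / 1e6

cat("raw characters:", raw.mb, "MB; named character vector:", obj.mb, "MB\n")
```

The second figure includes R's per-string and per-name overhead, so it will be somewhat larger than the raw count, but it should stay within the same order of magnitude.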

I realize that the character arrays/vectors will have a larger memory footprint because they are named arrays and most likely aren't storing the text as plain ascii, but 4 gigs and up seems a bit excessive. Or is it?

For example, this is the output of top at the point at which R has processed around 5000 genes:

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND 
     4969 waltman   18   0 *6746m 3.4g*  920 D  2.7 88.2  19:09.19 R    

Is this expected behavior? Can anyone recommend a less memory-intensive way to store this data? The relevant code that reads in the file follows:

     ....code....
     lines <- readLines( gzfile( seqs.fname ) )

     n.seqs <- 0

     upstream <- gene.names <- character()
     syn <- character( 0 )
     gene.start <- gene.end <- integer()
     gene <- seq <- ""

     for ( i in seq_along( lines ) ) {
       line <- lines[ i ]
       if ( line == "" ) next
       if ( substr( line, 1, 1 ) == ">" ) {
         # New header: store the sequence accumulated for the previous gene
         if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )
         # Take the first tab-delimited field and split it on "; "
         splitted <- strsplit( line, "\t" )[[ 1 ]]
         splitted <- strsplit( splitted[ 1 ], ";\\ " )[[ 1 ]]
         gene <- toupper( substr( splitted[ 1 ], 2, nchar( splitted[ 1 ] ) ) )
         # splitted[ 2 ] is NA (never NULL) when the header has no synonym
         syn <- splitted[ 2 ]
         if ( ! is.na( syn ) &&
              length( grep( valid.gene.regexp, gene, perl=TRUE ) ) == 0 &&
              length( grep( valid.gene.regexp, syn, perl=TRUE ) ) == 1 ) gene <- syn
         else if ( length( grep( valid.gene.regexp, gene, perl=TRUE, ignore.case=TRUE ) ) == 0 &&
                   length( grep( valid.gene.regexp, syn, perl=TRUE, ignore.case=TRUE ) ) == 0 ) {
           # Skip this record; reset so its sequence lines are not appended
           # to the previous sequence and stored under an invalid name
           gene <- seq <- ""
           next
         }
         gene.start[ gene ] <- as.integer( splitted[ 9 ] )
         gene.end[ gene ] <- as.integer( splitted[ 10 ] )
         # Progress report every 100 sequences
         if ( n.seqs %% 100 == 0 ) cat.new( n.seqs, gene, "|", syn, "| length=", nchar( seq ),
                                            gene.end[ gene ] - gene.start[ gene ] + 1, "\n" )
         if ( ! is.na( syn ) && syn != "" ) gene.names[ gene ] <- syn
         else gene.names[ gene ] <- toupper( gene )
         n.seqs <- n.seqs + 1
         seq <- ""
       } else {
         # Sequence line: append to the current upstream region
         seq <- paste( seq, line, sep="" )
       }
     }
     # Store the last sequence in the file
     if ( seq != "" && gene != "" ) upstream[ gene ] <- toupper( seq )

     ....code....
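One possible alternative to the loop above, sketched here only as an illustration, is to group the lines by record with `cumsum()` and paste each group together in one pass, so that no vector is grown element by element. The function name `parse.fasta` is hypothetical, and the sketch assumes that headers start with ">" and that the gene name is the header text up to the first ";" or tab. It leaves out the `valid.gene.regexp` filtering, the synonym handling and the start/end coordinates. Whether this changes the memory footprint described here has not been verified.

```r
# Hypothetical helper: turn FASTA lines into a named character vector
# without growing any vector inside a loop.
parse.fasta <- function(lines) {
  lines <- lines[lines != ""]                 # drop blank lines
  is.header <- substr(lines, 1, 1) == ">"
  if (!any(is.header)) return(character(0))

  record <- cumsum(is.header)                 # record index for every line
  keep <- record > 0 & !is.header             # sequence lines inside a record

  # Group sequence lines by record; empty records yield "".
  groups <- split(lines[keep],
                  factor(record[keep], levels = seq_len(sum(is.header))))
  seqs <- vapply(groups, paste, character(1), collapse = "")

  # Assumption: gene name = header text after ">" up to the first ';' or tab.
  genes <- toupper(sub("^>([^;\t]*).*$", "\\1", lines[is.header]))

  structure(toupper(unname(seqs)), names = genes)
}

# Small made-up input, only to show the call.
example.lines <- c(">yal001c; TFC3\tchr1", "acgt", "ttga", "",
                   ">YAL002W; VPS8", "ggcc")
example.upstream <- parse.fasta(example.lines)
print(example.upstream)
```

With the real file, the call would be something like `parse.fasta( readLines( gzfile( seqs.fname ) ) )`, followed by whatever name filtering is needed.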

Thanks,

Peter Waltman



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Received on Thu Jan 18 09:06:57 2007

