[Rd] Recent changes in R related to CHARSXPs

From: Seth Falcon <sfalcon_at_fhcrc.org>
Date: Fri, 25 May 2007 08:36:00 -0700

Hello all,

I want to highlight a recent change in R-devel to the larger developeR community. As of r41495, R maintains a global cache of CHARSXPs such that each unique string is stored only once in memory. For many common use cases, such as dimnames of matrices and keys in environments, the result is a significant savings in memory (and time under some circumstances).

A result of these changes is that CHARSXPs must be treated as read-only objects and must never be modified in place by writing through the char* returned by CHAR(). If you maintain a package that manipulates CHARSXPs, you should check whether you make such in-place modifications. If you do, the general solution is as follows:

   If you need a temporary char buffer, you can allocate one with a new helper macro like this:

     /* CallocCharBuf takes care of the +1 for the \0,
        so the size argument is the length of your string. */
     char *tmp = CallocCharBuf(n);

     /* manipulate tmp */
     SEXP schar = mkChar(tmp);

   You can also use R_alloc, which has the advantage that you do not have to free the buffer yourself in a .Call function.
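
To make the copy-then-modify pattern concrete, here is a minimal sketch in plain C. It deliberately avoids the R headers so it compiles on its own: upper_copy is a hypothetical helper, malloc stands in for CallocCharBuf or R_alloc, and in real package code the input would be CHAR(x) and the result would be handed to mkChar.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the R API): return an upper-cased
   copy of src and leave src itself untouched. In package code, src
   would be the pointer returned by CHAR(x). */
char *upper_copy(const char *src)
{
    assert(src != NULL);
    size_t n = strlen(src);

    /* malloc is a stand-in here; with CallocCharBuf(n) the +1 for the
       terminating '\0' would be added for you. */
    char *tmp = malloc(n + 1);
    if (tmp == NULL)
        return NULL;

    /* All writes go to the private buffer, never to src. */
    for (size_t i = 0; i < n; i++)
        tmp[i] = (char) toupper((unsigned char) src[i]);
    tmp[n] = '\0';

    return tmp;   /* the caller owns this buffer and must free it */
}
```

The point is only that every write lands in the temporary buffer; the original string is read and never written.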

The mkChar function now consults the global CHARSXP cache and will return an already existing CHARSXP if one with a matching string exists. Otherwise, it will create a new one and add it to the cache before returning it.
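
As a rough illustration of the lookup-or-insert behavior, here is a toy string cache in plain C. This is not R's implementation: intern, cache, and MAX_STRINGS are names made up for this sketch, and the linear scan is there only to keep it short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STRINGS 1024   /* arbitrary bound for this toy cache */

static char *cache[MAX_STRINGS];
static size_t cache_len = 0;

/* Hypothetical stand-in for mkChar: return the cached copy of s,
   creating and caching a copy first if no matching string exists.
   Returns NULL if the cache is full or allocation fails. */
const char *intern(const char *s)
{
    assert(s != NULL);

    /* Look for an existing entry with the same contents. */
    for (size_t i = 0; i < cache_len; i++)
        if (strcmp(cache[i], s) == 0)
            return cache[i];

    if (cache_len == MAX_STRINGS)
        return NULL;

    /* Not found: store a private copy and return it. */
    size_t n = strlen(s);
    char *copy = malloc(n + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, n + 1);
    cache[cache_len++] = copy;
    return copy;
}
```

Two calls with equal contents return the same pointer, so intern("alpha") == intern("alpha") holds even when the arguments live in different buffers. The const return type reflects that cached strings are shared and must not be written to.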

In a discussion with Herve Pages, he suggested that the return type of CHAR(), at least for package code, be modified from (char *) to (const char *). I think this is an excellent suggestion because it will allow the compiler to alert us to package C code that might be modifying CHARSXPs in-place. This hasn't happened yet, but I'm hoping that a patch for this will be applied soon (unless better suggestions for improvement arise through this discussion :-)

One other thing is worth mentioning: at present, not all CHARSXPs are captured by the cache. I think the goal is to refine things so that all CHARSXPs _are_ in the cache. At that point, strcmp calls can be replaced with pointer comparisons, which should provide some nice speedups. So part of the idea is that the way to get CHARSXPs is via mkChar or mkString, and that one should not use allocString, etc.

Finally, here is a comparison of time and memory for loading all the environments (hash tables) in Bioconductor's GO annotation data package.

## unpatched

> gc()

             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 168891  9.1     350000 18.7   350000 18.7
    Vcells 115731  0.9     786432  6.0   425918  3.3

> library("GO")
> system.time(for (e in ls(2)) get(e))
       user  system elapsed
     51.919   1.168  53.228

> gc()

               used  (Mb) gc trigger   (Mb) max used  (Mb)
    Ncells 17879072 954.9   19658017 1049.9 18683826 997.9
    Vcells 31702823 241.9   75190268  573.7 53912452 411.4

## patched

> gc()

             used (Mb) gc trigger (Mb) max used (Mb)
    Ncells 154717  8.3     350000 18.7   350000 18.7
    Vcells 133613  1.1     786432  6.0   483138  3.7

> library("GO")
> system.time(for (e in ls(2)) get(e))
       user  system elapsed
     31.166   0.736  31.998

> gc()

               used  (Mb) gc trigger  (Mb) max used  (Mb)
    Ncells  5837253 311.8    6910418 369.1  6193578 330.8
    Vcells 16831859 128.5   45712717 348.8 39456690 301.1

Best Wishes,

+ seth

Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center

R-devel_at_r-project.org mailing list
Received on Fri 25 May 2007 - 15:38:16 GMT
