[Rd] Unnecessary extra copy with matrix(..., dimnames=NULL) (Was: Re: modifying large R objects in place)

From: Henrik Bengtsson <hb_at_stat.berkeley.edu>
Date: Thu, 27 Sep 2007 11:30:34 -0700


As others already mentioned, in your example you are first creating an integer matrix and the coercing it to a double matrix by assigning (double) 1 to element [1,1]. However, even when correcting for this mistake, there is an extra copy created when using matrix().

Try this in a fresh vanilla R session:

> print(gc())

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 136684  3.7     350000  9.4   350000  9.4
Vcells  81026  0.7     786432  6.0   473127  3.7

> x <- matrix(1, nrow=5000, ncol=5000)
> print(gc())
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   136793   3.7     350000   9.4   350000   9.4
Vcells 25081043 191.4 27989266 213.6 25081056 191.4
> x[1,1] <- 2
> print(gc())
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   136797   3.7     350000   9.4   350000   9.4
Vcells 25081044 191.4 52830254 403.1 50081058 382.1

So, yes, in that x[1,1] <- 2 assignment an extra copy is created. It is related to to the fact that there is NAMED matrix object being created inside matrix(), cf. the last rows in matrix():

    x <- .Internal(matrix(data, nrow, ncol, byrow))     dimnames(x) <- dimnames
    x

Here is a patch for matrix() that avoids this problem *when dimnames is NULL* (which is many time the case):

matrix <- function(data=NA, nrow=1, ncol=1, byrow=FALSE, dimnames=NULL) {   data <- as.vector(data);

  if(missing(nrow)) {

    nrow <- ceiling(length(data)/ncol);
  } else if(missing(ncol)) {
    ncol <- ceiling(length(data)/nrow);

  }

  # Trick to avoid extra copy in the case when 'dimnames' is NULL.   if (is.null(dimnames)) {
    .Internal(matrix(data, nrow, ncol, byrow));   } else {
    x <- .Internal(matrix(data, nrow, ncol, byrow));     dimnames(x) <- dimnames;
    x;
  }
} # matrix()

Try the above again in a fresh R session with this patch applied and you'll get:

> print(gc())

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 136805  3.7     350000  9.4   350000  9.4
Vcells  81122  0.7     786432  6.0   473127  3.7

> x <- matrix(1, nrow=5000, ncol=5000)
> print(gc())
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   136919   3.7     350000   9.4   350000   9.4
Vcells 25081139 191.4 27989372 213.6 25081152 191.4
> x[1,1] <- 2
> print(gc())
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   136923   3.7     350000   9.4   350000   9.4
Vcells 25081140 191.4 29468840 224.9 25081276 191.4

Voila!

I talked to Luke Tierney about this and he though the internal method should be updated to take the dimnames argument, i.e. .Internal(matrix(data, nrow, ncol, byrow, dimnames)). However, until that is happening, may I suggest this simple patch/workaround to go in R v2.6.0?

Cheers

Henrik

On 9/27/07, Petr Savicky <savicky_at_cs.cas.cz> wrote:
> On Wed, Sep 26, 2007 at 10:52:28AM -0700, Byron Ellis wrote:
> > For the most part, doing anything to an R object result in it's
> > duplication. You generally have to do a lot of work to NOT copy an R
> > object.

>

> Thank you for your response. Unfortunately, you are right. For example,
> the allocated memory determined by top command on Linux may change during
> a session as follows:
> a <- matrix(as.integer(1),nrow=14100,ncol=14100) # 774m
> a[1,1] <- 0 # 3.0g
> gc() # 1.5g
>

> In the current applicatin, I modify the matrix only using my own C code
> and only read it on R level. So, the above is not a big problem for me
> (at least not now).
>

> However, there is a related thing, which could be a bug. The following
> code determines the value of NAMED field in SEXP header of an object:
>

> SEXP getnamed(SEXP a)
> {
> SEXP out;
> PROTECT(out = allocVector(INTSXP, 1));
> INTEGER(out)[0] = NAMED(a);
> UNPROTECT(1);
> return(out);
> }
>

> Now, consider the following session
>

> u <- matrix(as.integer(1),nrow=5,ncol=3) + as.integer(0)
> .Call("getnamed",u) # 1 (OK)
>

> length(u)
> .Call("getnamed",u) # 1 (OK)
>

> dim(u)
> .Call("getnamed",u) # 1 (OK)
>

> nrow(u)
> .Call("getnamed",u) # 2 (why?)
>

> u <- matrix(as.integer(1),nrow=5,ncol=3) + as.integer(0)
> .Call("getnamed",u) # 1 (OK)
> ncol(u)
> .Call("getnamed",u) # 2 (so, ncol does the same)
>

> Is this a bug?
>

> Petr Savicky.
>

> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Thu 27 Sep 2007 - 18:33:46 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Thu 27 Sep 2007 - 18:41:41 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.