Re: [Rd] modifying large R objects in place

From: Petr Savicky <savicky_at_cs.cas.cz>
Date: Sat, 29 Sep 2007 10:28:34 +0200

On Fri, Sep 28, 2007 at 08:14:45AM -0500, Luke Tierney wrote: [...]
> [...] A related issue is that user-defined
> assignment functions always see a NAMED of 2 and hence cannot modify
> in place. We've been trying to come up with a reasonable solution to
> this, so far without success but I'm moderately hopeful.

If a user-defined function evaluates its body in its parent environment, using Peter Dalgaard's suggestion of eval.parent(substitute( .... )), then the NAMED attribute is not increased and the function can modify its argument in place.

On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
> Longer-term, I still have some hope for better reference counting, but
> the semantics of environments make it really ugly -- an environment can
> contain an object that contains the environment, a simple example being
>
> f <- function()
> g <- function() 0
> f()
>

On Fri, Sep 28, 2007 at 09:46:39AM -0400, Duncan Murdoch wrote:
> f has no input; it's output is the function g, whose environment is the
> evaluation environment of f. g is never used, but it is returned as the
> value of f. Thus we have the loop:
>
> g refers to the environment.
> the environment contains g.
>
> Even though the result of f() was never saved, two things (the
> environment and g) got created and each would have non-zero reference
> count.

Thank you very much for the example and explanation. I would not have guessed that something like this is possible, but now I see that it may, in fact, be quite common. For example

  something <- function()
  {
      a <- 1:5
      b <- 6:10
      c <- c("a","a","b","b","b")
      mf <- model.frame(c ~ a + b)
      mf
  }
  mf1 <- something()
  e1 <- attr(attr(mf1,"terms"),".Environment")
  mf2 <- eval(expression(mf),envir=e1)
  e2 <- attr(attr(mf2,"terms"),".Environment")
  print(identical(e1,e2)) # TRUE

seems to be a similar situation. Here, the references go in the sequence mf1 -> e1 -> mf2 -> e1. I think that mf2 is in fact the same object as mf1, but I do not know how to demonstrate this. However, both mf1 and mf2 refer to the same environment, so e1 -> mf2 -> e1 is a cycle for sure.
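
The simpler cycle described by Duncan Murdoch can be checked in the same way. The following is only a minimal sketch using base R; it returns g explicitly for clarity, and the names g1, e, has_g and points_back are introduced here for illustration.

```r
# Peter Dalgaard's example, with g returned explicitly for clarity.
f <- function() {
    g <- function() 0
    g
}

g1 <- f()
e <- environment(g1)   # the evaluation environment of the call f()

# The environment contains g ...
has_g <- exists("g", envir = e, inherits = FALSE)

# ... and that g refers back to the very same environment.
points_back <- identical(environment(get("g", envir = e)), e)

print(c(has_g, points_back))
```

If both values are TRUE, then the environment and g refer to each other, which is the loop that makes exact reference counting awkward.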

On Fri, Sep 28, 2007 at 08:14:45AM -0500, Luke Tierney wrote:
> >If yes, is it possible during gc() to determine also cases,
> >when NAMED may be dropped from 2 to 1? How much would this increase
> >the complexity of gc()?
>
> Probably not impossible but would be a fair bit of work with probably
> not much gain as the NAMED values would still be high until the next
> gc of the appropriate level, which will probably be a fair time as an
> object being modified is likely to be older, but the interval in which
> there would be a benefit is short.

On Fri, Sep 28, 2007 at 04:36:40PM +0100, Prof Brian Ripley wrote: [...]
> On Fri, 28 Sep 2007, Luke Tierney wrote:
[...]
> >approach may be possible. A related issue is that user-defined
> >assignment functions always see a NAMED of 2 and hence cannot modify
> >in place. We've been trying to come up with a reasonable solution to
> >this, so far without success but I'm moderately hopeful.
>
> I am not persuaded that the difference between NAMED=1/2 makes much
> difference in general use of R, and I recall Ross saying that he no longer
> believed that this was a worthwhile optimization. It's not just
> 'user-defined' replacement functions, but also all the system-defined
> closures (including all methods for the generic replacement functions
> which are primitive) that are unable to benefit from it.

I am thinking about the following situation. The user creates a large matrix A and then performs a sequence of operations on it. Some of the operations scan the matrix in a read-only manner (e.g. calculating some summaries); others are top-level commands that modify the matrix itself. I do not argue that such a sequence of operations should be done in place by default. However, I think that R should provide tools that allow this to be done in place if the user does some extra work. If the matrix is really large, then in-place operations are not only more space efficient, but also more time efficient.

Using the information from the current thread, there are two possible approaches to achieve this.

1. The initial matrix should not be generated by the "matrix" function, due to
   the observation by Henrik Bengtsson (this is the issue with dimnames).
   The matrix may be initialized using e.g.
   .Internal(matrix(data, nrow, ncol, byrow))

   The matrix should not be scanned using an R function that evaluates
   its body in its own environment. This includes functions nrow, ncol,
   colSums, rowSums and probably more. The matrix may be scanned by
   functions that use eval.parent(substitute( .... )) and avoid giving
   the matrix a new name. The user may prepare versions of nrow, ncol,
   colSums, rowSums, etc. with this property.

2. If the NAMED attribute of A may be decreased from 2 to 1 during an operation
   similar to garbage collection (if A is not in a reference cycle), then the
   above approach may also be combined with operations that themselves work
   in place and read-only, but increase NAMED(A) as a side effect. In this
   case, the user should explicitly invoke the "NAMED reduction" after such
   operations. If the user has only a small number of large objects, then
   gc() is faster than duplication of some of the large objects. So, I expect
   that the "NAMED reduction" could also be more time efficient than some
   of the unwanted duplications.
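
To make the first approach concrete, here is a rough sketch of helpers built on eval.parent(substitute( .... )). The names nrow_parent and set_cell are hypothetical, not existing R functions, and whether this actually avoids a duplication depends on the R version and is not verified here. The sketch only shows that the expression is evaluated in the caller's frame, so the matrix is never bound to a new name inside the helper.

```r
# Hypothetical read-only helper: evaluates dim(<arg>)[1L] in the caller's
# frame instead of touching the argument inside the helper itself.
nrow_parent <- function(x) eval.parent(substitute(dim(x)[1L]))

# Hypothetical modifying helper: evaluates <arg>[i, j] <- value in the
# caller's frame, so the assignment acts on the caller's variable.
set_cell <- function(x, i, j, value) eval.parent(substitute(x[i, j] <- value))

A <- matrix(0, nrow = 3, ncol = 2)
n <- nrow_parent(A)     # number of rows, computed in the caller's frame
set_cell(A, 1, 2, 5)    # modifies A itself; no "A <- ..." is needed
print(n)
print(A[1, 2])
```

Such helpers must be called with a variable name (or another expression that can be assigned to) as the first argument, because the substituted expression is evaluated as written in the calling frame.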

During the previous discussion, exact counting of references was mentioned a few times. So, I want to state explicitly that I do not think it is a good idea. In my opinion, it is definitely not reasonable now. I am very satisfied with the stability of R sessions, and this would be endangered during a transition to full counting. Moreover, I can imagine (I am not an expert on this) that the efficiency and simplicity benefits of the guaranteed approximate counting outweigh its disadvantages (somewhat more duplication than necessary) in a typical R session.

However, there are situations where the cost of duplication is too high and the user knows this in advance. In such situations, having more tools for explicit control of duplication could help. The tools could be, for example, a function that allows a simple query of the NAMED status of a given object at the R level, and modifications of some of the built-in functions to make them more careful with the NAMED attribute. A possible strengthening of gc() would, of course, be very useful here. I am thinking of an explicit use of it, not of the automatic runs. So, for safety reasons, the "NAMED reduction" could be done by a separate function, not by the default gc() itself.
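
One existing tool that goes in this direction is tracemem(), which reports when a marked object is duplicated. It does not expose NAMED itself, and it is only available when R was built with memory profiling, so the sketch below guards the calls with capabilities("profmem"). The variable names are chosen here for illustration.

```r
A <- matrix(0, nrow = 100, ncol = 100)

traced <- capabilities("profmem")   # TRUE if memory profiling is available
if (traced) tracemem(A)             # mark A; duplications are then reported

B <- A                              # a second name for the same object
B[1, 1] <- 1                        # expected to trigger a duplication report

if (traced) untracemem(A)           # stop tracing
```

When tracing is available, the assignment to B[1, 1] is the place where a duplication message is expected, while A itself stays unchanged.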

Petr Savicky.



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Sat 29 Sep 2007 - 08:32:23 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Sat 29 Sep 2007 - 14:41:35 GMT.
