Re: [Rd] modifying large R objects in place

From: Luke Tierney <luke_at_stat.uiowa.edu>
Date: Fri, 28 Sep 2007 08:14:45 -0500 (CDT)

On Fri, 28 Sep 2007, Petr Savicky wrote:

> On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
> [...]
>>> nrow <- function(...) dim(...)[1]
>>> ncol <- function(...) dim(...)[2]
>>>
>>> At least in my environment, the new versions preserved NAMED == 1.

I believe this is a bug in the evaluation of ... arguments. The intent in the code is, I believe, that all promise evaluations result in NAMED==2 for safety. That may be overly conservative, but I would not want to change it without some very careful thought -- I prefer to wait a little longer for the right answer than to get a wrong one quickly.
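A minimal sketch of how one might check whether a call such as nrow(x) leads to a copy on the next modification. It assumes an R build with memory profiling, so that tracemem() is available; the helper name nrow_dots and the matrix size are illustrative only. Whether a copy is reported depends on the R version, so no output is shown.

```r
# Hypothetical helper mirroring the "..." variant quoted above.
nrow_dots <- function(...) dim(...)[1]

x <- matrix(0, nrow = 1000, ncol = 1000)

# tracemem() prints a message whenever the traced object is duplicated.
# It needs memory profiling support, so guard the call.
if (capabilities("profmem")) tracemem(x)

n1 <- nrow(x)        # closure with formal argument x
x[1, 1] <- 1         # a tracemem message here means x was copied

n2 <- nrow_dots(x)   # variant passing the argument through ...
x[1, 2] <- 2         # compare: is a copy reported here?

if (capabilities("profmem")) untracemem(x)
```

Comparing the two assignments shows whether the argument-passing style changes the copying behaviour in a given build.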

>>>
>> Yes, but changing the formal arguments is a bit messy, is it not?
>
> Specifically for nrow, ncol, I think not much, since almost nobody needs
> to know (or even knows) that the name of the formal argument is "x".
>
> However, there is another argument against the ... solution: it solves
> the problem only in the simplest cases like nrow, ncol, but is not
> usable in other, like colSums, rowSums. These functions also increase
> NAMED of its argument, although their output does not contain any
> reference to the original content of their arguments.
>
> I think that a systematic solution of this problem may be helpful.
> However, making these functions Internal or Primitive would
> not be good in my opinion. It is advantageous that these functions
> contain an R level part, which
> makes the basic decisions before a call to .Internal.
> If nothing else, this serves as a sort of documentation.
>
> For my purposes, I replaced calls to "colSums" and "matrix" by the
> corresponding calls to .Internal in my script. The result is that
> now I can complete several runs of my calculation in a cycle instead
> of restarting R after each of the runs.
>
> This leads me to a question. Some of the tests, which I did, suggest
> that gc() may not free all the memory, even if I remove all data
> objects by rm() before calling gc(). Is this possible or I must have
> missed something?

Not impossible but very unlikely given the use gc gets. There are a few internal tables that are grown but not shrunk at the moment, but that should not usually cause much total growth. If you are looking at system memory use then that is a malloc issue -- there was a thread about this a month or so ago.
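A small sketch of one way to look at this from within R, rather than from the system's view of the process. The object name big and its size are illustrative. The numbers gc() returns describe R's own heap; what the operating system reports for the process can stay higher, depending on malloc.

```r
big <- numeric(1e7)    # about 80 MB of doubles

before <- gc()         # matrix with rows "Ncells" and "Vcells"
rm(big)
after <- gc()

# Cells released between the two collections, as seen by R itself.
freed <- before[, "used"] - after[, "used"]
freed
```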

> A possible solution to the unwanted increase of NAMED due to temporary
> calculations could be to give the user the possibility
> to store NAMED attribute of an object before a call to a function
> and restore it after the call. To use this, the user should be
> confident that no new reference to the object persists after the
> function is completed.

This would be too dangerous for general use. Some more structured approach may be possible. A related issue is that user-defined assignment functions always see a NAMED of 2 and hence cannot modify in place. We've been trying to come up with a reasonable solution to this, so far without success, but I'm moderately hopeful.
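To illustrate the assignment-function case, here is a minimal sketch with a hypothetical replacement function `first<-`. It again assumes a build where tracemem() is available, and whether a copy is reported depends on the R version.

```r
# Hypothetical replacement function: set the first element of a vector.
`first<-` <- function(x, value) {
  x[1] <- value
  x
}

v <- numeric(1e6)
if (capabilities("profmem")) tracemem(v)

# R evaluates this roughly as
#   `*tmp*` <- v; v <- `first<-`(`*tmp*`, value = 10); rm(`*tmp*`)
# so the vector is reachable from two places while the function runs.
first(v) <- 10

if (capabilities("profmem")) untracemem(v)
```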

>> Presumably, nrow <- function(x) eval.parent(substitute(dim(x)[1])) works
>> too, but if the gain is important enough to warrant that sort of
>> programming, you might as well make nrow a .Primitive.
>
> You are right. This indeed works.
>
>> Longer-term, I still have some hope for better reference counting, but
>> the semantics of environments make it really ugly -- an environment can
>> contain an object that contains the environment, a simple example being
>>
>> f <- function()
>> g <- function() 0
>> f()
>>
>> At the end of f(), we should decide whether to destroy f's evaluation
>> environment. In the present example, what we need to be able to see is
>> that this would remove all references to g and that the reference from g
>> to f can therefore be ignored. Complete logic for sorting this out is
>> basically equivalent to a new garbage collector, and one can suspect
>> that applying the logic upon every function return is going to be
>> terribly inefficient. However, partial heuristics might apply.

>
> I have to say that I do not understand the example very much.
> What is the input and output of f? Is g inside only defined or
> also used?
>
> Let me ask the following question. I assume that gc() scans the whole
> memory and determines for each part of data, whether a reference
> to it still exists or not. In my understanding, this is equivalent to
> determine, whether NAMED of it may be dropped to zero or not.
> Structures, for which this succeeds are then removed. Am I right?
> If yes, is it possible during gc() to determine also cases,
> when NAMED may be dropped from 2 to 1? How much would this increase
> the complexity of gc()?

Probably not impossible, but it would be a fair bit of work for probably not much gain. The NAMED values would still be high until the next gc of the appropriate level, which will probably be a fair time away since an object being modified is likely to be older, whereas the interval in which there would be a benefit is short.

The basic functional model that underlies having the illusion of non-modifiable vector data does not fit all that well with an imperative style of modifying things in loops. It might be useful to bring in some constructs from functional programming that are designed to allow in-place modification to coexist with functional semantics. Probably a longer term issue.

For now there are limits to what we can reasonably, and maintainably, do in an interpreted R. Having full reference counts might help, but might not because of other costs involved (significant increases in cache misses in particular). In any case it would probably be easier to rewrite R from scratch than to retrofit full reference counting to what we have, so I can't see it happening real soon. Also it doesn't help with many things, like user-level assignment: there really are two references at the key point in that case. With compilation it may be possible to do some memory use analysis and work out when it is safe to do destructive modification, but that is a fair way off as well.

Best,

luke

>
> Thank you in advance for your kind reply.
>
> Petr Savicky.
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke_at_stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

Received on Fri 28 Sep 2007 - 13:18:04 GMT