R-alpha: Memory

Ross Ihaka (ihaka@stat.auckland.ac.nz)
Wed, 22 May 1996 14:01:23 +1200


Date: Wed, 22 May 1996 14:01:23 +1200
From: Ross Ihaka <ihaka@stat.auckland.ac.nz>
Message-Id: <199605220201.OAA28307@stat.auckland.ac.nz>
To: bent@stat.ubc.ca
Subject: R-alpha: Memory
In-Reply-To: <199605211637.JAA28247@fisher.stat.ubc.ca>

bent@stat.ubc.ca writes:

 > From S-NEWS@utstat.toronto.edu
 > 
 > The message below is from the S-Plus list. The first message is a reply to
 > the second. I wonder how R would cope with such large data. I believe that
 > GLIM does not form the design matrix explicitly, and I thought this 
 > would now be standard, but maybe not?  
 >     Bent

[ letters elided ]
 

Hmmm. Letsee.

	17065 * 107 * 8 = 14607640

Let's say 15 megabytes - not too bad.

However, there is a point in the computation where R would need three
copies of this matrix (yep that's right, THREE), so we're up to 45Mb.
I'd say invoking R with the command

	R -v60

might do it (but you'd better have about 80Mb of RAM in your machine if
you want your computation to terminate before the universe ends).
I'm not sure how many copies S makes during a fit, but it isn't
likely to be much different.
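
The arithmetic above can be reproduced with a few lines of Python. This
is only a back-of-the-envelope sketch: the function name matrix_bytes is
made up for illustration, and it assumes 8-byte double-precision values
and nothing but the raw matrix storage (no per-object overhead).

```python
def matrix_bytes(rows, cols, bytes_per_value=8):
    """Raw storage for a dense rows x cols matrix of doubles."""
    return rows * cols * bytes_per_value

# Dimensions quoted in the message: 17065 observations, 107 columns.
one_copy = matrix_bytes(17065, 107)

# The fit is said to need three simultaneous copies of the matrix.
three_copies = 3 * one_copy

print(one_copy, "bytes for one copy")
print(three_copies, "bytes for three copies")
```

One copy is a little under 15 million bytes and three copies a little
under 45 million, which is where the rounded figures come from.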

The problem here is not the size of the design matrix (15Mb is quite
small), but rather the fact that both R and S try to be "call-by-value"
and so tend to make a fair number of copies of things.
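
To make the cost of call-by-value concrete, here is a small Python
sketch. It is an emulation, not a description of how R or S work
internally: Python passes object references, so the "by value" version
copies its argument explicitly. The names scale_by_value,
scale_by_reference and copies_made are hypothetical.

```python
copies_made = 0

def scale_by_value(matrix, factor):
    """Emulated call-by-value: work on a private copy of the argument."""
    global copies_made
    local = [row[:] for row in matrix]   # full copy of the data
    copies_made += 1
    for row in local:
        for j in range(len(row)):
            row[j] *= factor
    return local

def scale_by_reference(matrix, factor):
    """Call-by-reference style: modify the caller's data in place."""
    for row in matrix:
        for j in range(len(row)):
            row[j] *= factor
    return matrix

x = [[1.0, 2.0], [3.0, 4.0]]

# Two chained "by value" calls leave x alone but copy the data twice.
y = scale_by_value(scale_by_value(x, 2.0), 0.5)
x_after_by_value = [row[:] for row in x]

# The in-place version makes no copies, but the caller's x changes.
z = scale_by_reference(x, 2.0)
```

The trade-off is the usual one: the copying version is safe for the
caller but its memory use grows with every step that touches the
matrix, while the in-place version is cheap but has side effects.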

Even with the enormous amount of RAM present in most modern computers,
having three copies of a big design matrix around does seem rather
wasteful, but the only way I can see around this is to move all the
interpreted fitting code into hand-coded C.  This would be a LARGE
undertaking.

Because of this, a future Son-of-R (vaguely on the drawing boards)
will be implemented using call-by-reference instead of call-by-value
(there are other reasons for doing this besides cutting down memory
use).  I'm going to be muttering something about this at the upcoming
interface meeting.


It is true that you don't need the entire design matrix to compute
regression results, but keeping the QR decomposition around as a basic
summary statistic is very useful (try getting the hat matrix out of
GLIM).  We (like S) use the Householder form of QR, because it has the
most compact representation.
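
As an illustration of why the Householder form is a handy summary, the
sketch below keeps only the reflector vectors and still recovers the
diagonal of the hat matrix. It is a minimal pure-Python version written
for this note, not the code R or S use; the function names are
hypothetical, and it assumes a small design matrix of full column rank
given as a list of rows.

```python
import math

def apply_reflector(v, k, y):
    """In place: y <- (I - 2 v v'/v'v) y, acting on entries k.. of y."""
    vv = sum(t * t for t in v)
    if vv == 0.0:
        return
    f = 2.0 * sum(v[i] * y[k + i] for i in range(len(v))) / vv
    for i in range(len(v)):
        y[k + i] -= f * v[i]

def householder_qr(x):
    """Return (vs, cols): reflector vectors and the reduced columns (R)."""
    cols = [list(c) for c in zip(*x)]      # column-major working copy
    p = len(cols)
    vs = []
    for k in range(p):
        v = cols[k][k:]                    # copy of the part to zero out
        norm = math.sqrt(sum(t * t for t in v))
        if norm > 0.0:
            v[0] += math.copysign(norm, v[0])
        for j in range(k, p):
            apply_reflector(v, k, cols[j])
        vs.append(v)
    return vs, cols

def hat_diagonal(vs, n):
    """Leverages h_ii, using the first p columns of Q built from vs."""
    p = len(vs)
    h = [0.0] * n
    for j in range(p):
        q = [0.0] * n
        q[j] = 1.0                         # unit vector e_j
        for k in reversed(range(p)):       # Q e_j = H_0 H_1 ... H_{p-1} e_j
            apply_reflector(vs[k], k, q)
        for i in range(n):
            h[i] += q[i] * q[i]
    return h

# Intercept plus one covariate, four observations.
design = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
vs, reduced = householder_qr(design)
leverages = hat_diagonal(vs, len(design))
```

The reflector vectors have lengths n, n-1, ..., n-p+1, so together with
the triangular factor they need roughly the same storage as the design
matrix itself, whereas an explicit n by n Q would not fit for a problem
of the size discussed here.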
	Ross