From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>

Date: Sun, 03 Aug 2008 08:07:03 +0100 (BST)

There are several issues here, and a good knowledge of the R Internals manual seems a prerequisite (and, considering where this started, of the relevant help pages!).

R uses its integer type for indexing vectors and arrays (which can be indexed as vectors or via an appropriate number of indices). So if we allow more than 2^31-1 elements we need a way to index them in R. One idea would be to make R's integer type int64 on suitable platforms, but that would have really extensive ramifications (in the R sources and in packages). Another idea (I think suggested by Luke Tierney) is to allow double() indices which might be less disruptive. Adding a new type would be a very serious amount of work (speaking as someone who has done it).

Another issue is the use of Fortran for matrix algebra. That is likely to cause portability issues, and there's no point in doing it unless one has an efficient implementation of e.g. BLAS/LAPACK, the reference implementation being slow at even 100 million elements. (That's probably not an empty set, as I see ACML has an int64 BLAS.)

There are lots of portability issues -- e.g. the current save() format is the same on all systems and we have complete interoperability. (That cannot be the case if we allow big vectors to be saved.)

But at present I don't see a significant number of applications any time soon. 2 billion items in a homogeneous group is a *lot* of data. I know there are applications with 2 billion items of data already, but is it appropriate to store them in a single vector or matrix rather than say a data frame or DBMS tables? And will there 'ever' be more than a tiny demand for such applications? (I am excluding mainly-zero vectors, and Martin has already pointed out that we have ways to deal with those.)

It is three or four years since we first discussed some of the options, and at the time we thought it would be about five years before suitably large machines became available to more than a handful of R users. That still seems about right: >=128GB systems (which is about what you need for larger than 16GB objects) may start to become non-negligible in a year or two.

R is a volunteer project with limited resources -- there are AFAIK fewer than a handful of people with the knowledge of R internals to tackle these issues. Only if one of them has a need to work with larger datasets is this likely to be worked on.

> >>>>> "VK" == Vadim Kutsyy <vadim@kutsyy.com>
> >>>>>     on Fri, 01 Aug 2008 10:22:43 -0700 writes:
>
>     VK> Martin Maechler wrote:
>     >> [[Topic diverted from R-help]]
>     >>
>     >> Well, fortunately, reasonable compilers have indeed kept
>     >> 'long' == 'long int' to mean 32-bit integers ((less
>     >> reasonable compiler writers have not, AFAIK: which leads
>     >> of course to code that no longer compiles correctly when
>     >> originally it did)). But of course you are right that
>     >> 64-bit integers (typically == 'long long', and really ==
>     >> 'int64') are very natural on 64-bit architectures. But
>     >> see below.
>
>     ... I wrote complete rubbish, and I am embarrassed ...
>     >>
>     VK> well, in 64-bit Ubuntu, /usr/include/limits.h defines:
>
>     VK> /* Minimum and maximum values a `signed long int' can hold. */
>     VK> # if __WORDSIZE == 64
>     VK> # define LONG_MAX 9223372036854775807L
>     VK> # else
>     VK> # define LONG_MAX 2147483647L
>     VK> # endif
>     VK> # define LONG_MIN (-LONG_MAX - 1L)
>
>     VK> and using simple code to test
>     VK> (http://home.att.net/~jackklein/c/inttypes.html#int), my desktop,
>     VK> which is a standard Intel computer, shows:
>
>     VK> Signed long min: -9223372036854775808 max: 9223372036854775807
>
>     yes. I am really embarrassed.
>
>     What I was trying to say was that the definition of int / long / ...
>     should not change when going from a 32-bit architecture to a 64-bit
>     one, and that the R internal structures consequently should also be
>     the same on 32-bit and 64-bit platforms.
>
>     >> If you have too large a numeric matrix, it would be larger than
>     >> 2^31 * 8 bytes ~= 2^34 / 2^20 ~= 16'000 Megabytes.
>     >> If that is only 10% for you, you'd have around 160 GB of
>     >> RAM. That's quite impressive.
>     >>
>     >> cat /proc/meminfo | grep MemTotal
>     VK> MemTotal: 145169248 kB
>
>     VK> We have a "smaller" SGI NUMAflex to play with, where the memory
>     VK> can be increased to 512 GB (the "larger" version doesn't have
>     VK> this "limitation"). But even with commodity hardware you can
>     VK> easily get 128 GB at a reasonable price (e.g. a Dell PowerEdge
>     VK> R900).
>
>     >> Note that R objects are (pointers to) C structs that are
>     >> "well-defined" platform-independently, and I'd say that this
>     >> should remain so.
>
>     VK> I forgot that R stores a two-dimensional array in a
>     VK> one-dimensional C array. Now I understand why there is a
>     VK> limitation on the total number of elements. But this is a big
>     VK> limitation.
>
>     Yes, maybe.
>
>     >> One of the last times this topic came up (within R-core),
>     >> we found that for all the matrix/vector operations,
>     >> we really would need versions of BLAS / LAPACK that would also
>     >> work with these "big" matrices, i.e. such a BLAS/LAPACK would
>     >> also have to internally use "longer int" for indexing.
>     >> At that point in time, we had decided we would at least wait to
>     >> hear about the development of such BLAS/LAPACK libraries.
>
>     VK> BLAS supports a two-dimensional matrix definition, so if we
>     VK> stored a matrix as a two-dimensional object, we would be fine.
>     VK> But then all R code as well as all packages would have to be
>     VK> modified.
>
>     exactly. And that was what I meant when I said "Compatibility".
>
>     But rather than changing the "matrix = columnwise stored as long
>     vector" paradigm, we should rather change from 32-bit indexing to
>     a longer one.
>
>     The hope is that we eventually make up a scheme
>     which would basically allow one to just recompile all packages:
>
>     In src/include/Rinternals.h,
>     we have had the following three lines for several years now:
>     ------------------------------------------------------------------
>     /* type for length of vectors etc */
>     typedef int R_len_t; /* will be long later, LONG64 or ssize_t on Win64 */
>     #define R_LEN_T_MAX INT_MAX
>     ------------------------------------------------------------------
>
>     and you are right that it may be time to experiment a bit more
>     with replacing 'int' with long (and also the corresponding _MAX
>     setting there),
>     and indeed, in the array.c code you cited, we should replace
>     INT_MAX by R_LEN_T_MAX.
>
>     This still does not solve the problem that we'd have to get to
>     a BLAS / LAPACK version that correctly works with "long indices" ...
>     which may (or may not) be easier than I had thought.
>
>     Martin
>
>     ______________________________________________
>     R-devel_at_r-project.org mailing list
>     https://stat.ethz.ch/mailman/listinfo/r-devel

--
Brian D. Ripley,                  ripley_at_stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Sun 03 Aug 2008 - 07:11:36 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.

Archive generated by hypermail 2.2.0, at Mon 04 Aug 2008 - 06:36:24 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.