Re: [Rd] arbitrary size data frame or other structs, curious about issues involved.

From: Mike Marchywka
Date: Tue, 21 Jun 2011 06:33:23 -0400


Normally I'd take more time to digest these things before commenting, but a few things struck me right away. First, using floating point or double as a replacement for int strikes me as "going the wrong way": to get predictable performance you usually try to tell the compiler you have ints rather than a floating type, which it is free to "round." This is even ignoring any performance issue. Second, scaling should not just be a matter of "make everything bigger," as the growth in data needs and in computer resources is not uniform.
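To make the int-versus-double concern concrete, here is a small Python sketch (Python floats are IEEE 754 doubles). It only illustrates the general fact that a double holds integers exactly up to 2^53 and silently rounds beyond that; the names INT32_MAX, DOUBLE_EXACT_MAX and is_exact_index are made up for this illustration and are not part of R.

```python
# Illustration only: names below are hypothetical, not R internals.
INT32_MAX = 2**31 - 1        # largest index a signed 32-bit int can hold
DOUBLE_EXACT_MAX = 2**53     # doubles represent every integer up to here exactly


def is_exact_index(n: int) -> bool:
    """Return True if integer n survives a round trip through a double."""
    return int(float(n)) == n


def next_index_as_double(i: float) -> float:
    """Advance an index stored as a double; past 2**53 this can silently stall."""
    return i + 1.0
```

Anything just past the 32-bit limit is still exact as a double, so the extra range is real, but an index stored as a double stops advancing by one once it reaches 2^53, and it does so without any error being raised.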

My first thought on these constraints and resource issues is to consider a paged data frame, depending on the point at which the 32-bit int constraint is imposed. A random-access data structure does not always get accessed randomly; often the access is purely sequential. Further down the road, it would be nice if algorithms were implemented in a block mode, or could communicate their access patterns to the data structure, or could at least tell it to prefetch things that should be needed soon.
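A rough sketch of what I mean by a paged column with a prefetch hint, in Python for brevity. Everything here is hypothetical: PagedColumn, load_page, page_size and max_pages are names invented for the illustration, the backing store is just a callable that returns one page, and eviction is plain least-recently-used. It is not how bigmemory, ff or data.frame work.

```python
from collections import OrderedDict


class PagedColumn:
    """Hypothetical column that keeps only a few fixed-size pages in memory."""

    def __init__(self, load_page, length, page_size=4, max_pages=2):
        self._load_page = load_page    # callable: page number -> list of values
        self.length = length
        self.page_size = page_size
        self.max_pages = max_pages
        self._cache = OrderedDict()    # page number -> values, oldest first
        self.loads = 0                 # how many times the backing store was hit

    def _page(self, page_no):
        if page_no in self._cache:
            self._cache.move_to_end(page_no)      # mark as most recently used
        else:
            self._cache[page_no] = self._load_page(page_no)
            self.loads += 1
            if len(self._cache) > self.max_pages:
                self._cache.popitem(last=False)   # evict least recently used
        return self._cache[page_no]

    def prefetch(self, start, stop):
        """Hint from the algorithm: rows start..stop-1 will be needed soon."""
        if stop <= start:
            return
        first = start // self.page_size
        last = (stop - 1) // self.page_size
        for page_no in range(first, last + 1):
            self._page(page_no)

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self._page(i // self.page_size)[i % self.page_size]
```

With a purely sequential scan each page is loaded exactly once, and a prefetch call before a block of work means the reads that follow hit memory rather than the backing store. A hint covering more than max_pages pages would evict its own earlier pages, so a real design would need to bound or schedule the hints.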

I guess I'm thinking mostly along the lines of things I've seen from Intel, such as (first things I could find on their site, as I have not looked in detail in quite a while),

as once you get around thrashing virtual memory, you'd like to preserve the lower-level memory cache hit rates too. These are probably not just niceties, at least with VM, as I have personally seen implementation-related speed issues make simple analyses impractical.
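As an illustration of the block-mode idea, here is a minimal Python sketch of a reduction that walks a column one contiguous block at a time, so only one block needs to be resident at any moment. The names blocks and blockwise_mean and the default block size are assumptions made for the example; the column can be anything that supports len() and slicing.

```python
def blocks(n, block_size):
    """Yield (start, stop) bounds that cover 0..n-1 in contiguous blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, n, block_size):
        yield start, min(start + block_size, n)


def blockwise_mean(column, block_size=1024):
    """Mean computed in one sequential pass, touching one block at a time."""
    total = 0.0
    count = 0
    for start, stop in blocks(len(column), block_size):
        chunk = column[start:stop]   # the only data that must be in memory now
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("mean of an empty column")
    return total / count
```

Because the access pattern is known in advance (block k, then block k+1), this is exactly the kind of loop that could pass a prefetch hint for the next block to the data structure before summing the current one.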

> Subject: RE: arbitrary size data frame or other structs, curious about issues involved.
> From:
> To:;
> Mike,
> Neither bigmemory nor ff are "drop in" solutions -- though useful,
> they are primarily for data storage and management and allowing
> convenient access to subsets of the data. Direct analysis of the full
> objects via most R functions is not possible. There are many issues
> that could be discussed here (and have, previously), including the use
> of 32-bit integer indexing. There is a nice section "Future
> Directions" in the R Internals manual that you might want to look at.
> Jay
> ------------------------------------- Original message:
> We keep getting questions on r-help about memory limits and
> I was curious to know what issues are involved in making
> common classes like dataframe work with disk and intelligent
> swapping? That is, sure you can always rely on OS for VM
> but in theory it should be possible to make a data structure
> that somehow knows what pieces you will access next and
> can keep those somewhere fast. Now of course algorithms
> "should" act locally and be block oriented but in any case
> could communicate with data structures on upcoming
> access patterns, see a few ms into the future and have the
> right stuff prefetched.
> I think things like "bigmemory" exist but perhaps one
> issue was that this could not just drop in for data.frame
> or does it already solve all the problems?
> Is memory management just a non-issue or is there something
> that needs to be done to make large data structures work well?
> --
> John W. Emerson (Jay)
> Associate Professor of Statistics
> Department of Statistics
> Yale University

Received on Tue 21 Jun 2011 - 11:36:57 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 21 Jun 2011 - 17:50:21 GMT.