Re: [Rd] arbitrary size data frame or other stcucts, curious about issues invovled.

From: Mike Marchywka <marchywka_at_hotmail.com>
Date: Tue, 21 Jun 2011 13:10:51 -0400

>
> Mike,
>
> this is all nice, but AFAICS the first part misses the point that there is no 64-bit integer type in the API so there is simply no alternative at the moment. You just said that you don't like it, but you failed to provide a solution ... As for the second part, the idea is not new and is noble, but AFAIK no one so far was able to draft any good proposal as of what the API would look like. It would be very desirable if someone did, though. (BTW your link is useless - linking google searches is pointless as the results vary by request location, user setting etc.).

Taking these in reverse order: the Google link was meant as a convenience for anyone interested, since I could not find a single specific page to link and did not expect much spam in the results ("it's all good"). The results may not be predictable but, just like floating point, they should be close enough for the curious analyst. I'm not trying to provide a solution until I understand the problem.

There are many issues with "big data". I'll try to explain my concerns, but I need to discuss them together to show how they relate and to check whether my understanding of R is correct (before I dig into it, I want to know the right things to look for).

A 32-bit int still has a cardinality in the billions, and the limit on indexes is a separate issue from the limit on memory size. A typical data frame may hold thousands of rows with many columns of mixed type, none being less than 4 bytes of content. So if the goal is simply to avoid using up physical memory, I would not think the 32-bit issue is the limitation; certainly a square array already has a 64-bit pointer to a given element (32*2, LOL). An arbitrary-size frame, up to the limits of the indexing, could easily exceed physical memory, but as I understand it R can bomb at that point, or, even with VM, have speed issues.
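To put rough numbers on the index-versus-memory point, here is a small back-of-the-envelope sketch. It is plain Python rather than R, and the element widths are my assumptions for illustration; it only shows how much memory one vector would need if it were filled right up to a signed 32-bit index limit.

```python
# Back-of-the-envelope: memory needed by ONE vector filled up to the
# largest signed 32-bit index. Element widths are illustrative assumptions.
INT32_MAX = 2**31 - 1   # largest signed 32-bit index
GIB = 2**30             # bytes per GiB


def bytes_at_index_limit(bytes_per_element):
    """Bytes occupied by a vector with INT32_MAX elements."""
    return INT32_MAX * bytes_per_element


for label, width in (("4-byte elements", 4), ("8-byte elements", 8)):
    size_gib = bytes_at_index_limit(width) / GIB
    print(f"{label}: {size_gib:.1f} GiB")
```

This prints 8.0 GiB and 16.0 GiB, so under these assumptions a single vector runs out of typical physical memory well before it runs out of 32-bit indexes.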

Simply being able to select the storage order could be a big deal depending on the access pattern: rows, columns, bit-reversed, etc. This could prevent VM thrashing well before you hit a 32-bit API limit, and it could be transparent apart from adding a new ctor method. And since you may in fact have many large operands, you may want to tell a given df subclass to keep ONLY so much in physical memory at a time. Resource contention and starvation, fighting for food (data), can be a bottleneck.

data.frame( storage="bit_reversed", physical_mem_limit="some absolute or relative thing here").
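The constructor line above is hypothetical; no such arguments exist in R's data.frame as far as I know. To make the storage-order idea concrete, here is a minimal sketch in plain Python (not R) of how the same (row, column) cell maps to different linear offsets under different layouts. The function names and the "row_major"/"column_major" labels are my own, chosen only for illustration.

```python
# Sketch: how storage order changes which linear offsets an access
# pattern touches. All names here are hypothetical illustrations.

def offset(row, col, nrow, ncol, order):
    """Linear offset of cell (row, col) in an nrow x ncol table."""
    if order == "column_major":   # each column is contiguous
        return col * nrow + row
    if order == "row_major":      # each row is contiguous
        return row * ncol + col
    raise ValueError(f"unknown storage order: {order}")


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (FFT-style ordering)."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


NROW, NCOL = 4, 3
# Walk down column 1 under both layouts.
column_walk_cm = [offset(r, 1, NROW, NCOL, "column_major") for r in range(NROW)]
column_walk_rm = [offset(r, 1, NROW, NCOL, "row_major") for r in range(NROW)]
bit_reversed_order = [bit_reverse(i, 3) for i in range(8)]

print(column_walk_cm)      # [4, 5, 6, 7]   contiguous
print(column_walk_rm)      # [1, 4, 7, 10]  strided by NCOL
print(bit_reversed_order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

The column walk is contiguous in one layout and strided in the other, and with large tables that stride is the difference between touching one page and touching one page per element.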

In any case, you may be able to imagine adding something like a paging method to a 32-bit API that would be transparent to small data sets, although I'd have to give it some thought. This would only make sense in cases where accesses tend to occur in blocks, but that could cover a lot of situations.
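As a rough sketch of what I mean by paging behind an ordinary index, here is a toy in plain Python (not R, and not how any existing package does it). The class name, the loader callback, and the least-recently-used eviction policy are all my assumptions; the point is only that block-wise access loads each page once while a bad access pattern thrashes.

```python
# Toy paged vector: at most `max_resident_pages` pages are kept in memory,
# evicting the least recently used page. Everything here is hypothetical.
from collections import OrderedDict


class PagedVector:
    def __init__(self, length, page_size, max_resident_pages, load_page):
        self.length = length
        self.page_size = page_size
        self.max_resident_pages = max_resident_pages
        self.load_page = load_page    # callback: page number -> list of values
        self.loads = 0                # number of backing-store reads so far
        self._pages = OrderedDict()   # resident pages, oldest first

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        page_no, within = divmod(index, self.page_size)
        if page_no in self._pages:
            self._pages.move_to_end(page_no)      # mark as recently used
        else:
            self._pages[page_no] = self.load_page(page_no)
            self.loads += 1
            if len(self._pages) > self.max_resident_pages:
                self._pages.popitem(last=False)   # evict least recently used
        return self._pages[page_no][within]


LENGTH, PAGE_SIZE = 1000, 100


def load_page(page_no):
    """Stand-in for a disk read: element i simply holds the value i."""
    start = page_no * PAGE_SIZE
    return list(range(start, min(start + PAGE_SIZE, LENGTH)))


# Sequential scan: each of the 10 pages is loaded exactly once.
sequential = PagedVector(LENGTH, PAGE_SIZE, 2, load_page)
total = sum(sequential[i] for i in range(LENGTH))

# Cycling over 3 pages with room for only 2: every access misses.
thrash = PagedVector(LENGTH, PAGE_SIZE, 2, load_page)
for i in [0, 100, 200] * 5:
    thrash[i]

print(total, sequential.loads, thrash.loads)
```

A small data set would fit in one resident page and never notice the machinery, which is the sense in which this could be transparent.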

I guess I can look at bigmemory and related classes for some idea of what is going on here.

For purely sequential access I guess I was looking for some kind of streaming data source, in which case anything related to size may be well contained.
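For the purely sequential case, a sketch of the kind of streaming source I have in mind follows. It is plain Python (not R), the function names and chunk size are made up for illustration, and an in-memory text buffer stands in for a large file; only one chunk is ever held at a time, so total data size stops mattering.

```python
# Streaming sketch: consume a numeric source chunk by chunk so that memory
# use depends on chunk_size, not on the total size. Names are hypothetical.
import io


def stream_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size floats parsed from an iterable of lines."""
    chunk = []
    for line in lines:
        chunk.append(float(line))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:                 # final partial chunk
        yield chunk


def streaming_mean(chunks):
    """Mean over all values, keeping only a running total and a count."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else float("nan")


# Stand-in for a large file: the numbers 1..100, one per line.
source = io.StringIO("\n".join(str(i) for i in range(1, 101)))
mean = streaming_mean(stream_chunks(source, chunk_size=8))
print(mean)  # 50.5
```

Anything that can be written as a running accumulation fits this shape; anything needing random access back into earlier chunks does not.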

>
> Cheers,
> Simon
>
>
> On Jun 21, 2011, at 6:33 AM, Mike Marchywka wrote:
>
> > Thanks,
> >
> > http://cran.r-project.org/doc/manuals/R-ints.html#Future-directions
> >
> > Normally I'd take more time to digest these things before commenting but
> > a few things struck me right away. First, use of floating point or double
> > as a replacement for int strikes me as "going the wrong way" as often
> > to get predictable performance you try to tell the compiler you have
> > ints rather than any floating type for which it is free to "round." This
> > is even ignoring any performance issue. The other thing is that scaling
> > should not just be an issue of "make everything bigger" as the growth in
> > both data needs and computer resources is not uniform.
> >
> > I guess my first thought to these constraints and resource issues
> > is to consider a paged dataframe depending upon the point at which
> > the 32-bit int constraint is imposed. A random access data struct
> > does not always get accessed randomly, and often it is purely sequential.
> > Further down the road, it would be nice if algorithms were implemented in a
> > block mode or could communicate their access patterns to the ds or
> > at least tell it to prefetch things that should be needed soon.
> >
> > I guess I'm thinking mostly along the lines of things I've seen from Intel
> > such as ( first things I could find on their site as I have not looked in detail
> > in quite a while),
> >
> >
> > http://www.google.com/search?hl=en&source=hp&q=site%3Aintel.com+performance+optimization
> >
> > as once you get around thrashing virtual memory, you'd like to preserve the
> > lower level memory cache hit rates too etc. These are probably not just niceties,
> > at least with VM, as personally I've seen impl related speed issues make simple analyses impractical.
> >
> >> Subject: RE: arbitrary size data frame or other stcucts, curious about issues invovled.
> >> From: jayemerson_at_gmail.com
> >> To: marchywka_at_hotmail.com; r-devel_at_r-project.org
> >>
> >> Mike,
> >>
> >>
> >> Neither bigmemory nor ff are "drop in" solutions -- though useful,
> >> they are primarily for data storage and management and allowing
> >> convenient access to subsets of the data. Direct analysis of the full
> >> objects via most R functions is not possible. There are many issues
> >> that could be discussed here (and have, previously), including the use
> >> of 32-bit integer indexing. There is a nice section "Future
> >> Directions" in the R Internals manual that you might want to look at.
> >>
> >> Jay
> >>
> >>
> >> ------------------------------------- Original message:
> >>
> >> We keep getting questions on r-help about memory limits and
> >> I was curious to know what issues are involved in making
> >> common classes like dataframe work with disk and intelligent
> >> swapping? That is, sure you can always rely on OS for VM
> >> but in theory it should be possible to make a data structure
> >> that somehow knows what pieces you will access next and
> >> can keep those somewhere fast. Now of course algorithms
> >> "should" act locally and be block oriented but in any case
> >> could communicate with data structures on upcoming
> >> access patterns, see a few ms into the future and have the
> >> right stuff prefetched.
> >>
> >> I think things like "bigmemory" exist but perhaps one
> >> issue was that this could not just drop in for data.frame
> >> or does it already solve all the problems?
> >>
> >> Is memory management just a non-issue or is there something
> >> that needs to be done to make large data structures work well?
> >>
> >>
> >> --
> >> John W. Emerson (Jay)
> >> Associate Professor of Statistics
> >> Department of Statistics
> >> Yale University
> >> http://www.stat.yale.edu/~jay
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-devel_at_r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> >
>
                                               


R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Received on Tue 21 Jun 2011 - 17:41:30 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Wed 22 Jun 2011 - 10:20:22 GMT.
