Re: [Rd] [datatable-help] speeding up perception

From: Matthew Dowle <mdowle_at_mdowle.plus.com>
Date: Tue, 05 Jul 2011 08:32:45 +0100

Simon,

Thanks for the great suggestion. I've written a skeleton assignment function for data.table which incurs no copies, and which works for this case. For completeness, if I understand correctly, this is for: i) the convenience of new users who don't know how to vectorize yet, and ii) more complex examples which can't be vectorized.

Before:

> system.time(for (r in 1:R) DT[r,20] <- 1.0)

   user  system elapsed
 12.792   0.488  13.340

After:

> system.time(for (r in 1:R) DT[r,20] <- 1.0)

   user  system elapsed
  2.908   0.020   2.935

This can be reduced further by calling the method directly:

> system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))

   user  system elapsed
  0.132   0.000   0.131
>
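
For anyone reading without the earlier messages, here is a minimal sketch of the pattern being timed. It uses a plain data.frame rather than a data.table, and the object name `df`, the row count and the 20-column layout are assumptions for illustration, not the setup behind the timings above. It contrasts the element-by-element loop with the vectorized equivalent and with the as.list() round trip mentioned in the quoted thread below.

```r
# Assumed setup: sizes and names are illustrative only.
R  <- 1000L
df <- as.data.frame(matrix(0, nrow = R, ncol = 20L))

# 1) Element-by-element assignment: every iteration dispatches to
#    `[<-.data.frame`, which is the slow path discussed in this thread.
loop_df <- df
for (r in seq_len(R)) loop_df[r, 20] <- 1.0

# 2) Vectorized assignment: one call covers all rows.
vec_df <- df
vec_df[seq_len(R), 20] <- 1.0

# 3) as.list() round trip: loop over a plain list of vectors,
#    then convert back to a data.frame once at the end.
lst <- as.list(df)
for (r in seq_len(R)) lst[[20]][r] <- 1.0
list_df <- as.data.frame(lst)

# All three approaches produce the same column contents.
stopifnot(identical(loop_df[[20]], vec_df[[20]]),
          identical(vec_df[[20]], list_df[[20]]))
```

Wrapping each variant in system.time() shows the relative cost on your own machine; no timings are claimed for this sketch.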

Still working on it. Once it no longer breaks the other data.table tests, I'll commit to R-Forge ...

Matthew

On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:
> Timothée,
>
> On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:
>
> > Hi --
> >
> > It's my first post on this list; as a relatively new user with little
> > knowledge of R internals, I am a bit intimidated by the depth of some
> > of the discussions here, so please spare me if I say something
> > incredibly silly.
> >
> > I feel that someone at this point should mention Matthew Dowle's
> > excellent data.table package
> > (http://cran.r-project.org/web/packages/data.table/index.html) which
> > seems to me to address many of the inefficiencies of data.frame.
> > data.tables have no row names; and operations that only need data from
> > one or two columns are (I believe) just as quick whether the total
> > number of columns is 5 or 1000. This results in very quick operations
> > (and, often, elegant code as well).
> >
>
> I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.
>
> Cheers,
> Simon
>
>
> >
> > On Mon, Jul 4, 2011 at 6:19 AM, ivo welch <ivo.welch_at_gmail.com> wrote:
> >> thank you, simon. this was very interesting indeed. I also now
> >> understand how far out of my depth I am here.
> >>
> >> fortunately, as an end user, obviously, *I* now know how to avoid the
> >> problem. I particularly like the as.list() transformation and back to
> >> as.data.frame() to speed things up without loss of (much)
> >> functionality.
> >>
> >>
> >> more broadly, I view the avoidance of individual access through the
> >> use of apply and vector operations as a mixed "IQ test" and "knowledge
> >> test" (which I often fail). However, even for the most clever, there
> >> are also situations where the KISS programming principle makes
> >> explicit loops still preferable. Personally, I would have preferred
> >> it if R had, in its standard "statistical data set" data structure,
> >> foregone the row names feature in exchange for retaining fast direct
> >> access. R could have reserved its current implementation "with row
> >> names but slow access" for a less common (possibly pseudo-inheriting)
> >> data structure.
> >>
> >>
> >> If end users commonly do iterations over a data frame, which I would
> >> guess to be the case, then the impression of R by (novice) end users
> >> could be greatly enhanced if the extreme penalties could be eliminated
> >> or at least flagged. For example, I wonder if modest special internal
> >> code could store data frames internally and transparently as lists of
> >> vectors UNTIL a row name is assigned to. Easier and uglier, a simple
> >> but specific warning message could be issued with a suggestion if
> >> there is an individual read/write into a data frame ("Warning: data
> >> frames are much slower than lists of vectors for individual element
> >> access").
> >>
> >>
> >> I would also suggest changing the "Introduction to R" 6.3 from "A
> >> data frame may for many purposes be regarded as a matrix with columns
> >> possibly of differing modes and attributes. It may be displayed in
> >> matrix form, and its rows and columns extracted using matrix indexing
> >> conventions." to "A data frame may for many purposes be regarded as a
> >> matrix with columns possibly of differing modes and attributes. It may
> >> be displayed in matrix form, and its rows and columns extracted using
> >> matrix indexing conventions. However, data frames can be much slower
> >> than matrices or even lists of vectors (which, like data frames, can
> >> contain different types of columns) when individual elements need to
> >> be accessed." Reading about it immediately upon introduction could
> >> flag the problem in a more visible manner.
> >>
> >>
> >> regards,
> >>
> >> /iaw
> >>
> >> ______________________________________________
> >> R-devel_at_r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help_at_lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Tue 05 Jul 2011 - 07:35:45 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 12 Jul 2011 - 10:30:08 GMT.