Re: [Rd] [datatable-help] speeding up perception

From: <luke-tierney_at_uiowa.edu>
Date: Tue, 05 Jul 2011 18:07:30 -0500

On Tue, 5 Jul 2011, Matthew Dowle wrote:

> Simon (and all),
>
> I've tried to make assignment as fast as calling `[<-.data.table`
> directly, for user convenience. Profiling shows (IIUC) that it isn't
> dispatch, but x being copied. Is there a way to prevent '[<-' from
> copying x? Small reproducible example in vanilla R 2.13.0 :
>
>> x = list(a=1:10000,b=1:10000)
>> class(x) = "newclass"
>> "[<-.newclass" = function(x,i,j,value) x # i.e. do nothing
>> tracemem(x)
> [1] "<0xa1ec758>"
>> x[1,2] = 42L
> tracemem[0xa1ec758 -> 0xa1ec558]: # but, x is still copied, why?
>>

This one is a red herring -- the class(x) <- "newclass" assignment bumps up the NAMED value of x, and as a result the following assignment needs to duplicate it. (The primitive class<- could be modified to avoid the NAMED bump, but it's fairly intricate code, so I'm not going to look into it now.)
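
For anyone who wants to watch this happen, here is a minimal sketch built on the quoted example. It is not a fix. The capabilities("profmem") guard is an addition of mine so that the script still runs on builds without memory profiling. Whether tracemem() reports a duplication, and at which step, depends on the R version and its internals, so no expected output is shown.

```r
# Do-nothing replacement method, as in the quoted example.
`[<-.newclass` <- function(x, i, j, value) x

x <- list(a = 1:10000, b = 1:10000)
class(x) <- "newclass"   # the step said above to bump NAMED on x

# tracemem() only works if R was built with memory profiling.
traced <- capabilities("profmem")
if (traced) tracemem(x)

x[1, 2] <- 42L           # any duplication of x is reported here

if (traced) untracemem(x)

# The method ignores `value`, so the contents are the same either way.
identical(x$a, 1:10000)
```

The last line only confirms that the method left the data alone. The interesting part is whatever tracemem() prints, if anything, at the subassignment.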

[A bit more later in reply to Simon's message]

luke

>
> I've tried returning NULL from [<-.newclass but then x gets assigned
> NULL :
>
>> "[<-.newclass" = function(x,i,j,value) NULL
>> x[1,2] = 42L
> tracemem[0xa1ec558 -> 0x9c5f318]:
>> x
> NULL
>>
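
The NULL result follows from how subassignment is evaluated: x[1,2] = 42L is carried out roughly as x <- `[<-`(x, 1, 2, value = 42L), so whatever the method returns becomes the new x. A small sketch of that equivalence, using a shorter list than the quoted example (the variable y is only there for comparison):

```r
# Method that returns NULL, as in the quoted example.
`[<-.newclass` <- function(x, i, j, value) NULL

x <- structure(list(a = 1:3, b = 4:6), class = "newclass")
y <- x

x[1, 2] <- 42L                      # the usual syntax
y <- `[<-`(y, 1, 2, value = 42L)    # roughly what the evaluator does with it

is.null(x)
is.null(y)
```

Both end up NULL, so a replacement method always has to return the (possibly modified) object.
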
>
> Any pointers much appreciated. If that copy is preventable it should
> save the user needing to use `[<-.data.table`(...) syntax to get the
> best speed (20 times faster on the small example used so far).
>
> Matthew
>
>
> On Tue, 2011-07-05 at 08:32 +0100, Matthew Dowle wrote:
>> Simon,
>>
>> Thanks for the great suggestion. I've written a skeleton assignment
>> function for data.table which incurs no copies, which works for this
>> case. For completeness, if I understand correctly, this is for :
>> i) convenience of new users who don't know how to vectorize yet
>> ii) more complex examples which can't be vectorized.
>>
>> Before:
>>
>> > system.time(for (r in 1:R) DT[r,20] <- 1.0)
>>    user  system elapsed
>>  12.792   0.488  13.340
>>
>> After :
>>
>> > system.time(for (r in 1:R) DT[r,20] <- 1.0)
>>    user  system elapsed
>>   2.908   0.020   2.935
>>
>> Where this can be reduced further as follows :
>>
>> > system.time(for (r in 1:R) `[<-.data.table`(DT,r,2,1.0))
>>    user  system elapsed
>>   0.132   0.000   0.131
>> >
>>
>> Still working on it. When it doesn't break other data.table tests, I'll
>> commit to R-Forge ...
>>
>> Matthew
>>
>>
>> On Mon, 2011-07-04 at 12:41 -0400, Simon Urbanek wrote:
>> > Timothée,
>> >
>> > On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:
>> >
>> > > Hi --
>> > >
>> > > It's my first post on this list; as a relatively new user with little
>> > > knowledge of R internals, I am a bit intimidated by the depth of some
>> > > of the discussions here, so please spare me if I say something
>> > > incredibly silly.
>> > >
>> > > I feel that someone at this point should mention Matthew Dowle's
>> > > excellent data.table package
>> > > (http://cran.r-project.org/web/packages/data.table/index.html) which
>> > > seems to me to address many of the inefficiencies of data.frame.
>> > > data.tables have no row names; and operations that only need data from
>> > > one or two columns are (I believe) just as quick whether the total
>> > > number of columns is 5 or 1000. This results in very quick operations
>> > > (and, often, elegant code as well).
>> > >
>> >
>> > I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand since it simply does a pass-through for subassignments to data frame's methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs subassignment - which is a bit surprising [if I read the code correctly you can't use the same indexing in both]). In fact I would propose that it should not do that but handle the simple cases itself more efficiently without unneeded copies. That would make it indeed a very interesting alternative.
>> >
>> > Cheers,
>> > Simon
>> >
>> >
>> > >
>> > > On Mon, Jul 4, 2011 at 6:19 AM, ivo welch <ivo.welch_at_gmail.com> wrote:
>> > >> thank you, simon. this was very interesting indeed. I also now
>> > >> understand how far out of my depth I am here.
>> > >>
>> > >> fortunately, as an end user, obviously, *I* now know how to avoid the
>> > >> problem. I particularly like the as.list() transformation and back to
>> > >> as.data.frame() to speed things up without loss of (much)
>> > >> functionality.
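
The as.list() round trip mentioned here can be sketched as follows. The code from the earlier message is not quoted in this thread, so the column names, the size n, and the loop body below are assumptions made up for illustration.

```r
n <- 1000L
df <- data.frame(id = seq_len(n), value = numeric(n))

# Work on a plain list of column vectors instead of the data frame.
cols <- as.list(df)
for (r in seq_len(n)) {
  cols$value[r] <- r * 2            # element write with no data.frame method dispatch
}

# Convert back once the loop is done; default row names are regenerated.
df2 <- as.data.frame(cols)
```

Custom row names do not survive the trip through as.list(), which is presumably the "(much)" in "without loss of (much) functionality".
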
>> > >>
>> > >>
>> > >> more broadly, I view the avoidance of individual access through the
>> > >> use of apply and vector operations as a mixed "IQ test" and "knowledge
>> > >> test" (which I often fail). However, even for the most clever, there
>> > >> are also situations where the KISS programming principle makes
>> > >> explicit loops still preferable. Personally, I would have preferred
>> > >> it if R had, in its standard "statistical data set" data structure,
>> > >> foregone the row names feature in exchange for retaining fast direct
>> > >> access. R could have reserved its current implementation "with row
>> > >> names but slow access" for a less common (possibly pseudo-inheriting)
>> > >> data structure.
>> > >>
>> > >>
>> > >> If end users commonly do iterations over a data frame, which I would
>> > >> guess to be the case, then the impression of R by (novice) end users
>> > >> could be greatly enhanced if the extreme penalties could be eliminated
>> > >> or at least flagged. For example, I wonder if modest special internal
>> > >> code could store data frames internally and transparently as lists of
>> > >> vectors UNTIL a row name is assigned to. Easier and uglier, a simple
>> > >> but specific warning message could be issued with a suggestion if
>> > >> there is an individual read/write into a data frame ("Warning: data
>> > >> frames are much slower than lists of vectors for individual element
>> > >> access").
>> > >>
>> > >>
>> > >> I would also suggest changing the "Introduction to R" 6.3 from "A
>> > >> data frame may for many purposes be regarded as a matrix with columns
>> > >> possibly of differing modes and attributes. It may be displayed in
>> > >> matrix form, and its rows and columns extracted using matrix indexing
>> > >> conventions." to "A data frame may for many purposes be regarded as a
>> > >> matrix with columns possibly of differing modes and attributes. It may
>> > >> be displayed in matrix form, and its rows and columns extracted using
>> > >> matrix indexing conventions. However, data frames can be much slower
>> > >> than matrices or even lists of vectors (which, like data frames, can
>> > >> contain different types of columns) when individual elements need to
>> > >> be accessed." Reading about it immediately upon introduction could
>> > >> flag the problem in a more visible manner.
>> > >>
>> > >>
>> > >> regards,
>> > >>
>> > >> /iaw
>> > >>
>> > >> ______________________________________________
>> > >> R-devel_at_r-project.org mailing list
>> > >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> > >>
>> > >
>> > >
>> >
>> > _______________________________________________
>> > datatable-help mailing list
>> > datatable-help_at_lists.r-forge.r-project.org
>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>

-- 
Luke Tierney
Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:      luke_at_stat.uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu



Received on Tue 05 Jul 2011 - 23:10:14 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 05 Jul 2011 - 23:50:06 GMT.