Re: [Rd] speeding up perception

From: Simon Urbanek <simon.urbanek_at_r-project.org>
Date: Mon, 04 Jul 2011 12:41:44 -0400

Timothée,

On Jul 4, 2011, at 2:47 AM, Timothée Carayol wrote:

> Hi --
> 
> It's my first post on this list; as a relatively new user with little
> knowledge of R internals, I am a bit intimidated by the depth of some
> of the discussions here, so please spare me if I say something
> incredibly silly.
> 
> I feel that someone at this point should mention Matthew Dowle's
> excellent data.table package
> (http://cran.r-project.org/web/packages/data.table/index.html) which
> seems to me to address many of the inefficiencies of data.frame.
> data.tables have no row names; and operations that only need data from
> one or two columns are (I believe) just as quick whether the total
> number of columns is 5 or 1000. This results in very quick operations
> (and, often, elegant code as well).
> 

I agree that data.table is a very good alternative (for other reasons) that should be promoted more. The only slight snag is that it doesn't help with the issue at hand, since it simply passes subassignments through to the data frame methods and thus suffers from the same problems (in fact there is a rather stark asymmetry in how it handles subsetting vs. subassignment, which is a bit surprising [if I read the code correctly, you can't use the same indexing in both]). I would propose that it should not do that, but instead handle the simple cases itself more efficiently, without unneeded copies. That would indeed make it a very interesting alternative.
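
For readers who want to see the subassignment cost discussed in this thread, here is a rough sketch in base R. It compares element-wise assignment directly on a data frame with the as.list() / as.data.frame() round trip mentioned in the quoted message further down. The size n, the column names x and y, and the helper names fill_df and fill_list are made up for illustration; timings will vary by machine and R version, so none are shown.

```r
# Hypothetical example data; n and the column names are arbitrary choices.
n <- 2000
df <- data.frame(x = numeric(n), y = seq_len(n))

# Element-wise subassignment directly on the data frame:
# every iteration goes through the `[<-.data.frame` method.
fill_df <- function(d) {
  for (i in seq_len(nrow(d))) d[i, "x"] <- d[i, "y"] * 2
  d
}

# The same loop on a plain list of vectors, converted back at the end.
fill_list <- function(d) {
  l <- as.list(d)
  for (i in seq_along(l$y)) l$x[i] <- l$y[i] * 2
  as.data.frame(l)
}

t_df   <- system.time(res_df   <- fill_df(df))
t_list <- system.time(res_list <- fill_list(df))

# Elapsed seconds for each variant (machine dependent).
print(c(data_frame = t_df[["elapsed"]], list = t_list[["elapsed"]]))

# Both variants should compute the same column values.
print(isTRUE(all.equal(res_df$x, res_list$x)))
```

The round trip only pays off when many individual elements are written; for a single vectorised assignment such as df$x <- df$y * 2 neither loop is needed.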

Cheers,
Simon

> 
> On Mon, Jul 4, 2011 at 6:19 AM, ivo welch <ivo.welch_at_gmail.com> wrote:

>> thank you, simon. this was very interesting indeed. I also now
>> understand how far out of my depth I am here.
>>
>> fortunately, as an end user, obviously, *I* now know how to avoid the
>> problem. I particularly like the as.list() transformation and back to
>> as.data.frame() to speed things up without loss of (much)
>> functionality.
>>
>>
>> more broadly, I view the avoidance of individual access through the
>> use of apply and vector operations as a mixed "IQ test" and "knowledge
>> test" (which I often fail). However, even for the most clever, there
>> are also situations where the KISS programming principle makes
>> explicit loops still preferable. Personally, I would have preferred
>> it if R had, in its standard "statistical data set" data structure,
>> forgone the row names feature in exchange for retaining fast direct
>> access. R could have reserved its current implementation "with row
>> names but slow access" for a less common (possibly pseudo-inheriting)
>> data structure.
>>
>>
>> If end users commonly do iterations over a data frame, which I would
>> guess to be the case, then the impression of R by (novice) end users
>> could be greatly enhanced if the extreme penalties could be eliminated
>> or at least flagged. For example, I wonder if modest special internal
>> code could store data frames internally and transparently as lists of
>> vectors UNTIL a row name is assigned. Easier and uglier, a simple
>> but specific warning message could be issued with a suggestion if
>> there is an individual read/write into a data frame ("Warning: data
>> frames are much slower than lists of vectors for individual element
>> access").
>>
>>
>> I would also suggest changing the "Introduction to R" 6.3 from "A
>> data frame may for many purposes be regarded as a matrix with columns
>> possibly of differing modes and attributes. It may be displayed in
>> matrix form, and its rows and columns extracted using matrix indexing
>> conventions." to "A data frame may for many purposes be regarded as a
>> matrix with columns possibly of differing modes and attributes. It may
>> be displayed in matrix form, and its rows and columns extracted using
>> matrix indexing conventions. However, data frames can be much slower
>> than matrices or even lists of vectors (which, like data frames, can
>> contain different types of columns) when individual elements need to
>> be accessed." Reading about it immediately upon introduction could
>> flag the problem in a more visible manner.
>>
>>
>> regards,
>>
>> /iaw
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
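
The claim in the quoted message above, that reading individual elements from a data frame is slower than reading them from a list of vectors, can be checked with a small sketch like the following. It is only an illustration under assumed inputs: the size n, the columns x and g, and the three helper functions are hypothetical, and no timings are quoted because they depend on the machine and R version.

```r
set.seed(1)
n <- 5000
# Hypothetical data: one numeric and one character column.
df  <- data.frame(x = runif(n), g = sample(letters, n, replace = TRUE))
lst <- as.list(df)
idx <- seq_len(n)

# Matrix-style indexing on the data frame (dispatches to `[.data.frame`).
sum_df_matrix <- function() { s <- 0; for (i in idx) s <- s + df[i, "x"]; s }

# Extract the column first, then index the plain vector.
sum_df_column <- function() { s <- 0; for (i in idx) s <- s + df$x[i]; s }

# Same access pattern on a plain list of vectors.
sum_list <- function() { s <- 0; for (i in idx) s <- s + lst$x[i]; s }

t_matrix <- system.time(s_matrix <- sum_df_matrix())
t_column <- system.time(s_column <- sum_df_column())
t_list   <- system.time(s_list   <- sum_list())

# Elapsed seconds per access style (machine dependent).
print(c(df_matrix = t_matrix[["elapsed"]],
        df_column = t_column[["elapsed"]],
        list      = t_list[["elapsed"]]))
```

All three loops add the same numbers in the same order, so they should return the same sum; only the access path differs.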

Received on Mon 04 Jul 2011 - 16:45:09 GMT


Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Tue 05 Jul 2011 - 11:00:06 GMT.
