Re: [Rd] Some R questions

From: miguel manese <>
Date: Thu 02 Nov 2006 - 03:07:19 GMT

On 11/2/06, Vladimir Dergachev <> wrote:
> On Tuesday 31 October 2006 9:30 pm, miguel manese wrote:
> The slowness manifests itself for vectorized code as well. I believe it is due
> to the code mucking about with row.names attribute which introduces a penalty
> on any [,] operation - penalty that grows linearly with the number of rows.
> Thus for large data frames A[,1] is slower than A[[1]]. For example, for the
> data frame I mentioned above E<-A[[1]] took 0.46 seconds (way too much in my
> opinion), but E<-A[,1] took 62.45 seconds - more than a minute and more than
> twice the time it took to load the entire thing into memory. Silly, isn't
> it ?
> Also, there are good reasons to want to address individual cells. And there is
> no reason why such access cannot be constant time.
Yeah, it should be O(1), because a data frame is just a list of vectors and everything is in memory: index the column in the list, then the row in the vector. For non-vectorized code, the problem is more the loop overhead (maintaining loop variables), which is handled in R instead of in C.
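
To make the "column in the list, then row in the vector" idea concrete, here is a minimal sketch on an ordinary data.frame. The data frame `A` below is a made-up stand-in (its size and column names are assumptions, not the data set discussed in this thread), and timings will differ by machine and R version, so none are quoted.

```r
# Toy data frame; size and column names are assumptions for illustration only.
n <- 1e5
A <- data.frame(x = runif(n),
                y = sample(letters, n, replace = TRUE),
                stringsAsFactors = FALSE)

# Three ways to pull out the first column; all return the same vector.
e1 <- A[[1]]   # list-style extraction
e2 <- A[, 1]   # matrix-style extraction, dispatches to `[.data.frame`
e3 <- A$x      # extraction by name

# Single cell: index the column in the list, then the row in the vector.
i <- 12345L
cell_list   <- A[[1]][i]
cell_matrix <- A[i, 1]

# Both routes give the same answers; only the cost differs.
stopifnot(identical(e1, e2), identical(e1, e3),
          identical(cell_list, cell_matrix))

# Rough comparison of repeated single-cell access.
print(system.time(for (k in 1:1000) A[[1]][i]))
print(system.time(for (k in 1:1000) A[i, 1]))
```

The list-style form skips the `[.data.frame` method entirely, which is why it is the usual workaround when individual cells must be read inside a loop.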

> > <pimp-my-project>
> > Or, you may just use (and pour your effort on improving) SQLiteDF
> >
> > </pimp-my-project>
> Very nice ! The documentation mentioned something about assignment operator
> not working - is this still true ? Or, maybe, I misunderstood something ?
Yes, unfortunately, there is still no [<- operator. There are as many ways to mutate a data frame as there are ways to index (or subscript) it. There are many other things more "fun" to code than that (graphics!, extending sqlite syntax, R expression evaluation), but I'd do that on the weekend.

> Also, I wonder whether it would be possible to extend [[ operator so one can
> run queries: SQLDF[["SELECT * FROM a WHERE.."]]
That has been suggested before, but in retrospect this can be achieved more "poetically" as

sdf[sdf$a > 3 & sdf$b == "i", ] # where a > 3 and b == 'i' (element-wise &, not &&)

although not as efficient. I have been thinking of adding a method like

select(sdf, select=<select_clause>, where=<where_clause>, order_by=<order_by_clause>)

so that sum(sdf$a) can just be done with select(sdf, "sum(a)") instead of going through .Call("..."). It can also optimize things: for example, with(sdf, a+b) can be done with select(sdf, "a+b").
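
As a rough sketch of how such a method might turn its arguments into a query, the helper below only assembles the SQL text. `build_select` is a hypothetical name and is not part of SQLiteDF; it does no quoting or escaping, and how the resulting string would be handed to SQLite is left out.

```r
# Hypothetical helper (not part of SQLiteDF): builds the SQL text that a
# select() method could send to SQLite. Clauses are pasted in verbatim,
# so nothing is quoted or escaped here.
build_select <- function(table, select = "*", where = NULL, order_by = NULL) {
  sql <- paste("SELECT", select, "FROM", table)
  if (!is.null(where))    sql <- paste(sql, "WHERE", where)
  if (!is.null(order_by)) sql <- paste(sql, "ORDER BY", order_by)
  sql
}

# The two cases mentioned above, using a table assumed to be named "sdf".
q1 <- build_select("sdf", "sum(a)")
q2 <- build_select("sdf", "a+b", where = "a > 3 AND b = 'i'", order_by = "a")
print(q1)
print(q2)
```

Pushing the aggregate or the arithmetic into the query string lets SQLite evaluate it directly, so the rows never have to be pulled into R first.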

M. Manese

Received on Fri Nov 03 06:48:41 2006

