Re: [Rd] [datatable-help] speeding up perception

From: Matthew Dowle <mdowle_at_mdowle.plus.com>
Date: Wed, 06 Jul 2011 09:36:05 +0100

On Tue, 2011-07-05 at 21:11 -0400, Simon Urbanek wrote:
> No subassignment function satisfies that condition, because you can always call them directly. However, that doesn't stop the default method from making that assumption, so I'm not sure it's an issue.
>
> David, Just to clarify - the data frame content is not copied, we are talking about the vector holding columns.

If it is just the vector holding the columns that is copied (and not the columns themselves), why does n make a difference in this test (on R 2.13.0)?

> n = 1000
> x = data.frame(a=1:n,b=1:n)
> system.time(for (i in 1:1000) x[1,1] <- 42L)

   user system elapsed
  0.628 0.000 0.628
> n = 100000
> x = data.frame(a=1:n,b=1:n) # still 2 columns, but longer columns
> system.time(for (i in 1:1000) x[1,1] <- 42L)

   user system elapsed
 20.145 1.232 21.455
>

With $<- :

> n = 1000
> x = data.frame(a=1:n,b=1:n)
> system.time(for (i in 1:1000) x$a[1] <- 42L)

   user system elapsed
  0.304 0.000 0.307
> n = 100000
> x = data.frame(a=1:n,b=1:n)
> system.time(for (i in 1:1000) x$a[1] <- 42L)

   user system elapsed
 37.586 0.388 38.161
>
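
One way to check directly whether the data frame is being duplicated, rather than inferring it from timings, is tracemem(). This is only a sketch: it assumes an R build with memory profiling enabled (hence the capabilities() guard), and the number and format of the lines it prints depend on the R version.

```r
# Sketch: watch for duplications of x during a single subassignment.
# Assumes memory profiling is compiled in; otherwise the trace is skipped.
x <- data.frame(a = 1:100000, b = 1:100000)
traced <- capabilities("profmem")

if (traced) invisible(tracemem(x))  # mark x so duplications are reported
x[1, 1] <- 42L                      # each "tracemem[...]" line printed here is one copy
if (traced) untracemem(x)           # stop tracing
```

Swapping the assignment for x$a[1] <- 42L gives the corresponding check for the $<- case.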

If only the 1st column needs to be copied, because that's the one being assigned to in this test, that magnitude of slowdown doesn't seem consistent with the time of a vector copy of the 1st column:

> n=100000
> v = 1:n
> system.time(for (i in 1:1000) v[1] <- 42L)

   user system elapsed
  0.016 0.000 0.017
> system.time(for (i in 1:1000) {v2=v;v2[1] <- 42L})

   user system elapsed
  1.816 1.076 2.900
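
To put numbers on that comparison, the elapsed times reported above can be turned into a per-assignment cost. The sketch below only does arithmetic on those figures (n = 100000, 1000 iterations each); the variable names are illustrative.

```r
# Elapsed seconds copied from the runs above (n = 100000, 1000 iterations each).
reps <- 1000
elapsed <- c(df_bracket  = 21.455,  # x[1,1] <- 42L on the 2-column data frame
             df_dollar   = 38.161,  # x$a[1] <- 42L on the same data frame
             vector_copy = 2.900)   # {v2 = v; v2[1] <- 42L} on a plain vector

per_call_ms <- elapsed / reps * 1000  # milliseconds per single assignment
ratio <- per_call_ms[c("df_bracket", "df_dollar")] / per_call_ms[["vector_copy"]]
print(round(per_call_ms, 2))
print(round(ratio, 1))
```

So each data frame assignment costs roughly 7 times ([<-) and 13 times ($<-) as much as one full copy of a column-length vector.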

Finally, increasing the number of columns (again, only the 1st is assigned to):

> n=100000
> x = data.frame(rep(list(1:n),100))
> dim(x)

[1] 100000 100
> system.time(for (i in 1:1000) x[1,1] <- 42L)

   user system elapsed
167.974 50.903 219.711
>
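
For anyone wanting to repeat these tests over a range of sizes, the loop can be wrapped in a small helper. This is a sketch only: time_assign and its arguments are made-up names, and the default number of repetitions is lower than the 1000 used above so it finishes quickly.

```r
# Hypothetical helper: time repeated x[1,1] <- 42L on a data frame with
# `ncol` integer columns of length `n`; returns elapsed seconds.
time_assign <- function(n, ncol = 2L, reps = 100L) {
  x <- data.frame(rep(list(1:n), ncol))
  names(x) <- paste0("V", seq_len(ncol))
  system.time(for (i in seq_len(reps)) x[1, 1] <- 42L)[["elapsed"]]
}

# Vary the column length with 2 columns, then the column count at fixed length.
by_rows <- sapply(c(1e3, 1e4, 1e5), time_assign)
by_cols <- sapply(c(2L, 10L, 100L), function(k) time_assign(1e4, ncol = k))
print(by_rows)
print(by_cols)
```

If only the vector holding the columns were copied, by_rows should stay roughly flat as n grows.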

>
> Cheers,
> Simon
>
> Sent from my iPhone
>
> On Jul 5, 2011, at 9:01 PM, David Winsemius <dwinsemius_at_comcast.net> wrote:
>
> >
> > On Jul 5, 2011, at 7:18 PM, <luke-tierney_at_uiowa.edu> <luke-tierney_at_uiowa.edu> wrote:
> >
> >> On Tue, 5 Jul 2011, Simon Urbanek wrote:
> >>
> >>>
> >>> On Jul 5, 2011, at 2:08 PM, Matthew Dowle wrote:
> >>>
> >>>> Simon (and all),
> >>>>
> >>>> I've tried to make assignment as fast as calling `[<-.data.table`
> >>>> directly, for user convenience. Profiling shows (IIUC) that it isn't
> >>>> dispatch, but x being copied. Is there a way to prevent '[<-' from
> >>>> copying x?
> >>>
> >>> Good point, and conceptually, no. It's a subassignment after all - see R-lang 3.4.4 - it is equivalent to
> >>>
> >>> `*tmp*` <- x
> >>> x <- `[<-`(`*tmp*`, i, j, value)
> >>> rm(`*tmp*`)
> >>>
> >>> so there is always a copy involved.
> >>>
> >>> Now, a conceptual copy doesn't mean real copy in R since R tries to keep the pass-by-value illusion while passing references in cases where it knows that modifications cannot occur and/or they are safe. The default subassign method uses that feature which means it can afford to not duplicate if there is only one reference -- then it's safe to not duplicate as we are replacing that only existing reference. And in the case of a matrix, that will be true at the latest from the second subassignment on.
> >>>
> >>> Unfortunately the method dispatch (AFAICS) introduces one more reference in the dispatch chain so there will always be two references so duplication is necessary. Since we have only 0 / 1 / 2+ information on the references, we can't distinguish whether the second reference is due to the dispatch or due to the passed object having more than one reference, so we have to duplicate in any case. That is unfortunate, and I don't see a way around (unless we handle subassignment methods in some special way).
> >>
> >> I don't believe dispatch is bumping NAMED (and a quick experiment
> >> seems to confirm this though I don't guarantee I did that right). The
> >> issue is that a replacement function implemented as a closure, which
> >> is the only option for a package, will always see NAMED on the object
> >> to be modified as 2 (because the value is obtained by forcing the
> >> argument promise) and so any R level assignments will duplicate. This
> >> also isn't really an issue of imprecise reference counting -- there
> >> really are (at least) two legitimate references -- one through the
> >> argument and one through the caller's environment.
> >>
> >> It would be good if we could come up with a way for packages to be
> >> able to define replacement functions that do not duplicate in cases
> >> where we really don't want them to, but this would require coming up
> >> with some sort of protocol, minimally involving an efficient way to
> >> detect whether a replacement function is being called in a replacement
> >> context or directly.
> >
> > Would "$<-" always satisfy that condition? It would be a big help to me if it could be designed to avoid duplicating the rest of the data.frame.
> >
> > --
> >
> >>
> >> There are some replacement functions that use C code to cheat, but
> >> these may create problems if called directly, so I won't advertise
> >> them.
> >>
> >> Best,
> >>
> >> luke
> >>
> >>>
> >>> Cheers,
> >>> Simon
> >>>
> >>>
> >>>
> >>
> >> --
> >> Luke Tierney
> >> Statistics and Actuarial Science
> >> Ralph E. Wareham Professor of Mathematical Sciences
> >> University of Iowa Phone: 319-335-3386
> >> Department of Statistics and Fax: 319-335-3017
> >> Actuarial Science
> >> 241 Schaeffer Hall email: luke_at_stat.uiowa.edu
> >> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
> >> ______________________________________________
> >> R-devel_at_r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > David Winsemius, MD
> > West Hartford, CT
> >
> >



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Received on Wed 06 Jul 2011 - 08:38:23 GMT

