Re: [R] Dynamic Dictionary Data Type?

From: Duncan Murdoch <murdoch_at_stats.uwo.ca>
Date: Fri 03 Jun 2005 - 05:37:29 EST

On 6/2/2005 2:40 PM, Prof Brian Ripley wrote:
> On Thu, 2 Jun 2005, Duncan Murdoch wrote:
>

>> Gabor Grothendieck wrote:
>>> On 6/2/05, Prof Brian Ripley <ripley@stats.ox.ac.uk> wrote:
>>> 
>>>> On Thu, 2 Jun 2005, hadley wickham wrote:
>>>> 
>>>> 
>>>>>> An environment is a hash table, and copying occurs only on modification,
>>>>>> when any language would have to copy in this context.
>>>> 
>>>> Caveat: the default is new.env(hash=FALSE), so an environment is a hash
>>>> table in one sense but not necessarily so in the strict sense.
>>> 
>>> 
>>> Can you expand on this?  When would one use hash = TRUE vs.
>>> the default?
>>
>> It's an optimization question.  hash = TRUE uses more memory and takes longer 
>> to set up, but will give faster searches in big environments. The current 
>> default assumes that environments are small so the effort of building the 
>> hash table is not worthwhile, and it does linear searches.

>
> It's not really size: building small hash tables is quick. The issue is
> more to do with whether there are many lookups done compared to entries.
>
> We met the same issues for a named vector a while back. The relevant NEWS
> item was
>
> o Indexing a vector by a character vector was slow if both the
> vector and index were long (say 10,000). Now hashing is used
> and the time should be linear in the longer of the lengths
> (but more memory is used).
>
>
>> I suspect that we might have the default wrong (or perhaps should make the 
>> transition from linear to hash search automatically at a certain threshold 
>> size), but I haven't done the testing necessary to show this.

>
> Here's an example
>
> tr <- as.character(trunc(runif(1e5, max=100)))
> system.time({
> env <- new.env(hash=F)
> for(i in 1:1e5) assign(tr[i], i, envir=env)
> }, gcFirst=TRUE)
>
> which takes about 5% less with hashing. Now change to max=1e4: the hashed
> version takes about 50% longer, the unhashed one 120x.
>

That tests lots of lookups with fairly big tables. I just ran the following code to check on relatively few lookups in small tables:

result <- matrix(NA, 10,10)

timeit <- function(lookups, entries, reps=1000) {

     tr <- as.character(trunc(runif(lookups, max=entries)))
     hash <- system.time({
     	for (n in 1:reps) {
     	    env <- new.env(hash=T)
     	    for(i in 1:lookups) assign(tr[i], i, envir=env)
       }
     }, gcFirst=TRUE)[1]
     nohash <- system.time({
     	for (n in 1:reps) {
     	    env <- new.env(hash=F)
     	    for(i in 1:lookups) assign(tr[i], i, envir=env)
     	}
     }, gcFirst=TRUE)[1]
     hash/nohash

}

for (i in 1:10) for (j in 1:10) result[i,j] <- timeit(2^i,2^j)

This gives the ratio of times from hashed versus non-hashed for 2^i lookups from 2^j possible names. The results were as follows:

 > round(result, 3)

        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

  [1,] 1.000 0.333 1.000 1.000 1.000 1.000 0.500 3.000 1.000 0.500
  [2,] 1.000 1.000 0.750 1.000 1.000 1.333 1.000 1.000 1.000 1.000
  [3,] 0.800 0.600 1.000 1.000 0.800 1.000 0.800 1.000 1.000 1.000
  [4,] 1.000 0.800 1.143 1.125 0.889 1.000 1.000 1.000 1.000 0.700
  [5,] 1.067 0.938 1.143 1.000 0.941 1.000 1.063 1.067 0.938 1.143
  [6,] 1.000 1.034 0.966 0.838 1.000 1.031 1.000 1.000 1.091 1.065
  [7,] 1.000 1.000 0.891 0.952 0.984 0.970 0.957 0.960 1.041 0.986
  [8,] 1.009 0.992 0.992 0.984 0.904 0.928 0.906 0.905 0.900 0.898
  [9,] 1.040 1.008 0.967 0.972 0.957 0.922 0.884 0.778 0.810 0.789
[10,] 1.022 0.998 1.002 0.976 0.930 0.886 0.799 0.759 0.723 0.657

So hashing in the small tables doesn't look much slower (maybe 5%? I need more reps...) than linear lookup, even taking into account the hash table creation. I think we should switch to hash=TRUE as our default.

Duncan Murdoch



R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html Received on Fri Jun 03 05:49:06 2005

This archive was generated by hypermail 2.1.8 : Fri 03 Mar 2006 - 03:32:22 EST