Re: [Rd] dict package: dictionary data structure for R

From: Bill Dunlap <bill_at_insightful.com>
Date: Sun, 22 Jul 2007 11:41:51 -0700 (PDT)

On Sat, 21 Jul 2007, Seth Falcon wrote:

> In Bioconductor, we have many hashtables where the key is an
> Affymetrix probeset ID. These look sort of like "1000_at". It turns
> out that the algorithm used by R's environments is not very good at
> hashing these values. The dict package lets you investigate this:
>
> library("dict")
> keys2 = paste(seq(1000, length=13000), "at", sep="_")
>
> # here, hash.alg=0L corresponds to the hashing function used by R's
> # environments. I know, a name would be better.
> > summary(as.integer(table(hashCodes(keys=keys2, hash.alg=0L, size=2^14))))
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 800 1100 1500 1625 2025 2700
> # hash.alg=1L is djb2 from here: http://www.cse.yorku.ca/~oz/hash.html
> > summary(as.integer(table(hashCodes(keys=keys2, hash.alg=1L, size=2^14))))
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 1.000 1.000 2.000 1.648 2.000 4.000
>
> # and this is what we see with an environment:
> > e = new.env(hash=T, size=2^14)
> > for (k in keys2) e[[k]] = k
> > summary(env.profile(e)$counts)
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 0.0000 0.0000 0.0000 0.7935 0.0000 2700.0000

With environments, if you use a prime number for the size you get considerably better results. E.g.,

> f <- function(size, keys2 = paste(seq(1000, length=13000), "at", sep="_")){

      if (!missing(size))
         e <- new.env(hash=T, size = size)
      else
         e <- new.env(hash=T)
      for(k in keys2) e[[k]] <- k
      table(env.profile(e)$counts)

}
> f(size=2^14)

    0 800 1200 1800 2700
16376 2 2 2 2
> f(16411) # next prime above 2^14

   0 1 2 3
7644 5110 3081 576
> f() # let new.env pick a size

   0 1 2 3 4 5
6514 1486 2717 1608 289 20

Perhaps new.env() should push the requested size up to the next prime by default.

(This is not to say your other changes are not improvements.)



Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and do  not necessarily reflect Insightful Corporation policy or position."



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Sun 22 Jul 2007 - 18:45:51 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Mon 23 Jul 2007 - 22:36:36 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.