Re: [Rd] shash in unique.c

From: Matthew Dowle <>
Date: Fri, 05 Mar 2010 12:40:42 +0000

Thanks a lot. Quick and brief responses below...

"Duncan Murdoch" <> wrote in message
> Matthew Dowle wrote:
>> I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No
>> response resulted in a retry() after a Sys.sleep(10 days).
>> If its a "yes" or "maybe" then I could proceed to try it, test it, and
>> present the test results and timings to you along with the patch. It
>> would be on 32bit Ubuntu first, and I would need to either buy, rent time
>> on, or borrow a 64bit machine to be able to then test there, owing to the
>> nature of the suggestion.
>> If its "no", "bad idea because..." or "we were already working on it, or
>> better", then I won't spend any more time on it.
>> Matthew
>> "Matthew Dowle" <> wrote in message
>> news:hlu4qh$l7s$
>>> Looking at shash in unique.c, from R-2.10.1 I'm wondering if it makes
>>> sense to hash the pointer itself rather than the string it points to?
>>> In other words could the SEXP pointer be cast to unsigned int and the
>>> usual scatter be called on that as if it were integer?
> Two negative but probably not fatal issues:
> Pointers and ints are not always the same size. In Win64, ints are 32
> bits, pointers are 64 bits. (Can we be sure there is some integer type
> the same size as a pointer? I don't know, ask a C expert.)
No we can't be sure. But we could test at runtime, and if the assumption wasn't true, then revert to the existing method.

> We might want to save the hash to disk. On restore, the pointer based
> hash would be all wrong. (I don't know if we actually do ever save a hash
> to disk. )
The hash table in unique.c appears to be a temporary private hash, different to the global R_StringHash. Its private hash appears to be used only while the call to unique runs, then free'd. Thats my understanding anyway. The suggestion is not to alter the global R_StringHash in any way at all, which is the one that might be saved to disk now or in the future.

> Duncan Murdoch
>>> shash would look like a slightly modified version of ihash like this :
>>> static int shash(SEXP x, int indx, HashData *d)
>>> {
>>> if (STRING_ELT(x,indx) == NA_STRING) return 0;
>>> return scatter((unsigned int) (STRING_ELT(x,indx), d);
>>> }
>>> rather than its current form which appears to hash the string it points
>>> to :
>>> static int shash(SEXP x, int indx, HashData *d)
>>> {
>>> unsigned int k;
>>> const char *p;
>>> if(d->useUTF8)
>>> p = translateCharUTF8(STRING_ELT(x, indx));
>>> else
>>> p = translateChar(STRING_ELT(x, indx));
>>> k = 0;
>>> while (*p++)
>>> k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
>>> return scatter(k, d);
>>> }
>>> Looking at sequal, below, and reading its comments, if the pointers are
>>> equal it doesn't look at the strings they point to, which lead to the
>>> question above.
>>> static int sequal(SEXP x, int i, SEXP y, int j)
>>> {
>>> if (i < 0 || j < 0) return 0;
>>> /* Two strings which have the same address must be the same,
>>> so avoid looking at the contents */
>>> if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
>>> /* Then if either is NA the other cannot be */
>>> /* Once all CHARSXPs are cached, Seql will handle this */
>>> if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
>>> return 0;
>>> return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
>>> }
>>> Matthew
>> ______________________________________________
>> mailing list
> mailing list Received on Fri 05 Mar 2010 - 12:43:28 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 05 Mar 2010 - 18:30:59 GMT.

Mailing list information is available at Please read the posting guide before posting to the list.

list of date sections of archive