Re: [Rd] shash in unique.c

From: Matthew Dowle <mdowle_at_mdowle.plus.com>
Date: Fri, 05 Mar 2010 12:40:42 +0000

Thanks a lot. Quick and brief responses below...

"Duncan Murdoch" <murdoch_at_stats.uwo.ca> wrote in message news:4B90F134.6070004_at_stats.uwo.ca...
> Matthew Dowle wrote:
>> I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No
>> response resulted in a retry() after a Sys.sleep(10 days).
>>
>> If its a "yes" or "maybe" then I could proceed to try it, test it, and
>> present the test results and timings to you along with the patch. It
>> would be on 32bit Ubuntu first, and I would need to either buy, rent time
>> on, or borrow a 64bit machine to be able to then test there, owing to the
>> nature of the suggestion.
>>
>> If its "no", "bad idea because..." or "we were already working on it, or
>> better", then I won't spend any more time on it.
>>
>> Matthew
>>
>>
>> "Matthew Dowle" <mdowle_at_mdowle.plus.com> wrote in message
>> news:hlu4qh$l7s$1_at_dough.gmane.org...
>>
>>> Looking at shash in unique.c, from R-2.10.1 I'm wondering if it makes
>>> sense to hash the pointer itself rather than the string it points to?
>>> In other words could the SEXP pointer be cast to unsigned int and the
>>> usual scatter be called on that as if it were integer?
>>>
> Two negative but probably not fatal issues:
>
> Pointers and ints are not always the same size. In Win64, ints are 32
> bits, pointers are 64 bits. (Can we be sure there is some integer type
> the same size as a pointer? I don't know, ask a C expert.)
No we can't be sure. But we could test at runtime, and if the assumption wasn't true, then revert to the existing method.

>
> We might want to save the hash to disk. On restore, the pointer based
> hash would be all wrong. (I don't know if we actually do ever save a hash
> to disk. )
The hash table in unique.c appears to be a temporary private hash, different to the global R_StringHash. Its private hash appears to be used only while the call to unique runs, then free'd. Thats my understanding anyway. The suggestion is not to alter the global R_StringHash in any way at all, which is the one that might be saved to disk now or in the future.

>
> Duncan Murdoch
>>> shash would look like a slightly modified version of ihash like this :
>>>
>>> static int shash(SEXP x, int indx, HashData *d)
>>> {
>>> if (STRING_ELT(x,indx) == NA_STRING) return 0;
>>> return scatter((unsigned int) (STRING_ELT(x,indx), d);
>>> }
>>>
>>> rather than its current form which appears to hash the string it points
>>> to :
>>>
>>> static int shash(SEXP x, int indx, HashData *d)
>>> {
>>> unsigned int k;
>>> const char *p;
>>> if(d->useUTF8)
>>> p = translateCharUTF8(STRING_ELT(x, indx));
>>> else
>>> p = translateChar(STRING_ELT(x, indx));
>>> k = 0;
>>> while (*p++)
>>> k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
>>> return scatter(k, d);
>>> }
>>>
>>> Looking at sequal, below, and reading its comments, if the pointers are
>>> equal it doesn't look at the strings they point to, which lead to the
>>> question above.
>>>
>>> static int sequal(SEXP x, int i, SEXP y, int j)
>>> {
>>> if (i < 0 || j < 0) return 0;
>>> /* Two strings which have the same address must be the same,
>>> so avoid looking at the contents */
>>> if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
>>> /* Then if either is NA the other cannot be */
>>> /* Once all CHARSXPs are cached, Seql will handle this */
>>> if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
>>> return 0;
>>> return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
>>> }
>>>
>>> Matthew
>>>
>>>
>>
>> ______________________________________________
>> R-devel_at_r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Fri 05 Mar 2010 - 12:43:28 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 05 Mar 2010 - 18:30:59 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive