Re: [Rd] shash in unique.c

From: Duncan Murdoch <murdoch_at_stats.uwo.ca>
Date: Fri, 05 Mar 2010 06:55:32 -0500

Matthew Dowle wrote:
> I was hoping for a 'yes', 'no', 'maybe' or 'bad idea because ...'. No
> response resulted in a retry() after a Sys.sleep(10 days).
>
> If its a "yes" or "maybe" then I could proceed to try it, test it, and
> present the test results and timings to you along with the patch. It would
> be on 32bit Ubuntu first, and I would need to either buy, rent time on, or
> borrow a 64bit machine to be able to then test there, owing to the nature of
> the suggestion.
>
> If its "no", "bad idea because..." or "we were already working on it, or
> better", then I won't spend any more time on it.
>
> Matthew
>
>
> "Matthew Dowle" <mdowle_at_mdowle.plus.com> wrote in message
> news:hlu4qh$l7s$1_at_dough.gmane.org...
>
>> Looking at shash in unique.c, from R-2.10.1 I'm wondering if it makes
>> sense to hash the pointer itself rather than the string it points to?
>> In other words could the SEXP pointer be cast to unsigned int and the
>> usual scatter be called on that as if it were integer?
>>
Two negative but probably not fatal issues:

Pointers and ints are not always the same size. In Win64, ints are 32 bits, pointers are 64 bits. (Can we be sure there is some integer type the same size as a pointer? I don't know, ask a C expert.)

We might want to save the hash to disk. On restore, the pointer based hash would be all wrong. (I don't know if we actually do ever save a hash to disk. )

Duncan Murdoch
>> shash would look like a slightly modified version of ihash like this :
>>
>> static int shash(SEXP x, int indx, HashData *d)
>> {
>> if (STRING_ELT(x,indx) == NA_STRING) return 0;
>> return scatter((unsigned int) (STRING_ELT(x,indx), d);
>> }
>>
>> rather than its current form which appears to hash the string it points to
>> :
>>
>> static int shash(SEXP x, int indx, HashData *d)
>> {
>> unsigned int k;
>> const char *p;
>> if(d->useUTF8)
>> p = translateCharUTF8(STRING_ELT(x, indx));
>> else
>> p = translateChar(STRING_ELT(x, indx));
>> k = 0;
>> while (*p++)
>> k = 11 * k + *p; /* was 8 but 11 isn't a power of 2 */
>> return scatter(k, d);
>> }
>>
>> Looking at sequal, below, and reading its comments, if the pointers are
>> equal it doesn't look at the strings they point to, which lead to the
>> question above.
>>
>> static int sequal(SEXP x, int i, SEXP y, int j)
>> {
>> if (i < 0 || j < 0) return 0;
>> /* Two strings which have the same address must be the same,
>> so avoid looking at the contents */
>> if (STRING_ELT(x, i) == STRING_ELT(y, j)) return 1;
>> /* Then if either is NA the other cannot be */
>> /* Once all CHARSXPs are cached, Seql will handle this */
>> if (STRING_ELT(x, i) == NA_STRING || STRING_ELT(y, j) == NA_STRING)
>> return 0;
>> return Seql(STRING_ELT(x, i), STRING_ELT(y, j));
>> }
>>
>> Matthew
>>
>>
>
> ______________________________________________
> R-devel_at_r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>



R-devel_at_r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel Received on Fri 05 Mar 2010 - 11:58:18 GMT

Archive maintained by Robert King, hosted by the discipline of statistics at the University of Newcastle, Australia.
Archive generated by hypermail 2.2.0, at Fri 05 Mar 2010 - 13:40:57 GMT.

Mailing list information is available at https://stat.ethz.ch/mailman/listinfo/r-devel. Please read the posting guide before posting to the list.

list of date sections of archive