Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 11 additions & 0 deletions Lib/test/test_dict.py
Original file line number Diff line number Diff line change
Expand Up @@ -1903,6 +1903,17 @@ def test_hash(self):
self.assertEqual(hash(frozendict(x=1, y=2)),
hash(frozendict(y=2, x=1)))

# Check that hash() computes the hash of (key, value) pairs
cases = [
frozendict(a=False, b=True, c=True),
frozendict(a=True, b=False, c=True),
frozendict(a=True, b=True, c=False),
frozendict({False: "a", "b": True, "c": True}),
frozendict({"a": "b", False: True, True: "c"}),
]
hashes = {hash(fd) for fd in cases}
self.assertEqual(len(hashes), len(cases))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps simpler, just verify the PEP 814 guarantee that hash(fd) == hash(frozenset(fd.items()))? There's no strict rule that says none of these hashes can collide, especially with the vagaries of the seeded string hashes, but we can verify that the behavior matches the PEP 814 spec and trust that the frozenset and tuple hash algorithms are adequate. This also provides a good test to catch if someone does update the tuple or frozenset hashing in a way that would break compatibility.


fd = frozendict(x=[1], y=[2])
with self.assertRaisesRegex(TypeError, "unhashable type: 'list'"):
hash(fd)
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
Fix ``hash(frozendict)``: compute the hash of each ``(key, value)`` pair
correctly. Patch by Victor Stinner.
46 changes: 37 additions & 9 deletions Objects/dictobject.c
Original file line number Diff line number Diff line change
Expand Up @@ -8228,6 +8228,40 @@ _shuffle_bits(Py_uhash_t h)
return ((h ^ 89869747UL) ^ (h << 16)) * 3644798167UL;
}

// Compute hash((key, value)).
// Code copied from tuple_hash().
static Py_hash_t
frozendict_pair_hash(PyObject *key, PyObject *value)
{
Py_ssize_t len = 2;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps make this a static const since it's not intended to mutated, it's just a constant? I expect most optimizing compilers to realize they can just inline len regardless (being a local that they can see is never changed), but why not make sure of it, so acc += len ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL); is definitely just an addition, not a couple of xors first?

Py_uhash_t acc = _PyTuple_HASH_XXPRIME_5;

Py_uhash_t lane = PyObject_Hash(key);
if (lane == (Py_uhash_t)-1) {
return -1;
}
acc += lane * _PyTuple_HASH_XXPRIME_2;
acc = _PyTuple_HASH_XXROTATE(acc);
acc *= _PyTuple_HASH_XXPRIME_1;

lane = PyObject_Hash(value);
if (lane == (Py_uhash_t)-1) {
return -1;
}
acc += lane * _PyTuple_HASH_XXPRIME_2;
acc = _PyTuple_HASH_XXROTATE(acc);
acc *= _PyTuple_HASH_XXPRIME_1;

/* Add input length, mangled to keep the historical value of hash(()). */
acc += len ^ (_PyTuple_HASH_XXPRIME_5 ^ 3527539UL);

if (acc == (Py_uhash_t)-1) {
acc = 1546275796;
}
return acc;
}


// Code copied from frozenset_hash()
static Py_hash_t
frozendict_hash(PyObject *op)
Expand All @@ -8244,17 +8278,11 @@ frozendict_hash(PyObject *op)
PyObject *key, *value; // borrowed refs
Py_ssize_t pos = 0;
while (PyDict_Next(op, &pos, &key, &value)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason not to do _PyDict_Next(op, &pos, NULL, &value, &keyhash), avoiding retrieving the key entirely (_PyDict_Next allows passing NULL for the key pointer, value pointer, or hash pointer; PyDict_Next itself is just calling it with NULL for the hash pointer) in favor of solely retrieving its hash? This then simplifies frozendict_pair_hash by letting it take the key hash rather than the key, saving a PyObject_Hash call and the associated error-checking, and letting you directly initialize Py_uhash_t acc = _PyTuple_HASH_XXROTATE(_PyTuple_HASH_XXPRIME_5 + keyhash * _PyTuple_HASHXXPRIME_2) * _PyTuple_HASH_XXPRIME_1; (maybe split up a bit, but you'd have everything you needed for all those steps up front, and could reduce eight lines to just 1-3). Sure, some types store their cached hash internally so PyObject_Hash is low cost for them, but dict's internal iteration guarantees zero cost, so why not take advantage of it?

Py_hash_t key_hash = PyObject_Hash(key);
if (key_hash == -1) {
return -1;
}
hash ^= _shuffle_bits(key_hash);

Py_hash_t value_hash = PyObject_Hash(value);
if (value_hash == -1) {
Py_hash_t pair_hash = frozendict_pair_hash(key, value);
if (pair_hash == -1) {
return -1;
}
hash ^= _shuffle_bits(value_hash);
hash ^= _shuffle_bits(pair_hash);
}

/* Factor in the number of active entries */
Expand Down
3 changes: 3 additions & 0 deletions Objects/tupleobject.c
Original file line number Diff line number Diff line change
Expand Up @@ -363,6 +363,9 @@ tuple_repr(PyObject *self)
https://github.com/Cyan4973/xxHash/blob/master/doc/xxhash_spec.md
The constants for the hash function are defined in pycore_tuple.h.
If you update this code, update also frozendict_pair_hash() which copied
this code.
*/

static Py_hash_t
Expand Down
Loading