Hash-based ordering discussion

ggevay · ggevay · commit dd26398b0970 · 2023-03-13T12:28:11.000+01:00
diff --git a/doc/developer/design/20230110_window_functions.md b/doc/developer/design/20230110_window_functions.md
@@ -337,6 +337,11 @@ A better way to solve the problem is to fix a deterministic order of rows inside
 Hash collisions will be resolved by an extra Reduce beforehand, which groups by hash value, and adds a few more bits (e.g., 8) to differentiate records within a collision group. If the collision resolution bits are not enough, i.e., there is a hash value that occurs more times than is representable by the collision resolution bits, then we error out.
 Therefore, we'll have to determine the exact number of bits of the hash function's output as well as the number of collision resolution bits in a way that the chances of the collision resolution bits not being enough will be astronomically small for any realistically sized peer groups. My intuition is that 32 bits of hash + 8 bits of collision resolution are enough for peer groups of hundreds of millions, but [I'll make an exact calculation](https://oeis.org/A225871).
 
+Ordering based on hash values is discouraged in general, because of the danger of order changes between different Materialize versions. However, in this particular situation, the benefits seem to outweigh the potential issues.
+First, note that in this situation, changing hashes won't cause plan changes, and thus sudden plan regressions are not possible. What changing hashes _can_ cause here is changed output (e.g., LAG grabbing a different value from a previous row). Changing outputs are going to be a fact of life for a long time for other reasons as well (e.g., fixing bugs in any part of the system), and therefore the system should, in general, be well-prepared for them. (For example, this is the reason why the persist sink [was designed to be self-correcting](https://www.notion.so/materialize/distributed-self-correcting-persist_sink-d3d59834ed9d47d397143c738e9d6c9d).) Also note that even the `Ord` of `Datum` is not perfectly stable: [it has happened before](https://github.com/MaterializeInc/materialize/pull/16810) that it changed between Materialize versions.
+
+Still, we should make a reasonable effort to keep `Datum` hashes stable. An extreme approach would be to add a manually maintained hash function to `Datum`, and then commit to keeping it stable across internal representation changes of `Datum`. I think we shouldn't do this at this point in time, because it would introduce an undue maintenance burden. At the other end of the spectrum of possible hash functions would be simply relying on the derived hash function of the standard library. However, the standard hashes can change very often, even when the internal representation of `Datum` doesn't change, merely due to, e.g., compiler version changes. A middle-ground solution would be to use the [stable_hash](https://docs.rs/stable-hash/latest/stable_hash/) library. This avoids changes "across minor versions of this library, even when the compiler, process, architecture, or std lib does change", as well as for certain very simple schema changes.
+
 ### ORDER BY types
 
 Our prefix sum algorithm operates with indexes that are fixed-length bit vectors, which is a fundamental limitation of the algorithm. (The current implementation has `usize` hardcoded. We will generalize this to longer bit vectors, but they will still have to be fixed-length.) Therefore, any type that we would like to support in the ORDER BY clause of a window function executed by prefix sum will need to be mapped to fixed-length bit vectors. This unfortunately means that variable-length types, such as String, Array, List, Map, Bytes, won't be supported by prefix sum. For such types, we will fall back to the old, naive rendering (ideally, with a warning printed to the user, and possibly a Sentry log).
@@ -582,8 +587,6 @@ There are many window functions, and many frame options. We will gradually add t
 
 # Open questions
 
-Is it ok that the order within a peer group will be determined by hashes that might be hard to keep stable between versions?
-
 We should check that there is correct parallelization inside window partitions.
 
 How to have automated performance tests? How can we check in Testdrive that some materialized view (that has window functions) is being updated fast enough? (This is not critical for the first version; we'll use manual performance tests.)