doc/developer/design/20230110_window_functions.md
5 additions & 2 deletions
@@ -337,6 +337,11 @@ A better way to solve the problem is to fix a deterministic order of rows inside
Hash collisions will be resolved by an extra Reduce beforehand, which groups by hash value, and adds a few more bits (e.g., 8) to differentiate records within a collision group. If the collision resolution bits are not enough, i.e., there is a hash value that occurs more times than is representable by the collision resolution bits, then we error out.
Therefore, we'll have to choose the number of bits of the hash function's output, as well as the number of collision resolution bits, such that the chance of the collision resolution bits not being enough is astronomically small for any realistically sized peer group. My intuition is that 32 bits of hash + 8 bits of collision resolution are enough for peer groups of hundreds of millions of rows, but [I'll make an exact calculation](https://oeis.org/A225871).
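As a rough sanity check of this intuition, the following sketch computes a union-bound estimate of the overflow probability. It is not the exact calculation promised above. It assumes that hash values are uniform and independent, and that overflow means some hash value occurring at least `2^collision_bits + 1` times. The function names are hypothetical.

```rust
use std::f64::consts::LOG10_2;

/// log10(k!) by direct summation; fine for the small k used here.
fn log10_factorial(k: u64) -> f64 {
    (1..=k).map(|i| (i as f64).log10()).sum()
}

/// log10 of an upper bound on the probability that, among `n` rows hashed
/// uniformly into B = 2^hash_bits values, some hash value occurs at least
/// k = 2^collision_bits + 1 times.
///
/// Bound used (union bound over all hash values):
///   P(overflow) <= B * C(n, k) / B^k <= B * (n / B)^k / k!
/// A result above 0 means the bound is vacuous (it says nothing).
fn log10_overflow_bound(n: f64, hash_bits: u32, collision_bits: u32) -> f64 {
    let log10_buckets = hash_bits as f64 * LOG10_2;
    // Smallest per-hash count that no longer fits in the collision bits.
    let k = (1u64 << collision_bits) + 1;
    log10_buckets + k as f64 * (n.log10() - log10_buckets) - log10_factorial(k)
}
```

Under these assumptions, `log10_overflow_bound(5e8, 32, 8)` is far below -700, which is consistent with the intuition for peer groups of hundreds of millions of rows. The bound does not cover non-uniform hash functions or adversarial inputs.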
+
Ordering based on hash values is discouraged in general, because the order can change between different Materialize versions. However, in this particular situation, the benefits seem to outweigh the potential issues.
+
First, note that in this situation, changing hashes won't cause plan changes, and thus sudden plan regressions are not possible. What changing hashes _can_ cause here is a change in output (e.g., LAG grabbing a different value from a previous row). Output changes are going to be a fact of life for a long time for other reasons as well (e.g., fixing bugs in any part of the system), and therefore the system should, in general, be well-prepared for them. (For example, this is the reason why the persist sink [was designed to be self-correcting](https://www.notion.so/materialize/distributed-self-correcting-persist_sink-d3d59834ed9d47d397143c738e9d6c9d).) Also note that even the `Ord` of `Datum` is not perfectly stable: it [has changed before](https://github.com/MaterializeInc/materialize/pull/16810) between Materialize versions.
+
+
Still, we should make a reasonable effort to keep `Datum` hashes stable. An extreme approach would be to add a manually maintained hash function to `Datum`, and then commit to keeping it stable across changes to the internal representation of `Datum`. I think we shouldn't do this at this point, because it would introduce an undue maintenance burden. At the other end of the spectrum would be simply relying on the derived hash function of the standard library. However, the standard hashes can change very often, even when the internal representation of `Datum` doesn't change, merely due to, e.g., compiler version changes. A middle-ground solution would be to use the [stable_hash](https://docs.rs/stable-hash/latest/stable_hash/) library. This avoids changes "across minor versions of this library, even when the compiler, process, architecture, or std lib does change", and also remains stable under certain very simple schema changes.
+
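To make the "manually maintained hash function" end of the spectrum concrete, here is a minimal sketch. It is not a proposal for the actual implementation. `MiniDatum` is a hypothetical, heavily simplified stand-in for `Datum`. FNV-1a is swapped in only because it is short and fully specified, not because the design selects it.

```rust
/// Hypothetical stand-in for `Datum` (an assumption for illustration only;
/// the real type has many more variants).
enum MiniDatum<'a> {
    Null,
    Int64(i64),
    String(&'a str),
}

const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a over `bytes`, continuing from `state`.
fn fnv1a(mut state: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        state ^= b as u64;
        state = state.wrapping_mul(FNV_PRIME);
    }
    state
}

/// Hashes a value through an explicit tag byte plus a fixed little-endian
/// encoding. The result depends only on this function, not on the in-memory
/// layout, the compiler version, or the standard library's hasher.
fn manual_stable_hash(d: &MiniDatum<'_>) -> u64 {
    match d {
        MiniDatum::Null => fnv1a(FNV_OFFSET, &[0]),
        MiniDatum::Int64(i) => fnv1a(fnv1a(FNV_OFFSET, &[1]), &i.to_le_bytes()),
        MiniDatum::String(s) => {
            let h = fnv1a(FNV_OFFSET, &[2]);
            // Length prefix keeps the string encoding self-delimiting.
            let h = fnv1a(h, &(s.len() as u64).to_le_bytes());
            fnv1a(h, s.as_bytes())
        }
    }
}
```

The maintenance burden mentioned above shows up in the tag bytes and encodings: every new variant or representation change needs a deliberate decision that leaves existing hashes unchanged.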
### ORDER BY types
Our prefix sum algorithm operates with indexes that are fixed-length bit vectors, which is a fundamental limitation of the algorithm. (The current implementation has `usize` hardcoded. We will generalize this to longer bit vectors, but they will still have to be fixed-length.) Therefore, any type that we would like to support in the ORDER BY clause of a window function executed by prefix sum will need to be mapped to fixed-length bit vectors. This unfortunately means that variable-length types, such as String, Array, List, Map, and Bytes, won't be supported by prefix sum. For such types, we will fall back to the old, naive rendering (ideally with a warning printed to the user, and possibly a Sentry log).
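As an illustration of what such a mapping could look like for one fixed-length type, the sketch below maps a signed 64-bit integer to an unsigned 64-bit index while preserving order. This is a well-known sign-bit flip, shown here as an assumption about one possible mapping; the function name is hypothetical and the design does not specify this code.

```rust
/// Maps an `i64` to a `u64` so that comparing the results as unsigned
/// integers gives the same order as comparing the inputs as signed integers.
/// Flipping the sign bit moves negative values below non-negative ones.
fn i64_to_ordered_bits(x: i64) -> u64 {
    (x as u64) ^ (1u64 << 63)
}
```

For example, `i64::MIN` maps to `0` and `i64::MAX` maps to `u64::MAX`, so the full signed range is covered without gaps. Variable-length types have no analogous mapping into a fixed number of bits, which is why they need the fallback described above.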
@@ -582,8 +587,6 @@ There are many window functions, and many frame options. We will gradually add t
# Open questions
-
Is it ok that the order within a peer group will be determined by hashes that might be hard to keep stable between versions?
-
We should check that there is correct parallelization inside window partitions.
How can we set up automated performance tests? How can we check in Testdrive that a materialized view that has window functions is being updated fast enough? (This is not critical for the first version; we'll use manual performance tests.)