Add window functions design doc for prefix sum #17077
base: main
Conversation
Force-pushed from de778a0 to 0fa49b5
This looks great! I'm not yet hitting approve as I need to ponder on what steps this should improve. @frankmcsherry, if you're happy with the contents, can you LGTM? Thanks!
@antiguru The MIR and LIR representations are still open:
I updated the design doc with the above. @frankmcsherry, could you please take a look? And then we can hopefully merge the PR, and then I'll start coding the representations tomorrow.
Force-pushed from cc61510 to 2d51ab0
frankmcsherry left a comment
I got this far before I need a break. One thing that is challenging is that the format is not prose but rather bulleted lists. I have a hard time reading this and trying to assess what points are being made, whether they are clearly made, whether we are moving on to the next point or musing on something else. I do understand what you want to do, and we are getting to the point where I'll say "yes" just to not have to fight it, but I think there is a legit exercise in recording the reasoning that we want to do before we move forward.
> - For LAG/LEAD with an offset of 1, the sum function will just remember the previous value. (And we can have a similar one for LEAD.) This looks a bit weird as a sum function, but it will work:
>   - it's associative;
>   - it's not commutative, but that shouldn't be an issue for Prefix Sum;
>   - it doesn't have a zero element if implemented naively, but we can add one artificially: normally we always return the right argument, but if the right argument is the artificially added zero element, then we return the left argument. Note: adding the zero element is important, because prefix sum sometimes sums in the zero element not just at the beginning, but randomly in the middle of an addition chain. E.g., when having *a, b, c* in the prefix, we might expect simply *a+b+c* or maybe *z+a+b+c* to be the prefix sum, but actually our Prefix Sum implementation might give us something like *z+a+b+z+c+z*.
Is this maybe a long way to say that the type should be Option<Val>? When I find myself in this situation, I try and return to the first moment where I said something funny and fix it. I think above, it is where you say "the sum function will just remember the previous value", and instead could have written ".. the previous value if it exists, and None if it does not". Then we don't have to patch up that explanation later on.
Yes, thanks, simplified the text.
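To make the `Option<Val>` framing concrete, here is a minimal sketch (assuming `Val` as a stand-in for the real value type; this is illustrative, not Materialize code). The combine function keeps the rightmost present value, and `None` serves as the zero element that can be merged in anywhere in the chain without changing the result:

```rust
// Illustrative sketch only; `Val` stands in for the real value type.
type Val = i64;

// The "sum" function for LAG with offset 1: keep the rightmost present value.
// `None` is the zero element, so prefix sum may merge it in anywhere
// (the z+a+b+z+c+z situation) without affecting the result.
fn combine(a: Option<Val>, b: Option<Val>) -> Option<Val> {
    b.or(a)
}

fn main() {
    // Folding a prefix [a, b, c] yields its last value, c.
    let folded = [Some(1), Some(2), Some(3)].into_iter().fold(None, combine);
    assert_eq!(folded, Some(3));

    // The zero element is harmless in the middle of a chain.
    assert_eq!(combine(combine(Some(1), None), Some(2)), Some(2));

    // Associativity spot-check.
    assert_eq!(
        combine(combine(Some(1), Some(2)), Some(3)),
        combine(Some(1), combine(Some(2), Some(3)))
    );
}
```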
> - I built [a small prototype outside Materialize](https://github.com/ggevay/window-funcs), where I verified that the output values are correct, and that output is quickly updated for small input changes.
> - For LAG/LEAD with offset *k > 1* (which computes the given expression not for the previous record, but for the record that was *k* records ago):
>   - A sum function could simply remember the last *k* values. This works kind of ok for small *k*.
>   - There is a technical trick to make the overhead smaller by avoiding doing many copies on the list that is keeping the last **k** values: We could use such a reduction that has a `&mut` argument, and keep the last **k** values in that argument.
I did not understand this sentence. Is it significant that here you use **k** rather than *k* in the line above? It is perhaps fine to not have the technical tricks in the design doc, unless they are important for the viability.
The **k** vs. *k* difference came about by accident when copy-pasting from Notion.
> We'll use the word **index** in the text below to mean the values of the ORDER BY column of the OVER clause, i.e., they are simply the values that determine the ordering. (Note that it's sparse indexing, i.e., not every number from 1 to n occurs, but there are usually (big) gaps.)
> Now I'll list all window functions, and how we'll support them with one of the above approaches:
In reading this, it starts as an enumeration of window functions, but turns into a bullet salad. If at all possible, it would help to decouple the moments that are meant to be enumerations from the associated thoughts, to make it easier for the reader to assess the exhaustiveness / find the special case they are looking to understand.
Yes, sorry, I didn't realize how annoying a bullet salad can be. I completely changed the structuring. The "How to handle each window function" section has numbered subsections, enumerating all window functions (grouped into categories). There are also two big horizontal lines delineating this big enumeration section.
> **Duplicate indexes.** There might be rows which agree on both the PARTITION BY key and the ORDER BY key (the index). Prefix Sum doesn't handle such duplicate indexes.
> - To eliminate these duplicates, we will number the elements inside each group with 0, 1, 2, …, and this will be an additional component of the prefix sum indexes.
What we do elsewhere is extend any ordering by an order on Row. Given the requirement on using integers for the ordering key, this could potentially just be a hash of the row, appended after the primary integer/integers.
The problem with that is that the entire Row might match. Note that in other ordering situations this possibility is not a problem, because if the entire Row matches, then there is no difference between ordering them one way or the other. But the issue here is not ordering stability, but that the current Prefix Sum implementation just cannot handle duplicate ordering keys.
I added a note that an alternative would be to make Prefix Sum correctly handle duplicate ordering keys.
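The numbering idea can be sketched as follows (names and types here are illustrative, not actual Materialize code): extend each (partition key, ordering key) pair with a sequence number counting its duplicates seen so far, so that the extended keys are unique.

```rust
use std::collections::HashMap;

// Illustrative sketch: make prefix-sum ordering keys unique by appending a
// per-duplicate-group sequence number 0, 1, 2, ... to each
// (partition key, order key) pair. Keys are u64 stand-ins for the real types.
fn disambiguate_keys(rows: &[(u64, u64)]) -> Vec<((u64, u64, u64), usize)> {
    // Count how many times each (partition, order-key) pair has occurred.
    let mut seen: HashMap<(u64, u64), u64> = HashMap::new();
    rows.iter()
        .enumerate()
        .map(|(row_id, &(part, idx))| {
            let n = seen.entry((part, idx)).or_insert(0);
            let key = (part, idx, *n); // extended, now-unique ordering key
            *n += 1;
            (key, row_id)
        })
        .collect()
}

fn main() {
    // Two rows share (partition = 1, index = 10); they get sequence numbers 0 and 1.
    let keyed = disambiguate_keys(&[(1, 10), (1, 10), (1, 20)]);
    assert_eq!(keyed[0].0, (1, 10, 0));
    assert_eq!(keyed[1].0, (1, 10, 1));
    assert_eq!(keyed[2].0, (1, 20, 0));
}
```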
> - Compared to 3., it might be easier to skip windowing in many transforms. This is both good and bad:
>   - We can get a first version done more quickly. (And then potentially add optimizations later.)
>   - But we might leave some easy optimization opportunities on the table, which would come from already-existing transform code for `Reduce`.
> - A new `MirRelationExpr` variant would mean we have to modify about 12-14 transforms (`LetRec` is pattern-matched in 12 files in the `transform` crate, `TopK` 14 times. See also [the recent LetRec addition](https://github.com/MaterializeInc/materialize/commit/9ac8e060d82487752ba28c42f7b146ff9f730ca3) for an example of how it looks when we add a new `MirRelationExpr` variant.)
For clarity, LetRec is disabled in all but a few transforms because we don't know that it is correct without a more thorough audit of the transforms. We have non-exhaustive matches (e.g. predicate pushdown) where I suspect it is not correct, and we'll have to do the same sort of checking for a WindowFunction variant.
I changed the text to clarify this.
Thank you very much for the comments, @frankmcsherry! I addressed all of them. I think most things are resolved, I only have a few minor questions.
Yes. I've changed the design doc to go with the idiom recognition approach.
Sorry, I don't really understand this. So we have Do you mean that we should change the
In the meantime, I'm thinking of doing the idiom recognition in the MIR-to-LIR lowering, and then there is no type problem. Is that ok? (Also, AFAIU we shouldn't have conditional code in the rendering, and this idiom recognition will be a giant piece of conditional code, so it seems to belong more in the lowering. Also, we want EXPLAIN PHYSICAL PLAN to show how we'll execute a window function.)
Force-pushed from 22fafde to 9bcfd69
antiguru left a comment
I didn't finish reading, but I wanted to leave some early feedback. I think this is on track to becoming a good design doc, but there's work left to make the content more approachable. I'll continue tomorrow!
> [Many users want to use window functions](https://www.notion.so/Window-Functions-Use-Cases-6ad1846a7da942dc8fa28997d9c220dd), but our current window function support is very inefficient: we recompute the results for an entire window partition on any small change in the partition. So our current support works only when the window partitions are either very small or rarely change.
> ## Goals
This section seems to mix goals with implementation, for example that some things are easy and others are not, or that prefix_sum will handle things for us. Ideally, this section lays out what we'll have after applying the change to Materialize that we didn't have before. The actual mapping should be part of the design section I believe.
I mostly separated them, but I think it's still fine to have a few short hints here about which features will be supported by prefix sum vs. by translating away window functions.
> To have better performance (and to support non-invertible aggregations, e.g., min/max), we need to extend what the `broadcast` part of prefix sum is doing (`aggregate` can stay the same):
> - `queries` will contain intervals specified by two indexes.
> - `requests`: We can similarly compute a set of requests from `queries`. The change will only be inside the `flat_map`.
> - `full_ranges`, `zero_ranges`, `used_ranges` stay the same.
> - `init_states` won't start at position 0, but at the lower end of the intervals in `queries`.
> - The iteration at the end will be mostly the same.
This is specific to the implementation and doesn't make sense for someone who didn't study it. I think you need an explanation of the prefix_sum operator that you suggest we use.
I added an explanation for prefix_sum's implementation.
I will also improve this explanation a bit.
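For readers who haven't studied the operator, the core idea behind answering an interval query from the maintained state can be sketched independently of the actual DD code (an illustrative reconstruction, not the real `broadcast` implementation): an interval *[lo, hi)* is covered by O(log n) power-of-2 aligned ranges, which are exactly the kind of ranges the prefix-sum structure maintains.

```rust
// Illustrative sketch: decompose [lo, hi) into power-of-2 aligned ranges,
// the Fenwick-style ranges a prefix-sum structure maintains. Combining the
// aggregates of these O(log n) ranges answers an interval query.
fn covering_ranges(mut lo: u64, hi: u64) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    while lo < hi {
        // Largest power of two `len` such that `lo` is len-aligned...
        let mut len = if lo == 0 { 1u64 << 63 } else { lo & lo.wrapping_neg() };
        // ...shrunk until the block also fits inside [lo, hi).
        while len > hi - lo {
            len /= 2;
        }
        out.push((lo, lo + len));
        lo += len;
    }
    out
}

fn main() {
    // [3, 10) is covered by a length-1, a length-4, and a length-2 block.
    assert_eq!(covering_ranges(3, 10), vec![(3, 4), (4, 8), (8, 10)]);
    // [0, 5) is covered by a length-4 and a length-1 block.
    assert_eq!(covering_ranges(0, 5), vec![(0, 4), (4, 5)]);
}
```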
antiguru left a comment
I added more comments. I really think we should start from a problem statement and a discussion of prefix sum, which combined can then result in a solution to said problem. My goal for this effort is to make sure that we consider alternatives and discuss the design such that we avoid surprises on the way, and that we have a reference that allows others to pick up this work at a later point in time. It doesn't need to be the documentation, because design docs are a point-in-time artifact, but it should enable us to get there!
> I built [a small prototype outside Materialize](https://github.com/ggevay/window-funcs), where I verified that the output values are correct, and that output is quickly updated for small input changes.
> For LAG/LEAD with *k > 1* (which computes the given expression not for the previous record, but for the record that was *k* records ago), the sum function could simply remember the last *k* values, acting on a `Vec<Val>` of length at most *k*, which would generalize `Option<Val>`. This works reasonably well for small *k*. A more complicated but probably better solution is to find the index of the element in the window partition that is *k* elements behind, using the same method as we use for calculating the intervals of the framed window functions (see below). Then, with the index in hand, we can just do a self-join.
This seems an important part of your design, but for me it could use more explanation. Firstly, what does the function look like? Does it need to maintain a vector, or does it need to maintain an order? Do we insert and delete at the end? It influences what data structure you'll use.
Can you elaborate on the self-join part?
I will clarify this.
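One way to pin down the *k > 1* "sum" function is the following sketch (with illustrative names; the real `Val` type and offset handling would differ): the accumulator is the list of at most the last *k* values, and combine concatenates and truncates. This is associative, and the empty list is the zero element.

```rust
// Illustrative sketch of the "sum" function for LAG/LEAD with offset k > 1:
// each accumulator keeps at most the last K values, generalizing the
// Option<Val> used for k = 1.
type Val = i64;
const K: usize = 3;

// Associative combine: concatenate, then keep only the last K values.
// The empty Vec is the zero element.
fn combine(mut a: Vec<Val>, b: Vec<Val>) -> Vec<Val> {
    a.extend(b);
    let cut = a.len().saturating_sub(K);
    a.drain(..cut);
    a
}

fn main() {
    // Folding singletons [1], ..., [5] keeps the last K = 3 values; LAG with
    // offset 3 for the next row would read the front element.
    let folded = (1..=5).map(|v| vec![v]).fold(Vec::new(), combine);
    assert_eq!(folded, vec![3, 4, 5]);

    // Truncation keeps combine associative.
    assert_eq!(
        combine(combine(vec![1], vec![2, 3]), vec![4]),
        combine(vec![1], combine(vec![2, 3], vec![4]))
    );
}
```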
> ### Parallelism
> DD's Prefix Sum should be data-parallel even inside a window partition. (It's similar to [a Fenwick tree](https://en.wikipedia.org/wiki/Fenwick_tree), with sums maintained over power-of-2 sized intervals, from which you can compute a prefix sum by putting together log n intervals.) TODO: But I wasn't able to actually observe a speedup in a simple test when adding cores, so we should investigate what's going on with parallelization. There was probably just some technical issue, because all operations in the Prefix Sum implementation look parallelizable.
This sounds like you don't want to make it part of the design, but I strongly suggest that any implementation includes a performance evaluation, and that we outline expected performance here.
Ok, yes, I added many more details about performance. I will add more when I have a prototype.
Force-pushed from 9f4b35a to 679c330
Force-pushed from e0b3654 to e641252
Force-pushed from 2f1cee6 to 264f9c4
Force-pushed from 264f9c4 to 97b45ff
Force-pushed from d0d3322 to 71a6ce6
Force-pushed from 71a6ce6 to 4ac23ed
This PR moves the window functions design doc from Notion to a markdown file.
Rendered: https://github.com/ggevay/materialize/blob/windowing-design-doc/doc/developer/design/20230110_window_functions.md
(Windowing EPIC: https://github.com/MaterializeInc/database-issues/issues/4732)