Conversation

@ggevay
Contributor

@ggevay ggevay commented Jan 10, 2023

@ggevay ggevay added the A-compute Area: compute label Jan 10, 2023
@ggevay ggevay force-pushed the windowing-design-doc branch 5 times, most recently from de778a0 to 0fa49b5 Compare January 11, 2023 16:52
@antiguru
Member

This looks great! I'm not yet hitting approve as I need to ponder on what steps this should improve. @frankmcsherry, if you're happy with the contents, can you LGTM? Thanks!

@ggevay
Contributor Author

ggevay commented Jan 16, 2023

@antiguru The MIR and LIR representations are still open:

  • MIR: I think the top option now for MIR is kind of a hybrid between 1. and 3.: create a many-to-many Reduce as a new MirRelationExpr enum variant. This variant would initially only contain windowing, but after the windowing is done, we would probably merge TopK into this.
  • LIR: We haven't discussed this much with @frankmcsherry yet, but it seems to me that a new variant would be the best option, because the rendering of the current Reduce or TopK wouldn't share any code with the rendering of windowing. (In any case, this seems like a less critical question than MIR, since we don't have LIR transforms.)

I updated the design doc with the above. @frankmcsherry, could you please take a look? And then we can hopefully merge the PR, and then I'll start coding the representations tomorrow.

@ggevay ggevay force-pushed the windowing-design-doc branch 3 times, most recently from cc61510 to 2d51ab0 Compare January 17, 2023 18:11
Contributor

@frankmcsherry frankmcsherry left a comment

I got this far before I needed a break. One thing that is challenging is that the format is not prose but rather bulleted lists. I have a hard time reading this and trying to assess what points are being made, whether they are clearly made, and whether we are moving on to the next point or musing on something else. I do understand what you want to do, and we are getting to the point where I'll say "yes" just to not have to fight it, but I think there is a legit exercise in recording the reasoning that we want to do before we move forward.

- For LAG with an offset of 1, the sum function will just remember the previous value. (And we can have a similar one for LEAD.) This looks a bit weird as a sum function, but it will work:
  - It's associative;
  - It's not commutative, but that shouldn't be an issue for Prefix Sum;
  - It doesn't have a zero element if implemented naively, but we can add one artificially: normally we always return the right argument, but if the right argument is the artificially added zero element, then we return the left argument. Note: adding the zero element is important, because prefix sum sometimes sums in the zero element not just at the beginning, but in the middle of an addition chain. E.g., when we have *a, b, c* in the prefix, we might expect simply *a+b+c* (or maybe *z+a+b+c*) to be the prefix sum, but our Prefix Sum implementation might actually give us something like *z+a+b+z+c+z*.
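To make the zero-element trick concrete, here is a small Rust sketch (hypothetical code, not taken from the design doc or the prototype), using `None` as the artificially added zero element:

```rust
// Element type assumed for illustration only.
type Val = i64;

// The "remember the previous value" sum function for LAG with offset 1.
// Normally return the right argument; if the right argument is the
// artificial zero element (`None`), return the left argument instead.
fn lag_sum(left: Option<Val>, right: Option<Val>) -> Option<Val> {
    right.or(left)
}

fn main() {
    let z: Option<Val> = None;
    let (a, b, c) = (Some(1), Some(2), Some(3));
    // Associative: (a+b)+c == a+(b+c).
    assert_eq!(lag_sum(lag_sum(a, b), c), lag_sum(a, lag_sum(b, c)));
    // Not commutative.
    assert_ne!(lag_sum(a, b), lag_sum(b, a));
    // The zero element can be summed in anywhere in the chain without
    // changing the result: z+a+b+z+c+z == a+b+c.
    let zabzcz = lag_sum(lag_sum(lag_sum(z, a), lag_sum(b, z)), lag_sum(c, z));
    assert_eq!(zabzcz, lag_sum(lag_sum(a, b), c));
}
```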
Contributor

Is this maybe a long way to say that the type should be Option<Val>? When I find myself in this situation, I try and return to the first moment where I said something funny and fix it. I think above, it is where you say "the sum function will just remember the previous value", and instead could have written ".. the previous value if it exists, and None if it does not". Then we don't have to patch up that explanation later on.

Contributor Author

Yes, thanks, simplified the text.

- I built [a small prototype outside Materialize](https://github.com/ggevay/window-funcs), where I verified that the output values are correct, and that output is quickly updated for small input changes.
- For LAG/LEAD with offset *k > 1* (which computes the given expression not for the previous record, but for the record that was *k* records ago):
- A sum function could simply remember the last *k* values. This works kind of ok for small *k*.
- There is a technical trick to make the overhead smaller by avoiding doing many copies on the list that is keeping the last **k** values: We could use such a reduction that has a `&mut` argument, and keep the last **k** values in that argument.
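The `&mut`-accumulator trick could look roughly like this (a hypothetical sketch; `push_last_k` and the element type are illustrative, not from the actual implementation):

```rust
use std::collections::VecDeque;

// Keep the last k values in a mutable accumulator, so each step mutates
// the deque in place instead of copying the whole list.
fn push_last_k(acc: &mut VecDeque<i64>, k: usize, value: i64) {
    acc.push_back(value);
    if acc.len() > k {
        acc.pop_front(); // drop the oldest value; O(1), no list copy
    }
}

fn main() {
    let mut acc = VecDeque::new();
    for v in [10, 20, 30, 40] {
        push_last_k(&mut acc, 3, v);
    }
    // Only the last 3 values survive.
    assert_eq!(acc, VecDeque::from([20, 30, 40]));
}
```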
Contributor

I did not understand this sentence. Is it significant that here you use **k** rather than *k* in the line above? It is perhaps fine to not have the technical tricks in the design doc, unless they are important for viability.

Contributor Author

The **k** and *k* difference came about by accident when copy-pasting from Notion.


We’ll use the word **index** in the text below to mean the values of the ORDER BY column of the OVER clause, i.e., they are simply the values that determine the ordering. (Note that it’s sparse indexing, i.e., not every number from 1 to n occurs, but there are usually (big) gaps.)

Now I’ll list all window functions, and how we’ll support them with one of the above approaches:
Contributor

In reading this, it starts as an enumeration of window functions, but turns into a bullet salad. If at all possible, it would help to decouple the moments that are meant to be enumerations from the associated thoughts, to make it easier for the reader to assess the exhaustiveness / find the special case they are looking to understand.

Contributor Author

Yes, sorry, I didn't realize how annoying a bullet salad can be. I completely changed the structuring. The "How to handle each window function" section has numbered subsections, enumerating all window functions (grouped into categories). There are also two big horizontal lines delineating this big enumeration section.


**Duplicate indexes.** There might be rows which agree on both the PARTITION BY key and the ORDER BY key (the index). Prefix Sum doesn’t handle such duplicate indexes.

- To eliminate these duplicates, we will number the elements inside each group with 0, 1, 2, …, and this will be an additional component of the prefix sum indexes.
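The numbering idea could be sketched like this (hypothetical Rust, a sequential stand-in for what would be a grouped computation in the dataflow; names are illustrative):

```rust
use std::collections::HashMap;

// Rows are (partition key, index). Extend each index with a per-group
// sequence number 0, 1, 2, ... so that extended indexes become unique,
// which is what the Prefix Sum implementation requires.
fn dedup_indexes(rows: &[(u64, u64)]) -> Vec<(u64, (u64, u64))> {
    let mut counts: HashMap<(u64, u64), u64> = HashMap::new();
    rows.iter()
        .map(|&(part, idx)| {
            let seq = counts.entry((part, idx)).or_insert(0);
            let extended = (idx, *seq);
            *seq += 1;
            (part, extended)
        })
        .collect()
}

fn main() {
    // Two rows agree on both the partition key (1) and the index (7);
    // the appended components 0 and 1 make the extended indexes distinct.
    let out = dedup_indexes(&[(1, 7), (1, 7), (1, 9)]);
    assert_eq!(out, vec![(1, (7, 0)), (1, (7, 1)), (1, (9, 0))]);
}
```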
Contributor

What we do elsewhere is extend any ordering by an order on Row. Given the requirement on using integers for the ordering key, this could potentially just be a hash of the row, appended after the primary integer/integers.

Contributor Author

The problem with that is that the entire Row might match. Note that in other ordering situations this possibility is not a problem, because if the entire Row matches, then there is no difference between ordering them one way or the other. But the issue here is not ordering stability, but that the current Prefix Sum implementation just cannot handle duplicate ordering keys.

I added a note that an alternative would be to make Prefix Sum correctly handle duplicate ordering keys.

- Compared to 3., it might be easier to skip windowing in many transforms. This is both good and bad:
  - We can get a first version done more quickly. (And then potentially add optimizations later.)
  - But we might leave some easy optimization opportunities on the table, which would come from already-existing transform code for `Reduce`.
- A new `MirRelationExpr` variant would mean we have to modify about 12-14 transforms (`LetRec` is pattern-matched in 12 files in the `transform` crate, `TopK` 14 times. See also [the recent LetRec addition](https://github.com/MaterializeInc/materialize/commit/9ac8e060d82487752ba28c42f7b146ff9f730ca3) for an example of how it looks when we add a new `MirRelationExpr` variant.)
Contributor

For clarity, LetRec is disabled in all but a few transforms because we don't know that it is correct without a more thorough audit of the transforms. We have non-exhaustive matches (e.g., predicate pushdown) where I suspect it is not correct, and we'll have to do the same sort of checking for a `WindowFunction` variant.

Contributor Author

I changed the text to clarify this.

@ggevay
Contributor Author

ggevay commented Jan 25, 2023

Thank you very much for the comments, @frankmcsherry! I addressed all of them. I think most things are resolved, I only have a few minor questions.

> To loop back around, one imo totally valid proposal is to change nothing about the IR and use idiom detection in rendering, or potentially just in LIR.

Yes. I've changed the design doc to go with the idiom recognition approach.

> Looking at the code snippet below, the only defect seems to be that we don't absorb Project into FlatMap, which we should do in any case.

Sorry, I don't really understand this. So we have

    Project (#3..=#5)
      Map (record_get[1](#1), record_get[0](#2), record_get[2](#2), record_get[0](#1))
        FlatMap unnest_list(#0)
          Project (#1)

Do you mean that we should change the FlatMap to refer to #1 instead of #0? But then we'll still have the extra column after the FlatMap is done, because FlatMap appends a column (similarly to Map). I don't see how we could make the FlatMap perform also the projection. Maybe the Project after the Map could project away the extra column, but then I wouldn't call this "absorbing the Project", but it would be something like projection lifting.

> My understanding is that the specific reason this doesn't work for you is that someone needs to name a specific type for the ordering key, and the types are no longer visible in LIR if they are not captured.

In the meantime, I'm thinking to do the idiom recognition in the MIR-to-LIR lowering, and then there is no type problem. Is that ok? (Also, AFAIU we shouldn't have conditional code in the rendering, and this idiom recognition will be a giant piece of conditional code, so it seems to belong more in the lowering. Also, we want EXPLAIN PHYSICAL PLAN to show how we'll execute a window function.)

@ggevay ggevay force-pushed the windowing-design-doc branch from 22fafde to 9bcfd69 Compare January 31, 2023 12:10
Member

@antiguru antiguru left a comment

I didn't finish reading, but I wanted to leave some early feedback. I think this is on track to becoming a good design doc, but there's work left to make the content more approachable. I'll continue tomorrow!


[Many users want to use window functions](https://www.notion.so/Window-Functions-Use-Cases-6ad1846a7da942dc8fa28997d9c220dd), but our current window function support is very inefficient: we recompute results for an entire window partition for any small change in the partition. So our current support works only when window partitions are either very small or rarely change.

## Goals
Member

This section seems to mix goals with implementation, for example that some things are easy and others are not, or that prefix_sum will handle things for us. Ideally, this section lays out what we'll have after applying the change to Materialize that we didn't have before. The actual mapping should be part of the design section I believe.

Contributor Author

I mostly separated them, but I think a few short hints to which stuff will be supported by prefix sum vs. translating away window functions is still fine to have here.

Comment on lines 165 to 240
To have better performance (and to support non-invertible aggregations, e.g., min/max), we need to extend what the `broadcast` part of prefix sum is doing (`aggregate` can stay the same):
- `queries` will contain intervals specified by two indexes.
- `requests`: We can similarly compute a set of requests from `queries`. The change will only be inside the `flat_map`.
- `full_ranges`, `zero_ranges`, `used_ranges` stay the same.
- `init_states` won’t start at position 0, but at the lower end of the intervals in `queries`
- The iteration at the end will be mostly the same.
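As background for the interval-based `queries`/`requests`, the kind of range decomposition they rely on can be sketched sequentially (illustrative only; this is not the actual `prefix_sum` code, and the function name is made up):

```rust
// Decompose a half-open interval [lo, hi) into power-of-2 sized,
// power-of-2 aligned ranges, i.e., the ranges over which sums are
// maintained. A query interval touches only O(log n) such ranges.
fn aligned_ranges(mut lo: u64, hi: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    while lo < hi {
        // Largest block size allowed by the alignment of `lo`.
        let align = if lo == 0 { u64::MAX } else { lo & lo.wrapping_neg() };
        // Shrink to fit within [lo, hi), rounding down to a power of 2.
        let size = align.min(hi - lo);
        let size = 1u64 << (63 - size.leading_zeros());
        ranges.push((lo, lo + size));
        lo += size;
    }
    ranges
}

fn main() {
    // [3, 11) decomposes into a unit range, an aligned block of 4,
    // a block of 2, and another unit range.
    assert_eq!(
        aligned_ranges(3, 11),
        vec![(3, 4), (4, 8), (8, 10), (10, 11)]
    );
}
```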
Member

This is specific to the implementation and doesn't make sense for someone who didn't study it. I think you need an explanation of the prefix_sum operator that you suggest we use.

Contributor Author

I added an explanation for prefix_sum's implementation.

Contributor Author

I will also improve this explanation a bit.

Member

@antiguru antiguru left a comment

I added more comments. I really think we should start from a problem statement, a discussion of prefix sum, which combined can then result in a solution to said problem. My goal of this effort is to make sure that we consider alternatives and discuss the design such that we avoid surprises on the way, and that we have a reference that allows others to pick up this work at a later point in time. It doesn't need to be the documentation, because design docs are a point-in-time artifact, but it should enable us to get there!


I built [a small prototype outside Materialize](https://github.com/ggevay/window-funcs), where I verified that the output values are correct, and that output is quickly updated for small input changes.

For LAG/LEAD with *k > 1* (which computes the given expression not for the previous record, but for the record that was *k* records ago), the sum function could simply remember the last *k* values, acting on a `Vec<Val>` of length at most *k*, which would generalize `Option<Val>`. This works kind of ok for small *k*. A more complicated but probably better solution is to find the index for that element in the window partition that is *k* elements behind by using the same method as we use for calculating the intervals of the framed window functions (see below). Then with the index in hand, we can just do a self-join.
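The self-join idea, sketched sequentially (hypothetical Rust; in the real plan this would be a dataflow join, and the position would come from the interval computation mentioned above):

```rust
use std::collections::HashMap;

// Each row is (position within its window partition, value).
// LAG(value, k) joins position p with position p - k; positions
// without a match k steps back get None.
fn lag_by_self_join(rows: &[(u64, i64)], k: u64) -> Vec<(u64, Option<i64>)> {
    // Index values by position (the "other side" of the self-join).
    let by_pos: HashMap<u64, i64> = rows.iter().copied().collect();
    rows.iter()
        .map(|&(pos, _)| {
            let lagged = pos.checked_sub(k).and_then(|p| by_pos.get(&p).copied());
            (pos, lagged)
        })
        .collect()
}

fn main() {
    let rows = [(0, 10), (1, 20), (2, 30), (3, 40)];
    assert_eq!(
        lag_by_self_join(&rows, 2),
        vec![(0, None), (1, None), (2, Some(10)), (3, Some(20))]
    );
}
```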
Member

This seems an important part of your design, but for me it could use more explanation. Firstly, what does the function look like? Does it need to maintain a vector, or does it need to maintain an order? Do we insert and delete at the end? It influences what data structure you'll use.
Can you elaborate on the self-join part?

Contributor Author

I will clarify this.


### Parallelism

DD's Prefix Sum should be data-parallel even inside a window partition. (It’s similar to [a Fenwick tree](https://en.wikipedia.org/wiki/Fenwick_tree), with sums maintained over power-of-2 sized intervals, from which you can compute a prefix sum by putting together log(n) intervals.) TODO: But I wasn't able to actually observe a speedup in a simple test when adding cores, so we should investigate what’s going on with parallelization. There was probably just some technical issue, because all operations in the Prefix Sum implementation look parallelizable.
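For readers unfamiliar with the structure, here is a minimal sequential Fenwick-tree sketch (an analogue for intuition only; DD's Prefix Sum maintains similar power-of-2 intervals, but as a parallel dataflow):

```rust
// Minimal Fenwick tree over integer sums: node i stores the sum of a
// power-of-2 sized interval ending at i, and a prefix sum is assembled
// from O(log n) of these stored intervals.
struct Fenwick {
    tree: Vec<i64>, // 1-based; tree[0] is unused
}

impl Fenwick {
    fn new(n: usize) -> Self {
        Fenwick { tree: vec![0; n + 1] }
    }

    // Point update at 1-based position `i`: touches O(log n) intervals.
    fn add(&mut self, mut i: usize, delta: i64) {
        while i < self.tree.len() {
            self.tree[i] += delta;
            i += i & i.wrapping_neg(); // advance by lowest set bit
        }
    }

    // Sum of positions 1..=i, combining O(log n) stored intervals.
    fn prefix(&self, mut i: usize) -> i64 {
        let mut sum = 0;
        while i > 0 {
            sum += self.tree[i];
            i -= i & i.wrapping_neg(); // strip lowest set bit
        }
        sum
    }
}

fn main() {
    let mut f = Fenwick::new(8);
    for i in 1..=8 {
        f.add(i, i as i64); // value at position i is i
    }
    assert_eq!(f.prefix(3), 6);  // 1 + 2 + 3
    assert_eq!(f.prefix(8), 36); // 1 + ... + 8
}
```

Each `prefix` call reads at most log(n) tree nodes, which is what makes the interval-based scheme amenable to incremental and parallel maintenance.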
Member

This sounds like you don't want to make it part of the design, but I strongly suggest that any implementation includes a performance evaluation, and that we outline expected performance here.

Contributor Author

Ok, yes, I added many more details about performance. Will add more when I have a prototype.

@ggevay ggevay force-pushed the windowing-design-doc branch 3 times, most recently from 9f4b35a to 679c330 Compare February 8, 2023 14:15
@ggevay ggevay force-pushed the windowing-design-doc branch 8 times, most recently from e0b3654 to e641252 Compare March 2, 2023 19:28
@ggevay ggevay force-pushed the windowing-design-doc branch from 2f1cee6 to 264f9c4 Compare March 13, 2023 12:48
@ggevay ggevay force-pushed the windowing-design-doc branch from 264f9c4 to 97b45ff Compare March 13, 2023 13:00
@ggevay ggevay force-pushed the windowing-design-doc branch 4 times, most recently from d0d3322 to 71a6ce6 Compare March 14, 2023 12:32
@ggevay ggevay force-pushed the windowing-design-doc branch from 71a6ce6 to 4ac23ed Compare March 14, 2023 16:10
@ggevay ggevay changed the title Add window functions design doc Add window functions design doc for prefix sum Oct 5, 2023
@ggevay ggevay marked this pull request as draft October 5, 2023 08:36