Add window functions design doc for prefix sum #17077
base: main
Conversation
Force-pushed from de778a0 to 0fa49b5
This looks great! I'm not yet hitting approve as I need to ponder on what steps this should improve. @frankmcsherry, if you're happy with the contents, can you LGTM? Thanks!
@antiguru The MIR and LIR representations are still open:
I updated the design doc with the above. @frankmcsherry, could you please take a look? And then we can hopefully merge the PR, and then I'll start coding the representations tomorrow.
Force-pushed from cc61510 to 2d51ab0
frankmcsherry left a comment
I got this far before I need a break. One thing that is challenging is that the format is not prose but rather bulleted lists. I have a hard time reading this and trying to assess what points are being made, whether they are clearly made, whether we are moving on to the next point or musing on something else. I do understand what you want to do, and we are getting to the point where I'll say "yes" just to not have to fight it, but I think there is a legit exercise in recording the reasoning that we want to do before we move forward.
> - For LAG/LEAD with an offset of 1, the sum function will just remember the previous value. (And we can have a similar one for LEAD.) This looks a bit weird as a sum function, but it will work:
>   - it's associative;
>   - it's not commutative, but that shouldn't be an issue for Prefix Sum;
>   - it doesn't have a zero element if implemented naively, but we can add one artificially: normally we always return the right argument, but if the right argument is the artificially added zero element, then we return the left argument. Note: adding the zero element is important, because prefix sum sometimes sums in the zero element not just at the beginning, but randomly in the middle of an addition chain. E.g., when having *a, b, c* in the prefix, we might expect simply *a+b+c* or maybe *z+a+b+c* to be the prefix sum, but actually our Prefix Sum implementation might give us something like *z+a+b+z+c+z*.
Is this maybe a long way to say that the type should be Option<Val>? When I find myself in this situation, I try and return to the first moment where I said something funny and fix it. I think above, it is where you say "the sum function will just remember the previous value", and instead could have written ".. the previous value if it exists, and None if it does not". Then we don't have to patch up that explanation later on.
Yes, thanks, simplified the text.
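To make the `Option<Val>` framing concrete, here is a minimal sketch (assuming `Val` as a stand-in for the real value type; this is illustrative, not Materialize code). The combine function keeps the rightmost present value, and `None` serves as the zero element that can be merged in anywhere in the chain without changing the result:

```rust
// Illustrative sketch only; `Val` stands in for the real value type.
type Val = i64;

// The "sum" function for LAG with offset 1: keep the rightmost present value.
// `None` is the zero element, so prefix sum may merge it in anywhere
// (the z+a+b+z+c+z situation) without affecting the result.
fn combine(a: Option<Val>, b: Option<Val>) -> Option<Val> {
    b.or(a)
}

fn main() {
    // Folding a prefix [a, b, c] yields its last value, c.
    let folded = [Some(1), Some(2), Some(3)].into_iter().fold(None, combine);
    assert_eq!(folded, Some(3));

    // The zero element is harmless in the middle of a chain.
    assert_eq!(combine(combine(Some(1), None), Some(2)), Some(2));

    // Associativity spot-check.
    assert_eq!(
        combine(combine(Some(1), Some(2)), Some(3)),
        combine(Some(1), combine(Some(2), Some(3)))
    );
}
```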
> - I built [a small prototype outside Materialize](https://github.com/ggevay/window-funcs), where I verified that the output values are correct, and that output is quickly updated for small input changes.
> - For LAG/LEAD with offset *k > 1* (which computes the given expression not for the previous record, but for the record that was *k* records ago):
>   - A sum function could simply remember the last *k* values. This works kind of ok for small *k*.
>   - There is a technical trick to make the overhead smaller by avoiding doing many copies on the list that is keeping the last **k** values: We could use such a reduction that has a `&mut` argument, and keep the last **k** values in that argument.
I did not understand this sentence. Is it significant that here you use **k** rather than *k* in the line above? It is perhaps fine to not have the technical tricks in the design doc, unless they are important for the viability.
The **k** vs. *k* difference came about by accident when copy-pasting from Notion.
> We'll use the word **index** in the text below to mean the values of the ORDER BY column of the OVER clause, i.e., they are simply the values that determine the ordering. (Note that it's sparse indexing, i.e., not every number from 1 to n occurs, but there are usually (big) gaps.)
> Now I'll list all window functions, and how we'll support them with one of the above approaches:
In reading this, it starts as an enumeration of window functions, but turns into a bullet salad. If at all possible, it would help to decouple the moments that are meant to be enumerations from the associated thoughts, to make it easier for the reader to assess the exhaustiveness / find the special case they are looking to understand.
Yes, sorry, I didn't realize how annoying a bullet salad can be. I completely changed the structuring. The "How to handle each window function" section has numbered subsections, enumerating all window functions (grouped into categories). There are also two big horizontal lines delineating this big enumeration section.
> **Duplicate indexes.** There might be rows which agree on both the PARTITION BY key and the ORDER BY key (the index). Prefix Sum doesn't handle such duplicate indexes.
> - To eliminate these duplicates, we will number the elements inside each group with 0, 1, 2, …, and this will be an additional component of the prefix sum indexes.
What we do elsewhere is extend any ordering by an order on Row. Given the requirement on using integers for the ordering key, this could potentially just be a hash of the row, appended after the primary integer/integers.
The problem with that is that the entire Row might match. Note that in other ordering situations this possibility is not a problem, because if the entire Row matches, then there is no difference between ordering them one way or the other. But the issue here is not ordering stability, but that the current Prefix Sum implementation just cannot handle duplicate ordering keys.
I added a note that an alternative would be to make Prefix Sum correctly handle duplicate ordering keys.
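The numbering idea can be sketched as follows (names and types here are illustrative, not actual Materialize code): extend each (partition key, ordering key) pair with a sequence number counting its duplicates seen so far, so that the extended keys are unique.

```rust
use std::collections::HashMap;

// Illustrative sketch: make prefix-sum ordering keys unique by appending a
// per-duplicate-group sequence number 0, 1, 2, ... to each
// (partition key, order key) pair. Keys are u64 stand-ins for the real types.
fn disambiguate_keys(rows: &[(u64, u64)]) -> Vec<((u64, u64, u64), usize)> {
    // Count how many times each (partition, order-key) pair has occurred.
    let mut seen: HashMap<(u64, u64), u64> = HashMap::new();
    rows.iter()
        .enumerate()
        .map(|(row_id, &(part, idx))| {
            let n = seen.entry((part, idx)).or_insert(0);
            let key = (part, idx, *n); // extended, now-unique ordering key
            *n += 1;
            (key, row_id)
        })
        .collect()
}

fn main() {
    // Two rows share (partition = 1, index = 10); they get sequence numbers 0 and 1.
    let keyed = disambiguate_keys(&[(1, 10), (1, 10), (1, 20)]);
    assert_eq!(keyed[0].0, (1, 10, 0));
    assert_eq!(keyed[1].0, (1, 10, 1));
    assert_eq!(keyed[2].0, (1, 20, 0));
}
```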
> - Compared to 3., it might be easier to skip windowing in many transforms. This is both good and bad:
>   - We can get a first version done more quickly. (And then potentially add optimizations later.)
>   - But we might leave some easy optimization opportunities on the table, which would come from already-existing transform code for `Reduce`.
> - A new `MirRelationExpr` variant would mean we have to modify about 12-14 transforms (`LetRec` is pattern-matched in 12 files in the `transform` crate, `TopK` 14 times. See also [the recent LetRec addition](https://github.com/MaterializeInc/materialize/commit/9ac8e060d82487752ba28c42f7b146ff9f730ca3) for an example of how it looks when we add a new `MirRelationExpr` variant.)
For clarity, LetRec is disabled in all but a few transforms because we don't know that it is correct without a more thorough audit of the transforms. We have non-exhaustive matches (e.g. predicate pushdown) where I suspect it is not correct, and we'll have to do the same sort of checking for a WindowFunction variant.
I changed the text to clarify this.
Thank you very much for the comments, @frankmcsherry! I addressed all of them. I think most things are resolved, I only have a few minor questions.
Yes. I've changed the design doc to go with the idiom recognition approach.
Sorry, I don't really understand this. So we have Do you mean that we should change the
In the meantime, I'm thinking of doing the idiom recognition in the MIR-to-LIR lowering, and then there is no type problem. Is that ok? (Also, AFAIU we shouldn't have conditional code in the rendering, and this idiom recognition will be a giant piece of conditional code, so it seems to belong more in the lowering. Also, we want EXPLAIN PHYSICAL PLAN to show how we'll execute a window function.)
Force-pushed from 22fafde to 9bcfd69
antiguru left a comment
I didn't finish reading, but I wanted to leave some early feedback. I think this is on track to becoming a good design doc, but there's work left to make the content more approachable. I'll continue tomorrow!
> [Many users want to use window functions](https://www.notion.so/Window-Functions-Use-Cases-6ad1846a7da942dc8fa28997d9c220dd), but our current window function support is very inefficient: we recompute the results for an entire window partition on any small change in the partition. So our current support works only when the window partitions are either very small or rarely change.
> ## Goals
This section seems to mix goals with implementation, for example that some things are easy and others are not, or that prefix_sum will handle things for us. Ideally, this section lays out what we'll have after applying the change to Materialize that we didn't have before. The actual mapping should be part of the design section I believe.
I mostly separated them, but I think it's still fine to have a few short hints here about which features will be supported by prefix sum vs. by translating away window functions.
> To have better performance (and to support non-invertible aggregations, e.g., min/max), we need to extend what the `broadcast` part of prefix sum is doing (`aggregate` can stay the same):
> - `queries` will contain intervals specified by two indexes.
> - `requests`: We can similarly compute a set of requests from `queries`. The change will only be inside the `flat_map`.
> - `full_ranges`, `zero_ranges`, `used_ranges` stay the same.
> - `init_states` won't start at position 0, but at the lower end of the intervals in `queries`.
> - The iteration at the end will be mostly the same.
This is specific to the implementation and doesn't make sense for someone who didn't study it. I think you need an explanation of the prefix_sum operator that you suggest we use.
I added an explanation for prefix_sum's implementation.
I will also improve this explanation a bit.
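For readers who haven't studied the operator, the core idea behind answering an interval query from the maintained state can be sketched independently of the actual DD code (an illustrative reconstruction, not the real `broadcast` implementation): an interval *[lo, hi)* is covered by O(log n) power-of-2 aligned ranges, which are exactly the kind of ranges the prefix-sum structure maintains.

```rust
// Illustrative sketch: decompose [lo, hi) into power-of-2 aligned ranges,
// the Fenwick-style ranges a prefix-sum structure maintains. Combining the
// aggregates of these O(log n) ranges answers an interval query.
fn covering_ranges(mut lo: u64, hi: u64) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    while lo < hi {
        // Largest power of two `len` such that `lo` is len-aligned...
        let mut len = if lo == 0 { 1u64 << 63 } else { lo & lo.wrapping_neg() };
        // ...shrunk until the block also fits inside [lo, hi).
        while len > hi - lo {
            len /= 2;
        }
        out.push((lo, lo + len));
        lo += len;
    }
    out
}

fn main() {
    // [3, 10) is covered by a length-1, a length-4, and a length-2 block.
    assert_eq!(covering_ranges(3, 10), vec![(3, 4), (4, 8), (8, 10)]);
    // [0, 5) is covered by a length-4 and a length-1 block.
    assert_eq!(covering_ranges(0, 5), vec![(0, 4), (4, 5)]);
}
```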
antiguru left a comment
I added more comments. I really think we should start from a problem statement and a discussion of prefix sum, which combined can then result in a solution to said problem. My goal for this effort is to make sure that we consider alternatives and discuss the design such that we avoid surprises on the way, and that we have a reference that allows others to pick up this work at a later point in time. It doesn't need to be the documentation, because design docs are a point-in-time artifact, but it should enable us to get there!
> I built [a small prototype outside Materialize](https://github.com/ggevay/window-funcs), where I verified that the output values are correct, and that output is quickly updated for small input changes.
> For LAG/LEAD with *k > 1* (which computes the given expression not for the previous record, but for the record that was *k* records ago), the sum function could simply remember the last *k* values, acting on a `Vec<Val>` of length at most *k*, which would generalize `Option<Val>`. This works reasonably well for small *k*. A more complicated but probably better solution is to find the index of the element in the window partition that is *k* elements behind, using the same method as we use for calculating the intervals of the framed window functions (see below). Then, with the index in hand, we can just do a self-join.
This seems an important part of your design, but for me it could use more explanation. Firstly, what does the function look like? Does it need to maintain a vector, or does it need to maintain an order? Do we insert and delete at the end? It influences what data structure you'll use.
Can you elaborate on the self-join part?
I will clarify this.
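One way to pin down the *k > 1* "sum" function is the following sketch (with illustrative names; the real `Val` type and offset handling would differ): the accumulator is the list of at most the last *k* values, and combine concatenates and truncates. This is associative, and the empty list is the zero element.

```rust
// Illustrative sketch of the "sum" function for LAG/LEAD with offset k > 1:
// each accumulator keeps at most the last K values, generalizing the
// Option<Val> used for k = 1.
type Val = i64;
const K: usize = 3;

// Associative combine: concatenate, then keep only the last K values.
// The empty Vec is the zero element.
fn combine(mut a: Vec<Val>, b: Vec<Val>) -> Vec<Val> {
    a.extend(b);
    let cut = a.len().saturating_sub(K);
    a.drain(..cut);
    a
}

fn main() {
    // Folding singletons [1], ..., [5] keeps the last K = 3 values; LAG with
    // offset 3 for the next row would read the front element.
    let folded = (1..=5).map(|v| vec![v]).fold(Vec::new(), combine);
    assert_eq!(folded, vec![3, 4, 5]);

    // Truncation keeps combine associative.
    assert_eq!(
        combine(combine(vec![1], vec![2, 3]), vec![4]),
        combine(vec![1], combine(vec![2, 3], vec![4]))
    );
}
```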
> ### Parallelism
> DD's Prefix Sum should be data-parallel even inside a window partition. (It's similar to [a Fenwick tree](https://en.wikipedia.org/wiki/Fenwick_tree), with sums maintained over power-of-2 sized intervals, from which you can compute a prefix sum by putting together log n intervals.) TODO: But I wasn't able to actually observe a speedup in a simple test when adding cores, so we should investigate what's going on with parallelization. There was probably just some technical issue, because all operations in the Prefix Sum implementation look parallelizable.
This sounds like you don't want to make it part of the design, but I strongly suggest that any implementation includes a performance evaluation, and that we outline expected performance here.
Ok, yes, I added many more details about performance. I will add more when I have a prototype.
Force-pushed from 9f4b35a to 679c330
Force-pushed from e0b3654 to e641252
Force-pushed from 2f1cee6 to 264f9c4
Force-pushed from 264f9c4 to 97b45ff
Force-pushed from d0d3322 to 71a6ce6
Force-pushed from 71a6ce6 to 4ac23ed
This PR moves the window functions design doc from Notion to a markdown file.
Rendered: https://github.com/ggevay/materialize/blob/windowing-design-doc/doc/developer/design/20230110_window_functions.md
(Windowing EPIC: https://github.com/MaterializeInc/database-issues/issues/4732)