You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: doc/developer/design/20230110_window_functions.md
+7-7Lines changed: 7 additions & 7 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -116,9 +116,9 @@ The current way of executing window functions is to put entire window partitions
116
116
We'll use several approaches to solve the many cases mentioned in [Goals](#goals):
117
117
118
118
1. We'll use [DD's prefix_sum](https://github.com/TimelyDataflow/differential-dataflow/blob/master/src/algorithms/prefix_sum.rs) with some tricky sum functions and some generalizations.
119
-
2. We'll use a special-purpose rendering for LAG/LEAD of offset 1 with no IGNORE NULLS, which will be simpler and more efficient than Prefix Sum.
119
+
2. We'll use a [special-purpose rendering](#Special-rendering-for-LAG-and-LEAD) for LAG/LEAD of offset 1 with no IGNORE NULLS, which will be simpler and more efficient than Prefix Sum.
120
120
3. As an extension of 1., we'll use a generalization of DD's prefix sum to arbitrary intervals (i.e., not just prefixes).
121
-
4. We'll transform away window functions in some special cases (e.g., to TopK, or a simple grouped aggregation + self-join)
121
+
4. We'll transform away window functions in some special cases (e.g., to TopK, or a simple grouped aggregation + self-join).
122
122
5. Initially, we will resort to the old window function implementation in some cases, but this should become less and less over time. I think it will be possible to eventually implement all window function usage with the above 1.-4. approaches, but it will take time to get there.
123
123
124
124
### Getting window functions from SQL to the rendering
@@ -141,7 +141,7 @@ from cities;
141
141
142
142
To avoid creating a new enum variant in MirRelationExpr, we will recognize the above pattern during the MIR-to-LIR lowering, and create a new LIR enum variant for window functions. I estimate this pattern recognition to need about 15-20 if/match statements. It can happen that this pattern recognition approach turns out to be too brittle: we might accidentally leave out cases when the pattern is slightly different due to unrelated MIR transforms, plus we might break it from time to time with unrelated MIR transform changes. If this happens, then we might reconsider creating a new MIR enum variant later. (Which would be easier after the optimizer refactoring/cleanup.) For an extended discussion on alternative representations in HIR/MIR/LIR, see the [Representing window functions in each of the IRs](#Representing-window-functions-in-each-of-the-IRs) section.
143
143
144
-
Also, we will want to entirely transform away certain window function patterns; most notable is the ROW_NUMBER-to-TopK transform. For this, we need to canonicalize scalar expressions, which I think we usually do in MIR. This means that transforming away these window function patterns should happen on MIR. This will start by again recognizing the above general windowing pattern, and then performing pattern recognition of the TopK pattern.
144
+
Also, we will want to entirely transform away certain window function patterns; most notable is the ROW_NUMBER-to-TopK transform. For this, we need to canonicalize scalar expressions, which I think we usually do in MIR. This means that transforming away these window function patterns should happen on MIR. This will start by, again, recognizing the above general windowing pattern, and then performing pattern recognition of the TopK-expressed-with-ROW_NUMBER pattern.
145
145
146
146
### Prefix Sum
147
147
@@ -317,11 +317,11 @@ SELECT state, name, pop,
317
317
FROM cities;
318
318
```
319
319
320
-
These also operate based on a **frame**, similarly to window aggregations. (The above example query doesn't specify a frame, therefore it uses the default frame: from the beginning of the partition to the current row) They can be similarly implemented to window aggregations, i.e., we could “sum” up the relevant interval (that is not necessarily a prefix) with an appropriate sum function.
320
+
These also operate based on a **frame**, similarly to window aggregations. (The above example query doesn't specify a frame, therefore it uses the default frame: from the beginning of the partition to the current row.)
321
321
322
-
Alternatively, we could make these a bit faster (except for NTH_VALUE) if we just find the index of the relevant end of the interval (i.e., left end for FIRST_VALUE), and then self-join.
322
+
These could be implemented similarly to window aggregations, i.e., we could “sum” up the relevant interval (that is not necessarily a prefix) with an appropriate sum function. However, we will use a faster way to implement them (except for NTH_VALUE): we just find the index of the relevant end of the frame interval (i.e., left end for FIRST_VALUE), and then self-join. (This will happen in the MIR-to-LIR lowering, since finding the end of the interval is not expressible in MIR, as it is the same operation as finding the ends of frames for window aggregations.)
323
323
324
-
(And there are some special cases when we can transform away the window function usage: FIRST_VALUE with UNBOUNDED PRECEDING and LAST_VALUE with UNBOUNDED FOLLOWING should be transformed to just a (non-windowed) grouped aggregation + self-join instead of prefix sum trickery. Also, similarly for the case when there is no ORDER BY.)
324
+
There are also some special cases where we can transform away the window function usage: FIRST_VALUE with UNBOUNDED PRECEDING and LAST_VALUE with UNBOUNDED FOLLOWING should be transformed to just a Top1 on the PARTITION BY key + a self-join on the same key instead of prefix sum trickery. This approach also works for the case when there is no ORDER BY, since in this case an entire partition is a single peer group.
325
325
326
326
----------------------
327
327
@@ -392,7 +392,7 @@ How many bits we should chop off in one step involves a similar trade-off as a h
392
392
393
393
This will reduce the time overhead of `aggregate`. It will also reduce the memory overhead of `aggregate` by reducing the memory need of the internal operations, but it won't reduce the total output size of `aggregate`.
394
394
395
-
#### Special rendering for LAG/LEAD
395
+
#### Special rendering for LAG and LEAD
396
396
397
397
Instead of prefix sum, we will have a special rendering for LAG/LEAD: A similar iteration to `aggregate` will chop off 6 bits of the indexes in each step, but the `reduce` logic will simply perform the LAG/LEAD on those elements that went into one invocation of the logic (instead of summing intervals). It can perform the LAG on all but the first element of the list of elements that go into a single invocation of the logic. The first element it will just send onwards to later steps. Therefore, the output will include two kinds of values: one will be final LAG values, and the other will be values that are still waiting for their LAG results. These special values will be met up with the last elements of the input list of the `reduce` logic of the next step.
0 commit comments