Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type Model #4

Merged
merged 10 commits into from
Jan 29, 2025
Merged

Type Model #4

merged 10 commits into from
Jan 29, 2025

Conversation

connortsui20
Copy link
Member

@connortsui20 connortsui20 commented Jan 15, 2025

This PR models almost all of the types that will be necessary for optimization. This includes:

  • generic relational algebra operators that allow us to use the same "type" for both expressions in the memo table and operators in the plans
  • logical / physical plans
  • scalar operators and expressions
  • partially materialized logical plans for rule binding
  • transformation rule + implementation rule trait and some empty structs that implement them

I've named the crate itself optd-core. This can be subject to change, but I feel this is a reasonable default for now.

TODO: need to wait on #3 and #12 to be merged before proper CI checks can happen

Edit: I removed the cargo rustdoc check because its creating more problem than it would solve, see #14

@connortsui20 connortsui20 force-pushed the connor/types branch 2 times, most recently from e422259 to 1f7d7fc Compare January 15, 2025 20:09
@connortsui20 connortsui20 force-pushed the connor/types branch 3 times, most recently from bb1e918 to d54fd27 Compare January 28, 2025 21:54
@connortsui20 connortsui20 changed the title Modeling of types to represent query optimization with a persistent memo table Type Model Jan 28, 2025
@connortsui20 connortsui20 marked this pull request as ready for review January 28, 2025 23:04
@connortsui20 connortsui20 force-pushed the connor/types branch 3 times, most recently from d73bed6 to 0bc61c9 Compare January 29, 2025 03:22
@cmu-db cmu-db deleted a comment from codecov-commenter Jan 29, 2025
@codecov-commenter
Copy link

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

Copy link
Collaborator

@AlSchlo AlSchlo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed some variables for clarity, and moved around some code.

I added support to expressions and scalars.
Also added some MVP operators in there.

This commit adds core types for all of the core concepts we need to
represent during query optimization (via definitions in the glossary):

- Logical/physical plans
- Partially materialized logical plans
- Logical/physical/scalar operators
- Logical/physical/scalar expressions
- Transformation/implementation rules

Co-authored-by: Connor Tsui <[email protected]>
Co-authored-by: Alexis Schlomer <[email protected]>
@connortsui20
Copy link
Member Author

@AlSchlo Several thoughts:

  • Are we using the same GroupId for both relational and scalar groups? I actually don't have a strong preference either way, since that is easily changed in the future
  • I don't really like the name RelLink because I still think that Rel is somewhat ambiguous (hence why I spelled out "relational" every time I used it). I also abandoned Link because I felt it was not descriptive enough
  • Calling the root operator node instead of root I think is somewhat questionable, but I don't have any strong feelings about it
  • I'm not totally convinced about the LogicalExpression and PhysicalExpression enums, as it implies that scalars can be both a logical expression and a physical expression. Based on what we talked about yesterday, this doesn't really make a lot of sense -- scalars should be a completely different type. So I'm inclined to keep a RelationalExpression and ScalarExpression type...

@AlSchlo
Copy link
Collaborator

AlSchlo commented Jan 29, 2025

Are we using the same GroupId for both relational and scalar groups?

Yes. We already have the differentiator in the "upper" layer of the enum, right?

I don't really like the name RelLink because I still think that Rel is ambiguous

Me neither, but I liked child even less, as scalars are also children.

Calling the root operator node instead of root

Yes, since it's not always the root. I find that to be confusing.

I'm not totally convinced about the LogicalExpression and PhysicalExpression enums

Requires some extra thought. Not too sure if we want Scalars as a completely different type. They are after all both logical and physical in Cascades.

Renames `RelLink` to `Relation` and `ScalarLink` to `Scalar`.
Copy link
Member

@yliang412 yliang412 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work guys. This is solid foundation. Besides the nits, the two main improvements I want to point out us regarding the transformation rule API and ways we represent subqueries in the system.

For the transformation rule API, to add newly generated expressions back to the memo table + recursively check for duplicates, you probably want to return some kind of partial plan.

For subqueries, I would argue we might want to handle it in some form of our representation. So there is the case where scalar could contain relation children. But after normalization, we should be fine. We might also want to support executing a subquery without unnesting (it will be a useful feature).

@yliang412 yliang412 requested a review from SarveshOO7 January 29, 2025 18:12
@AlSchlo AlSchlo merged commit c41cf41 into main Jan 29, 2025
12 checks passed
@AlSchlo AlSchlo deleted the connor/types branch January 29, 2025 22:24
@connortsui20 connortsui20 mentioned this pull request Jan 30, 2025
connortsui20 added a commit that referenced this pull request Jan 31, 2025
## Problem

#4 Was the first major PR, but since we were making a lot of changes
last minute there were a few things that slipped through with respect to
quality control.

## Summary of changes

- Added very strict clippy documentation requirements. Discussion below.
- Added documentation for everything except the
`operator::relational::physical` module and the `operator::scalar`
module, as those seem kind of unstable right now.
- Changed the scalar generic param of expressions to use `ScalarGroupId`
instead of `GroupId`.
- Added several TODOs that I think will be resolved pretty quickly,
though I'd appreciate it if others took a look at those in the case that
we can actually get those done now.

Now that we have something to work off of, requiring documentation on
everything (including private items) means that we'll pay the cost right
now of making sure people understand what we're doing in the future. I'm
willing to relax it a little bit if we put _other_ stuff in place that
ensure nobody in the future encounters large chunks of code with zero
documentation.
@yliang412 yliang412 mentioned this pull request Feb 11, 2025
yliang412 added a commit that referenced this pull request Feb 14, 2025
## Problem

With the initial representation and storage added in #4 and #22, we now
want to support the full pipeline going from parsing SQL, optimizing the
plan using optd, and executing the query in Datafusion.

## Summary of changes

- Integrate all @SarveshOO7's good work in
#10
- Added one mock physical implementation rule + operator for each
logical operator
- Refactor scalar operator storage and reduce code bloat.
- Add physical storage tables and memo API.
- Bump MSRV to 1.81.0 to be compatible with datafusion 45.0.0:
apache/datafusion#14330

---------

Signed-off-by: Yuchen Liang <[email protected]>
Co-authored-by: SarveshOO7 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants