Update glossary and add lots of detail (#5)
This commit defines several key terms related to query optimization that we found were heavily overloaded last semester, leading to lots of confusion.

Since this document will evolve over time, there are still many TODOs that will have to be fixed later.

* update glossary and add lots of detail

* add directory

* clean style

* refactor and make section for each term

* Fill in all main sections with something

* clean up ending but leave TODOs
connortsui20 authored Jan 18, 2025
1 parent de96ad3 commit f1daf95
Showing 1 changed file with 288 additions and 32 deletions.
320 changes: 288 additions & 32 deletions docs/src/architecture/glossary.md
@@ -1,76 +1,332 @@
# Glossary

Definitions in query optimization can get very overloaded. Below is the language optd developers speak.
We have found internally that definitions in query optimization have become overloaded. This
document defines key names and definitions for concepts that are required in optimization.

Many of the names and definitions will be inspired by the Cascades framework. However, there are a
few important differences that need to be addressed considering our memo table will be persistent.

# Contents

- [Memo Table]
- [Expression]
- [Relational Expression]
- [Logical Expression]
- [Physical Expression]
- [Scalar Expression]
- **[Equivalence of Expressions](#expression-equivalence)**
- [Group]
- [Relational Group]
- [Scalar Group]
- [Query Plan]
- [Logical Plan]
- [Physical Plan]
- [Operator] / [Plan Node]
- [Relational Operator]
- [Logical Operator]
- [Physical Operator]
- [Scalar Operator]
- [Property]
- [Logical Property]
- [Physical Property]
- ? Derived Property ?
- [Rule]
- [Transformation Rule]
- [Implementation Rule]

[EQOP]: https://www.microsoft.com/en-us/research/publication/extensible-query-optimizers-in-practice/
[Memo Table]: #memo-table
[Expression]: #expression
[Relational Expression]: #relational-expression
[Logical Expression]: #logical-expression
[Physical Expression]: #physical-expression
[Scalar Expression]: #scalar-expression
[Group]: #group
[Relational Group]: #relational-group
[Scalar Group]: #scalar-group
[Query Plan]: #query-plan
[Logical Plan]: #logical-plan
[Physical Plan]: #physical-plan
[Plan Node]: #operator
[Operator]: #operator
[Relational Operator]: #relational-operator
[Logical Operator]: #logical-operator
[Physical Operator]: #physical-operator
[Scalar Operator]: #scalar-operator
[Property]: #property
[Logical Property]: #logical-property
[Physical Property]: #physical-property
[Rule]: #rule
[Transformation Rule]: #transformation-rule
[Implementation Rule]: #implementation-rule
[Enforcer Rule]: #enforcer-rule
[Enforcer Operator]: #enforcer-operator

### Relational operator
A **relational operator** (`RelNode`) describes an operation that can be evaluated to obtain a bag of tuples. In other literature this is also referred to as a query plan. A relational operator can be either logical or physical.
# Comparison with Cascades

### Scalar operator
In the Cascades framework, an expression is a tree of operators. In `optd`, we are instead defining
a logical or physical [Query Plan] to be a tree or DAG of [Operator]s. An expression in `optd`
strictly refers to the representation of an operator in the [Memo Table], not in query plans.

A **scalar operator** (`ScalarNode`) describes an operation that can be evaluated to obtain a single value. In other literature this is also referred to as a SQL expression or a row expression.
See the [section below](#expression) on the kinds of expressions for more
information.

## Cascades
Most other terms in `optd` are similar to Cascades or are self-explanatory.

### Expressions
<br>

A **logical expression** is a tree/DAG of logical operators.
# Memo Table Terms

A **physical expression** is a tree/DAG of physical operators.
This section describes names and definitions of concepts related to the memo table.

The term **expression** in the context of Cascades can refer to either a relational or a scalar expression.
## Memo Table

### Properties
The memo table is the data structure used for dynamic programming in a top-down plan enumeration
search algorithm. The memo table consists of a mutually recursive data structure made up of
[Expression]s and [Group]s.
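
The mutual recursion described above can be illustrated with a minimal sketch. All names here (`GroupId`, `Expression`, `Group`, `Memo`, `add_expression`) are hypothetical and chosen only for illustration; this is not the actual `optd` memo table, and it ignores persistence entirely.

```rust
use std::collections::HashMap;

/// Hypothetical identifier for a group in the memo table.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct GroupId(pub usize);

/// Hypothetical expression: an operator name plus child *groups*.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Expression {
    pub op: String,
    pub children: Vec<GroupId>,
}

/// A group is a set of equivalent expressions.
#[derive(Debug, Default)]
pub struct Group {
    pub expressions: Vec<Expression>,
}

/// Groups contain expressions, and expressions point back at groups.
#[derive(Debug, Default)]
pub struct Memo {
    groups: Vec<Group>,
    /// Deduplication index: which group already holds this expression?
    index: HashMap<Expression, GroupId>,
}

impl Memo {
    /// Adds `expr` to a fresh group, or returns the group that already holds it.
    pub fn add_expression(&mut self, expr: Expression) -> GroupId {
        if let Some(&id) = self.index.get(&expr) {
            return id;
        }
        let id = GroupId(self.groups.len());
        self.groups.push(Group { expressions: vec![expr.clone()] });
        self.index.insert(expr, id);
        id
    }

    /// Looks up a group by its identifier.
    pub fn group(&self, id: GroupId) -> &Group {
        &self.groups[id.0]
    }
}
```

The deduplication index is one possible design choice: it lets the search reuse an existing group instead of re-optimizing an identical sub-plan, which is the point of using the memo table for dynamic programming.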

**Properties** are metadata computed (and sometimes stored) for each node in an expression.
Properties of an expression may be **required** by the original SQL query or **derived** from **physical properties of one of its inputs.**
## Expression

An expression is the representation of a non-materialized operator _inside_ of the [Memo Table].

**Logical properties** describe the structure and content of data returned by an expression.
There are 2 types of expressions: [Relational Expression]s and [Scalar Expression]s. A [Relational
Expression] can be either a [Logical Expression] or a [Physical Expression].

- Examples: row count, operator type, statistics, whether relational output columns can contain nulls.
Note that different kinds of expressions can have the same names as [Operator]s or [Plan Node]s, but
expressions solely indicate non-materialized relational or scalar operators in the [Memo Table].

**Physical properties** are characteristics of an expression that
impact its layout, presentation, or location, but not its logical content.
Operators outside of the [Memo Table] should _**not**_ be referred to as expressions, and should
instead be referred to as [Operator]s or [Plan Node]s.

- Examples: order and data distribution.
Notably, when we refer to an expression, _we are specifically talking about the representation of_
_operators inside the memo table_. A logical operator from an incoming logical plan should _not_
be called a [Logical Expression], and similarly a physical execution operator in the final output
physical plan should also _not_ be called a [Physical Expression].

Another way to think about this is that expressions are _not_ materialized, and plan nodes and
operators inside query plans _are_ materialized. Operators inside of query plans (both logical and
physical) should be referred to as either logical or physical [Operator]s or logical or physical
[Plan Node]s.

### Equivalence
Another key difference between expressions and [Plan Node]s is that expressions have 0 or more
**Group Identifiers** as children, and [Plan Node]s have 0 or more other [Plan Node]s as children.
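
The difference in children can be made concrete with a small sketch. The type and variant names below (`GroupId`, `LogicalExpression`, `LogicalOperator`) are assumptions made for illustration, and predicates are omitted for brevity; the real `optd` types may look quite different.

```rust
/// Hypothetical group identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GroupId(pub usize);

/// Lives in the memo table: children are group identifiers.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalExpression {
    Scan { table: String },
    Filter { child: GroupId },
    Join { left: GroupId, right: GroupId },
}

/// Lives in a query plan: children are other plan nodes.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalOperator {
    Scan { table: String },
    Filter { child: Box<LogicalOperator> },
    Join { left: Box<LogicalOperator>, right: Box<LogicalOperator> },
}

impl LogicalExpression {
    /// Children of an expression are groups, not concrete sub-plans.
    pub fn children(&self) -> Vec<GroupId> {
        match self {
            LogicalExpression::Scan { .. } => vec![],
            LogicalExpression::Filter { child } => vec![*child],
            LogicalExpression::Join { left, right } => vec![*left, *right],
        }
    }
}

impl LogicalOperator {
    /// Children of a plan node are concrete (materialized) plan nodes.
    pub fn children(&self) -> Vec<&LogicalOperator> {
        match self {
            LogicalOperator::Scan { .. } => vec![],
            LogicalOperator::Filter { child } => vec![child.as_ref()],
            LogicalOperator::Join { left, right } => vec![left.as_ref(), right.as_ref()],
        }
    }
}
```

Note that both enums share variant names, which mirrors the point above: the names can coincide, but only the memo-table version should be called an expression.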

Two logical expressions are equivalent if the logical properties of the two expressions are the same. They should produce the same set of rows and columns.
## Relational Expression

Two physical expressions are equivalent if their logical and physical properties are the same.
A relational expression is either a [Logical Expression] or a [Physical Expression].

Logical expression with a required physical property is equivalent to a physical expression if the physical expression has the same logical property and delivers the physical property.
When we say "relational", we mean representations of operations in the relational algebra of SQL.

Relational expressions differ from [Scalar Expression]s in that the result of algebraically
evaluating a relational expression produces a bag of tuples instead of a single scalar value.

### Group
See the following sections for more information.

A **group** consists of equivalent logical expressions.
## Logical Expression

A **relational group** consists of logically equivalent logical relational operators.
A logical expression is a version of a [Relational Expression].

A **scalar group** consists of logically equivalent logical scalar operators.
TODO(connor) Add more details.

### Rule
Examples of logical expressions include Logical Scan, Logical Join, or Logical Sort expressions
(which can be shortened to Scan, Join, or Sort).

a **rule** in Cascades transforms an expression into equivalent expressions. It has the following interface.
## Physical Expression

A physical expression is a version of a [Relational Expression].

TODO(connor) Add more details.

Examples of physical expressions include Table Scan, Index Scan, Hash Join, or Sort Merge Join.

## Scalar Expression

A scalar expression is a version of an [Expression].

A scalar expression describes an operation that can be evaluated to obtain a single value. This can
also be referred to as a SQL expression, a row expression, or a SQL predicate.

TODO(everyone) Figure out the semantics of what a scalar expression really is.

Examples of scalar expressions include the expressions `t1.a < 42` or `t1.b = t2.c`.

## Expression Equivalence

Two [Logical Expression]s are equivalent if the [Logical Property]s of the two expressions are the
same. In other words, the [Logical Plan]s they represent produce the same set of rows and columns.

Two [Physical Expression]s are equivalent if their [Logical Property]s and [Physical Property]s are the same.
In other words, the [Physical Plan]s they represent produce the same set of rows and columns, in the
exact same order and distribution.

TODO This next part is unclear?

A [Logical Expression] with a required [Physical Property] is equivalent to a [Physical Expression]
if the [Physical Expression] has the same [Logical Property] and delivers the [Physical Property].
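
One possible reading of these three definitions is sketched below. Everything here is an assumption made for illustration: the property structs are reduced to a list of output columns and a sort order, and "delivers" is interpreted as exact equality with the required property, which the glossary does not specify.

```rust
/// Hypothetical logical properties, reduced to the output columns.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogicalProperties {
    pub output_columns: Vec<String>,
}

/// Hypothetical physical properties, reduced to a sort order.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PhysicalProperties {
    pub sort_order: Vec<String>,
}

pub struct LogicalExpr {
    pub logical: LogicalProperties,
}

pub struct PhysicalExpr {
    pub logical: LogicalProperties,
    pub physical: PhysicalProperties,
}

/// Logical expressions: equivalent if their logical properties match.
pub fn logical_equivalent(a: &LogicalExpr, b: &LogicalExpr) -> bool {
    a.logical == b.logical
}

/// Physical expressions: logical and physical properties must both match.
pub fn physical_equivalent(a: &PhysicalExpr, b: &PhysicalExpr) -> bool {
    a.logical == b.logical && a.physical == b.physical
}

/// Logical expression plus a required physical property versus a physical
/// expression. ASSUMPTION: "delivers" means "is exactly equal to".
pub fn satisfies_requirement(
    logical: &LogicalExpr,
    required: &PhysicalProperties,
    physical: &PhysicalExpr,
) -> bool {
    logical.logical == physical.logical && physical.physical == *required
}
```

A less strict notion of "delivers" (for example, a sort on `(a, b)` delivering a required sort on `(a)`) would change only the last function.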

## Group

A **group** is a set of equivalent [Expression]s.

We follow the definition of groups in the Volcano and Cascades frameworks. From the [EQOP] Microsoft
article (Section 2.2, page 205):

> In the memo, each class of equivalent expressions is called an _equivalence class_ or a _group_,
> and all equivalent expressions within the class are called _group expressions_ or simply
> _expressions_.

## Relational Group

A relational group is a set of 1 or more equivalent [Logical Expression]s and 0 or more equivalent
[Physical Expression]s.

For a given relational group, the first step of optimization is exploration, in which equivalent
[Logical Expression]s are added to the group via [Transformation Rule]s. Once the search space for
the group has been exhausted (all possible transformation rules have been applied to all logical
expressions in the group), the group can be physically optimized. At this point, the search
algorithm will apply [Implementation Rule]s to cost and find the best execution plan.
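
The exploration step can be sketched as a fixpoint loop over a single group. This is a simplified illustration under assumed names (`TransformationRule` as a plain function pointer, `join_commutativity`, `explore_group`); it is not the `optd` search algorithm and it ignores rules that create or merge other groups.

```rust
use std::collections::HashSet;

/// Hypothetical group identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct GroupId(pub usize);

/// Hypothetical logical expressions with group identifiers as children.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum LogicalExpression {
    Scan(String),
    Join(GroupId, GroupId),
}

/// ASSUMPTION: a transformation rule modeled as a plain function pointer.
pub type TransformationRule = fn(&LogicalExpression) -> Vec<LogicalExpression>;

/// Join(A, B) is equivalent to Join(B, A).
pub fn join_commutativity(expr: &LogicalExpression) -> Vec<LogicalExpression> {
    match expr {
        LogicalExpression::Join(l, r) => vec![LogicalExpression::Join(*r, *l)],
        _ => vec![],
    }
}

/// Applies every rule to every logical expression in the group until no
/// new expression appears, i.e. the group's search space is exhausted.
pub fn explore_group(
    initial: Vec<LogicalExpression>,
    rules: &[TransformationRule],
) -> HashSet<LogicalExpression> {
    let mut seen: HashSet<LogicalExpression> = initial.iter().cloned().collect();
    let mut worklist = initial;
    while let Some(expr) = worklist.pop() {
        for rule in rules {
            for new_expr in rule(&expr) {
                // Only expressions not yet in the group are explored further.
                if seen.insert(new_expr.clone()) {
                    worklist.push(new_expr);
                }
            }
        }
    }
    seen
}
```

The `seen` set is what guarantees termination here: commutativity applied twice yields the original expression, which is already in the group and is therefore not queued again.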

TODO Add more details.

TODO Add example.

## Scalar Group

A scalar group consists of equivalent [Scalar Expression]s.

TODO Add more details.

TODO Add example.

<br>

# Plan Enumeration and Search Concepts

This section describes names and definitions of concepts related to the general plan enumeration and
search of optimal query plans.

## Query Plan

A query plan is a tree or DAG of relational and scalar operators. We can consider query optimization
to be a function from an unoptimized query plan to an optimized query plan. More specifically, the
input plan is generally a [Logical Plan] and the output plan is always a [Physical Plan].

We generally consider query plans to either be completely logical or completely physical. However,
when dealing with rule matching and rule application to enumerate different but equivalent query
plans, we also deal with partially materialized query plans that can be a mix of both logical and
physical operators (as well as group identifiers and other scalar operators).
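
A partially materialized plan could be modeled roughly as follows. The `PartialPlan` enum and its variants are hypothetical names used only for this sketch, and scalar operators are left out to keep it short.

```rust
/// Hypothetical group identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GroupId(pub usize);

/// ASSUMPTION: one enum mixing logical operators, physical operators,
/// and unmaterialized references into the memo table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PartialPlan {
    /// An unmaterialized subtree, left as a group identifier.
    Group(GroupId),
    LogicalScan(String),
    LogicalJoin(Box<PartialPlan>, Box<PartialPlan>),
    PhysicalHashJoin(Box<PartialPlan>, Box<PartialPlan>),
}

impl PartialPlan {
    /// Returns true when no group identifiers remain anywhere in the plan.
    pub fn is_fully_materialized(&self) -> bool {
        match self {
            PartialPlan::Group(_) => false,
            PartialPlan::LogicalScan(_) => true,
            PartialPlan::LogicalJoin(l, r) | PartialPlan::PhysicalHashJoin(l, r) => {
                l.is_fully_materialized() && r.is_fully_materialized()
            }
        }
    }
}
```

For rule matching, only the nodes a rule's pattern inspects need to be concrete; everything below them can stay a `Group`.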

TODO Add more details about partially materialized plans.

## Logical Plan

A logical plan is a tree or DAG of [Logical Operator]s that can be evaluated to produce a bag of
tuples. This can also be referred to as a logical query plan. The [Operator]s that make up this
logical plan can be considered logical plan nodes.

## Physical Plan

A physical plan is a tree or DAG of [Physical Operator]s that can be evaluated by an execution
engine to produce a table. This can also be referred to as a physical query plan. The [Operator]s
that make up this physical plan can be considered physical plan nodes.

## Operator

An operator is the materialized version of an [Expression]. Like expressions, there are both
relational operators and scalar operators.

See the following sections for more information.

## Relational Operator

A relational operator is a node in a [Query Plan] (which is a tree or DAG), and is the materialized
version of a [Relational Expression].

## Logical Operator

A logical operator is a node in a [Logical Plan] (which is a tree or DAG), and is the materialized
version of a [Logical Expression].

## Physical Operator

A physical operator is a node in a [Physical Plan] (which is a tree or DAG), and is the materialized
version of a [Physical Expression].

## Scalar Operator

A scalar operator is a node in a [Query Plan] that describes a scalar expression, and can be
considered the materialized version of a [Scalar Expression].

## Property

A property is metadata computed (and sometimes stored) for a given relational expression.

Properties of an expression may be _required_ by the original SQL query or _derived_ from the
[Physical Property] of one of its inputs.

TODO Add more details.

## Logical Property

A logical property describes the structure and content of data returned by a given expression.

Examples: row count, operator type, statistics, whether relational output columns can contain nulls.

TODO Clean up and add more details.

## Physical Property

A physical property is a characteristic of an expression that impacts its layout, presentation, or
location, but not its logical content.

Examples: order and data distribution.

TODO Clean up and add more details.

## Rule

A rule transforms a query plan or sub-plan into an equivalent plan.

Rules should have an interface similar to the following:

```rust
trait Rule {
    /// Checks whether the rule is applicable on the input expression.
    fn check_pattern(expr: Expr) -> bool;

    /// Transforms the expression into one or more equivalent expressions.
    fn transform(expr: Expr) -> Vec<Expr>;
}
```

A **transformation rule** transforms a **part** of the logical expression into logical expressions. This is also called a logical to logical transformation in other systems.
TODO Actually figure out the interface for rules since it's probably not going to look like that.

TODO Clean up and add more details.

An **implementation rule** transforms a **part** of a logical expression to an equivalent physical expression with physical properties.
## Transformation Rule

In Cascades, you don't need to materialize the entire query tree when applying rules. Instead, you can materialize expressions on demand while leaving unrelated parts of the tree as group identifiers.
A transformation rule transforms a _part_ of a logical expression into one or more equivalent logical expressions.

This is also called a logical to logical transformation in other systems.

TODO Clean up and add more details.

## Implementation Rule

An implementation rule transforms a _part_ of a logical expression to an equivalent physical
expression with physical properties.

In Cascades, you don't need to materialize the entire query tree when applying rules. Instead, you
can materialize expressions on demand while leaving unrelated parts of the tree as group identifiers.
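
As a rough illustration, an implementation rule can be written so that it only looks at the root expression and passes the children through as group identifiers. The names below (`implement`, the enum variants) are assumptions for this sketch, and costing and physical properties are left out.

```rust
/// Hypothetical group identifier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct GroupId(pub usize);

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LogicalExpression {
    Scan { table: String },
    Join { left: GroupId, right: GroupId },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PhysicalExpression {
    TableScan { table: String },
    HashJoin { left: GroupId, right: GroupId },
    SortMergeJoin { left: GroupId, right: GroupId },
}

/// ASSUMPTION: a single function standing in for a set of implementation
/// rules. The children are never materialized; they stay group identifiers.
pub fn implement(expr: &LogicalExpression) -> Vec<PhysicalExpression> {
    match expr {
        LogicalExpression::Scan { table } => {
            vec![PhysicalExpression::TableScan { table: table.clone() }]
        }
        LogicalExpression::Join { left, right } => vec![
            PhysicalExpression::HashJoin { left: *left, right: *right },
            PhysicalExpression::SortMergeJoin { left: *left, right: *right },
        ],
    }
}
```

Because the join's inputs are only `GroupId`s, the rule never needs to know which expressions those groups contain.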

In other systems, there are physical-to-physical expression transformations for execution-engine-specific optimization, physical property enforcement, or distributed planning. At the moment, we are **not** considering physical-to-physical transformations.

**Enforcer rule:** *TODO!*
TODO Clean up and add more details.

## Enforcer Rule

TODO Write this section.

## Enforcer Operator

TODO Write this section.
