Skip to content
This repository has been archived by the owner on Jan 7, 2025. It is now read-only.

feat(ng-kernel): add the persistent kernel #251

Closed
wants to merge 4 commits into from
Closed

Conversation

skyzh
Copy link
Member

@skyzh skyzh commented Nov 27, 2024

Thinking about the new start point of optd, let's build something from a MVP instead of getting everything working at the first place.

We have naive + persistent memo table. The memo table doesn't store cost and properties for now. It finds the duplicates. It's async.

On the persistent side, we use sqlx to run the queries in in-memory sqlite to eliminate external dependencies when running tests. The persistent memo table currently supports dedup predicates.

Comment on lines +6 to +9
pub mod memo;
pub mod naive_memo;
pub mod optimizer;
pub mod persistent_memo;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every file needs module-level documentation. And following the standard of most modern Rust open-source projects, this should be in src/cascades/mod.rs, especially since it is just making its submodules public.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, most of them were pub(crate) in the old codebase

Comment on lines +11 to +24
anyhow = "1"
async-recursion = "1"
async-trait = "0.1"
arrow-schema = "47.0.0"
tracing = "0.1"
ordered-float = "4"
itertools = "0.13"
serde = { version = "1.0", features = ["derive", "rc"] }
chrono = "0.4"
sqlx = { version = "0.8", features = [
"runtime-tokio",
"sqlite",
] } # TODO: strip the features, move to another crate
serde_json = { version = "1" } # TODO: move to another crate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should strive to make our dependency versioning consistent. Ideally, we follow minimal versions (-Zminimal-versions), but I can understand that might be tedious. At the very least, we should choose something consistent.

Comment on lines +8 to +15
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug, Default, Hash)]
pub struct GroupId(pub(super) usize);

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug, Default, Hash)]
pub struct ExprId(pub usize);

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug, Default, Hash)]
pub struct PredId(pub usize);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These all need documentation, even if it is just a single line. Someone reading this might have no idea what PredId actually represents.

@skyzh
Copy link
Member Author

skyzh commented Nov 27, 2024

The next steps:

We should split the SQL migration things into a new crate, and potentially use a ORM, though I think an ORM is too heavy in our case -- I don't think we want to worry about schemas changes of optd-core for now.

On the memo table side, we can continue adding new functionalities: logical props, winner, etc.

And, with the current memo table, we can already implement 4 out of 5 cascade tasks (optimize input needs to know and update the winner)

Plus, I think we can start documenting whatever we have now + cut down the features we need on these crates. i.e., we should only use sqlx any or impl Connection interface in the core, and let the cli to decide which database backend to use.

There are also a few optimization opportunities in the memo table -- creating indexes, better merge group (can we really do lazy merging on SQL?)

Comment on lines +17 to +21
impl Display for GroupId {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "!{}", self.0)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why there is an exclamation mark here? Shouldn't this just be something like:

impl Display for GroupId {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "GroupId: {}", self.0)
    }
}

If there is a reason, then it should be documented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the old convention from optd-core that makes dump pretty...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be documented! Even if this was in the old repo, it's not obvious that this is the purpose.

Comment on lines +6 to +7
//! The RelNode is the basic data structure of the optimizer. It is dynamically typed and is
//! the internal representation of the plan nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start, but this could be much more specific. What else does this file contain?

PredNode, Value,
},
};

Copy link
Member

@connortsui20 connortsui20 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic in this file seems pretty important... is there a reason why it is in src/tests/common and not some normal submodule?

Also, generally we use tests outside of the src directory for integration tests.

Copy link
Member Author

@skyzh skyzh Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a demo repr for testing the core, it will be used in cascades/heuristics/memo unit tests

Comment on lines +45 to +55
impl std::fmt::Display for MemoTestRelTyp {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{:?}", self)
}
}

impl std::fmt::Display for MemoTestPredTyp {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{:?}", self)
}
}
Copy link
Member

@connortsui20 connortsui20 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that we need to implement Display for everything instead of just deriving Debug?

Comment on lines +68 to +77
// TODO: move this into nodes.rs?
#[derive(Clone, Debug, Hash, PartialEq, Eq, Serialize, Deserialize)]
pub struct PersistentPredNode<T: PersistentNodeType> {
/// A generic predicate node type
pub typ: T::PredType,
/// Child predicate nodes, always materialized
pub children: Vec<PersistentPredNode<T>>,
/// Data associated with the predicate, if any
pub data: Option<Value>,
}
Copy link
Member

@connortsui20 connortsui20 Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation on this struct needs to be more specific. "A generic predicate node type" doesn't really say anything beyond what pub typ: T::PredType says, and it doesn't say what the purpose of this struct is for.

Edit: See https://github.com/cmu-db/optd/pull/251#issuecomment-2502532804

//
// Use of this source code is governed by an MIT-style license that can be found in the LICENSE file or at
// https://opensource.org/licenses/MIT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need documentation on what is in this file. Even when reading over it, I'm not sure why the things in here should be considered "common"

#[derive(Clone, Debug, Hash, PartialEq, Eq, Serialize, Deserialize)]
pub struct PersistentPredNode<T: PersistentNodeType> {
/// A generic predicate node type
pub typ: T::PredType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, most open source projects I see that have type tags use the word "kind" instead of "typ", and I personally thing that reads better and make more sense.


// TODO: move this into nodes.rs?
#[derive(Clone, Debug, Hash, PartialEq, Eq, Serialize, Deserialize)]
pub struct PersistentPredNode<T: PersistentNodeType> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are predicate nodes absolutely necessary for an minimum viable product? If they are, then I can understand (and it should be documented why it is absolutely necessary). If not, then is it possible to either remove this (temporarily, of course), or is it possible to move this logic into some other submodule? This might make more sense if there are detailed module-level docs.

Comment on lines +79 to +105
impl<T: PersistentNodeType> From<PersistentPredNode<T>> for PredNode<T> {
fn from(node: PersistentPredNode<T>) -> Self {
PredNode {
typ: node.typ,
children: node
.children
.into_iter()
.map(|x| Arc::new(x.into()))
.collect(),
data: node.data,
}
}
}

impl<T: PersistentNodeType> From<PredNode<T>> for PersistentPredNode<T> {
fn from(node: PredNode<T>) -> Self {
PersistentPredNode {
typ: node.typ,
children: node
.children
.into_iter()
.map(|x| x.as_ref().clone().into())
.collect(),
data: node.data,
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level documentation on why these should be implemented would be nice.

Comment on lines +91 to +97
impl Value {
pub fn as_u8(&self) -> u8 {
match self {
Value::UInt8(i) => *i,
_ => panic!("Value is not an u8"),
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this boilerplate (above and below) should be moved into a submodule.

Comment on lines +217 to +218
pub trait NodeType:
PartialEq + Eq + Hash + Clone + 'static + Display + Debug + Send + Sync
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this could just be Copy and most of these bounds can just go away... Either that or Into<i16> or something similar?

Comment on lines +225 to +228
pub trait PersistentNodeType: NodeType {
fn serialize_pred(pred: &ArcPredNode<Self>) -> serde_json::Value;

fn deserialize_pred(data: serde_json::Value) -> ArcPredNode<Self>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there should be different methods on these. Going back to a previous comment, do we absolutely need predicates for the MVP? And then shouldn't there be methods to serialize the entire node instead of just the type tag and the predicate?

Comment on lines +235 to +239
/// A pointer to a plan node
pub type ArcPlanNode<T> = Arc<PlanNode<T>>;

/// A pointer to a predicate node
pub type ArcPredNode<T> = Arc<PredNode<T>>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes things harder to read, and also harder to traverse with an LSP.

Comment on lines +241 to +245
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
pub enum PlanNodeOrGroup<T: NodeType> {
PlanNode(ArcPlanNode<T>),
Group(GroupId),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing documentation.

Comment on lines +283 to +292
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
pub struct PlanNode<T: NodeType> {
/// A generic plan node type
pub typ: T,
/// Child plan nodes, which may be materialized or placeholder group IDs
/// based on how this node was initialized
pub children: Vec<PlanNodeOrGroup<T>>,
/// Predicate nodes, which are always materialized
pub predicates: Vec<ArcPredNode<T>>,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs highly detailed documentation. Considering the importance of this struct, someone reading this codebase should only have to look at this single struct and be able to understand everything about optd's operator / plan representation.

@skyzh
Copy link
Member Author

skyzh commented Nov 27, 2024

I don't know how detailed you want -- could you please help document PlanNode to set a common expectations for us?

Copy link
Member

@connortsui20 connortsui20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'm going to stop myself because I feel I am getting too repetitive and it's not really conducive to good progress, and I feel myself getting too opinionated.

I think that this is a great start, but considering we want to build a kernel of a foundation, we have to hold the code quality to a very high standard. This means that everything needs to be documented.

I think that documentation will help a lot with readability, and it will probably also help us identify things that we can probably do to minimize the MVP. Do we need separate predicate representation for an MVP? Should we even have generic types at all? Should plan nodes be able to store their own fields to facilitate easy serialization and deserialization?

I think the biggest thing we need to worry about is what this would look like when backed by a database via an ORM. I think it makes a lot more sense to just allow plan nodes to have whatever fields they want as it means getting all of the information about a single expression is a single look + deserialization. By storing children and predicates as vectors, we necessitate multiple joins just to get information about a single expression.

If it is at all possible to minimize this even further, acknowledging that we will be losing functionality, that is what we should strive for with this. Aim for the smallest thing possible.

@connortsui20
Copy link
Member

I don't know how detailed you want -- could you please help document PlanNode to set a common expectations for us?

Every single field should have an explanation on why we need it to be there and how code should interact with it. For example, I had no idea that typ: T is a generic type tag that other developers are allowed to choose. I thought for the longest time that the generic type WAS the tag. That is a subtle difference that is very not obvious, because when you see a type like Foo<T>, you expect it to have some sort of "wrapping" behavior around T (think Vec<T> or Option<T>), but that is not what is happening here.

Why do the children and predicates need to be vectors? What does "materialized" actually mean? I now know what it means, but if someone comes along and looks at it, they would have no idea.

These are some of the things that should be written down. Or if it is already written down, there should a link to where it is.

@skyzh skyzh closed this Dec 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants