|
1 |
| -# Extensible SQL Lexer and Parser for Rust |
| 1 | +> _The following pertains to the `cst` branch; the [upstream README is below](#upstream-readme)._ |
| 2 | +> |
| 3 | +> ⚠️ This branch is regularly rebased. Please let me know before working off it to coordinate. |
| 4 | +
|
| 5 | +**Preserving full source code information ([#161](https://github.com/andygrove/sqlparser-rs/issues/161)) would enable SQL rewriting/refactoring tools based on sqlparser-rs.** For example: |
| 6 | +1. **Error reporting**, both in the parser and in later stages of query processing, would benefit from knowing the source code location of SQL constructs ([#179](https://github.com/andygrove/sqlparser-rs/issues/179)) |
| 7 | +2. **SQL pretty-printing** requires comments to be preserved in the AST (see [#175](https://github.com/andygrove/sqlparser-rs/issues/175), mentioning [forma](https://github.com/maxcountryman/forma)) |
| 8 | +3. **Refactoring via AST transformations** would also benefit from having full control over serialization, a possible solution for dialect-specific "writers" ([#18](https://github.com/andygrove/sqlparser-rs/issues/18)) |
| 9 | +4. Analyzing partially invalid code may be useful in the context of an IDE or other tooling. |
| 10 | + |
| 11 | +**I think that adopting [rust-analyzer's design][ra-syntax], which includes a lossless syntax tree, is the right direction for sqlparser-rs.** In addition to solving the use-cases described above, it helps in other ways: |
| 12 | + |
| 13 | +5. We can omit syntax that does not affect semantics of the query (e.g. [`ROW` vs `ROWS`](https://github.com/andygrove/sqlparser-rs/blob/418b9631ce9c24cf9bb26cf7dd9e42edd29de985/src/ast/query.rs#L416)) from the typed AST by default, reducing the implementation effort. |
| 14 | +6. Having a homogeneous syntax tree also alleviates the need for a "visitor" ([#114](https://github.com/andygrove/sqlparser-rs/pull/114)), further reducing the burden on implementors of new syntax |
| 15 | + |
| 16 | +In 2020 many new people contributed to `sqlparser-rs`, some bringing up the use-cases above. I found myself mentioning this design multiple times, so I felt I should "show the code" instead of just talking about it. |
| 17 | + |
| 18 | +Current typed AST vs rowan |
| 19 | +========================== |
| 20 | + |
| 21 | +To recap, the input SQL is currently parsed directly into _typed AST_ - with each node of the tree represented by a Rust `struct`/`enum` of a specific type, referring to other structs of specific type, such as: |
| 22 | + |
| 23 | + struct Select { |
| 24 | + pub projection: Vec<SelectItem>, |
| 25 | + ... |
| 26 | + } |
| 27 | + |
| 28 | +We try to retain most of the "important" syntax in this representation (including, for example, [`Ident::quote_style`](https://github.com/andygrove/sqlparser-rs/blob/d32df527e68dd76d857f47ea051a3ec22138469b/src/ast/mod.rs#L77) and [`OffsetRows`](https://github.com/andygrove/sqlparser-rs/blob/418b9631ce9c24cf9bb26cf7dd9e42edd29de985/src/ast/query.rs#L416)), but there doesn't seem to be a practical way to extend it to also store whitespace, comments, and source code spans. |
| 29 | + |
| 30 | +The lossless syntax tree |
| 31 | +------------------------ |
| 32 | + |
| 33 | +In the alternative design, the parser produces a tree (which I'll call "CST", [though the term is not 100% correct](https://dev.to/cad97/lossless-syntax-trees-280c)), in which every node has the same Rust type (`SyntaxNode`), and a numeric `SyntaxKind` determines what kind of node it is. Under the hood, the leaf and the non-leaf nodes are different: |
| 34 | + |
| 35 | +* Each leaf node stores a slice of the source text; |
| 36 | +* Each intermediate node represents a string obtained by concatenating the text of its children; |
| 37 | +* The root node, consequently, represents exactly the original source code. |
| 38 | + |
| 39 | +_(The actual [rust-analyzer's design][ra-syntax] is more involved, but the details are not relevant to this discussion.)_ |
| 40 | + |
| 41 | +As an example, an SQL query "`select DISTINCT /* ? */ 1234`" could be represented as a tree like the following one: |
| 42 | + |
| 43 | + |
| 44 | + |
| 45 | + |
| 46 | + DISTINCT_OR_ALL |
| 47 | + |
| 48 | + |
| 49 | + |
| 50 | + |
| 51 | + |
| 52 | + |
| 53 | + |
| 54 | +_(Using the `SyntaxKind@start_pos..end_pos "relevant source code"` notation)_ |
| 55 | + |
| 56 | +Note how all the formatting and comments are retained. |
| 57 | + |
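The three properties listed above (leaves own text, intermediate nodes concatenate their children, the root reproduces the source) can be illustrated with a minimal, std-only sketch. It does not use `rowan`; the `Node` type, the `SyntaxKind` variants and the `example_tree` helper are all hypothetical names chosen for this illustration, and the kinds are not necessarily the ones used on this branch.

```rust
use std::fmt::Write;

/// Hypothetical node kinds for the example query (an assumption of this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind { Select, DistinctOrAll, SelectItem, Keyword, Whitespace, Comment, Number }

/// One homogeneous node type: a leaf owns a slice of the source text,
/// an intermediate node owns only its children.
#[derive(Debug)]
enum Node {
    Leaf { kind: SyntaxKind, text: String },
    Inner { kind: SyntaxKind, children: Vec<Node> },
}

impl Node {
    fn leaf(kind: SyntaxKind, text: &str) -> Node {
        Node::Leaf { kind, text: text.to_string() }
    }

    fn inner(kind: SyntaxKind, children: Vec<Node>) -> Node {
        Node::Inner { kind, children }
    }

    /// The text of a node is the concatenation of the text of its leaves.
    fn text(&self) -> String {
        match self {
            Node::Leaf { text, .. } => text.clone(),
            Node::Inner { children, .. } => children.iter().map(Node::text).collect(),
        }
    }

    /// Renders the subtree in the `SyntaxKind@start..end "text"` notation.
    fn dump(&self, start: usize, depth: usize, out: &mut String) {
        let end = start + self.text().len();
        let indent = "  ".repeat(depth);
        match self {
            Node::Leaf { kind, text } => {
                writeln!(out, "{}{:?}@{}..{} {:?}", indent, kind, start, end, text).unwrap();
            }
            Node::Inner { kind, children } => {
                writeln!(out, "{}{:?}@{}..{}", indent, kind, start, end).unwrap();
                let mut pos = start;
                for child in children {
                    child.dump(pos, depth + 1, out);
                    pos += child.text().len();
                }
            }
        }
    }
}

/// Hand-built tree for `select DISTINCT /* ? */ 1234`; a real parser would produce it.
fn example_tree() -> Node {
    use SyntaxKind as K;
    Node::inner(K::Select, vec![
        Node::leaf(K::Keyword, "select"),
        Node::leaf(K::Whitespace, " "),
        Node::inner(K::DistinctOrAll, vec![Node::leaf(K::Keyword, "DISTINCT")]),
        Node::leaf(K::Whitespace, " "),
        Node::inner(K::SelectItem, vec![
            Node::leaf(K::Comment, "/* ? */"),
            Node::leaf(K::Whitespace, " "),
            Node::leaf(K::Number, "1234"),
        ]),
    ])
}
```

Calling `example_tree().text()` gives back the original query byte for byte, comment and spacing included, which is the property the rewriting and pretty-printing use-cases rely on.
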
| 58 | +Such a tree data structure is available for re-use as a separate crate ([`rowan`](https://github.com/rust-analyzer/rowan)), and **as a proof of concept I extended the parser to populate a rowan-based tree _along with the typed AST_, for a few of the supported SQL constructs.** |
| 59 | + |
| 60 | +Open question: The future design of the typed AST |
| 61 | +------------------------------------------------- |
| 62 | + |
| 63 | +Though included in the PoC, **constructing both an AST and a CST in parallel should be considered a transitional solution only**, as it will not let us reap the full benefits of the proposed design (esp. points 1, 4, and 5). Additionally, the current one-AST-fits-all approach makes every new feature in the parser a (semver) breaking change, and makes the types as loose as necessary to fit the common denominator (e.g. if something is optional in one dialect, it has to be optional in the AST). |
| 64 | + |
| 65 | +What can we do instead? |
| 66 | + |
| 67 | +### Rust-analyzer's AST |
| 68 | + |
| 69 | +In rust-analyzer the AST layer does not store any additional data. Instead a "newtype" (a struct with exactly one field - the underlying CST node) is defined for each AST node type: |
| 70 | + |
| 71 | + struct Select { syntax: SyntaxNode } // the newtype |
| 72 | + impl Select { |
| 73 | + fn syntax(&self) -> &SyntaxNode { &self.syntax } |
| 74 | + fn cast(syntax: SyntaxNode) -> Option<Self> { |
| 75 | + match syntax.kind() { |
| 76 | + SyntaxKind::SELECT => Some(Select { syntax }), |
| 77 | + _ => None, |
| 78 | + } |
| 79 | + } |
| 80 | + ... |
| 81 | + |
| 82 | +Such newtypes define APIs to let the user navigate the AST through accessors specific to this node type: |
| 83 | + |
| 84 | + // ...`impl Select` continued |
| 85 | + pub fn distinct_or_all(&self) -> Option<DistinctOrAll> { |
| 86 | + AstChildren::new(&self.syntax).next() |
| 87 | + } |
| 88 | + pub fn projection(&self) -> AstChildren<SelectItem> { |
| 89 | + AstChildren::new(&self.syntax) |
| 90 | + } |
| 91 | + |
| 92 | +These accessors go through the node's direct children, looking for nodes of a specific `SyntaxKind` (by trying to `cast()` them to the requested output type). |
| 93 | + |
| 94 | +This approach is a good fit for IDEs, as it can work on partial / invalid source code due to its lazy nature. Whether it is acceptable in other contexts is an open question ([though it was **not** rejected w.r.t rust-analyzer and rustc sharing a libsyntax2.0](https://github.com/rust-lang/rfcs/pull/2256)). |
| 95 | + |
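To make the newtype-plus-`cast()` pattern described above concrete, here is a self-contained sketch. It is an illustration under assumptions, not rust-analyzer's or this branch's code: `SyntaxNode` below is a tiny `Rc`-based stand-in for rowan's type, and `child_nodes`, the `ast_node!` macro and the `example` helper are hypothetical names.

```rust
use std::rc::Rc;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind { Select, DistinctOrAll, SelectItem, Keyword, Whitespace, Comma, Number }

#[derive(Debug)]
struct NodeData { kind: SyntaxKind, text: String, children: Vec<SyntaxNode> }

/// Stand-in for rowan's `SyntaxNode`: a cheaply clonable handle to a CST node.
#[derive(Debug, Clone)]
struct SyntaxNode(Rc<NodeData>);

impl SyntaxNode {
    fn new(kind: SyntaxKind, text: &str, children: Vec<SyntaxNode>) -> Self {
        SyntaxNode(Rc::new(NodeData { kind, text: text.to_string(), children }))
    }
    fn kind(&self) -> SyntaxKind { self.0.kind }
    fn children(&self) -> impl Iterator<Item = SyntaxNode> + '_ {
        self.0.children.iter().cloned()
    }
    /// Leaves return their own text; other nodes concatenate their children.
    fn text(&self) -> String {
        if self.0.children.is_empty() {
            self.0.text.clone()
        } else {
            self.children().map(|c| c.text()).collect()
        }
    }
}

/// Every typed AST node is a view over a CST node of one particular kind.
trait AstNode: Sized {
    fn cast(syntax: SyntaxNode) -> Option<Self>;
    fn syntax(&self) -> &SyntaxNode;
}

/// Declares a newtype with exactly one field, the underlying CST node.
macro_rules! ast_node {
    ($name:ident, $kind:ident) => {
        struct $name { syntax: SyntaxNode }
        impl AstNode for $name {
            fn cast(syntax: SyntaxNode) -> Option<Self> {
                if syntax.kind() == SyntaxKind::$kind { Some($name { syntax }) } else { None }
            }
            fn syntax(&self) -> &SyntaxNode { &self.syntax }
        }
    };
}

ast_node!(Select, Select);
ast_node!(DistinctOrAll, DistinctOrAll);
ast_node!(SelectItem, SelectItem);

/// Lazily walks the direct children, keeping those that `cast()` to `N`.
fn child_nodes<'a, N: AstNode + 'a>(parent: &'a SyntaxNode) -> impl Iterator<Item = N> + 'a {
    parent.children().filter_map(N::cast)
}

impl Select {
    fn distinct_or_all(&self) -> Option<DistinctOrAll> { child_nodes(&self.syntax).next() }
    fn projection(&self) -> Vec<SelectItem> { child_nodes(&self.syntax).collect() }
}

/// Hand-built CST for `select 1, 2`.
fn example() -> SyntaxNode {
    use SyntaxKind as K;
    let leaf = |kind, text: &str| SyntaxNode::new(kind, text, vec![]);
    SyntaxNode::new(K::Select, "", vec![
        leaf(K::Keyword, "select"),
        leaf(K::Whitespace, " "),
        SyntaxNode::new(K::SelectItem, "", vec![leaf(K::Number, "1")]),
        leaf(K::Comma, ","),
        leaf(K::Whitespace, " "),
        SyntaxNode::new(K::SelectItem, "", vec![leaf(K::Number, "2")]),
    ])
}
```

Nothing is converted up front: `Select::cast(example())` only checks the kind, and `projection()` filters the children when it is called. That is why the same accessors keep working on a tree that is incomplete or contains nodes the AST layer knows nothing about.
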
| 96 | +### Code generation and other options for the AST |
| 97 | + |
| 98 | +Though the specific form of the AST is yet to be determined, it seems necessary to use some form of automation to build an AST based on a CST, so that we don't have 3 places (the parser, the AST, and the CST->AST converter) to keep synchronised. |
| 99 | + |
| 100 | +Rust-analyzer implements its own simple code generator, which [generates](https://github.com/rust-analyzer/rust-analyzer/blob/a0be39296d2925972cacd9fbf8b5fb258fad6947/xtask/src/codegen/gen_syntax.rs#L47) methods like the above based on a definition [like](https://github.com/rust-analyzer/rust-analyzer/blob/a0be39296d2925972cacd9fbf8b5fb258fad6947/xtask/src/ast_src.rs#L293) this: |
| 101 | + |
| 102 | + const AST_SRC: AstSrc = AstSrc { |
| 103 | + nodes: &ast_nodes! { |
| 104 | + struct Select { |
| 105 | + DistinctOrAll, |
| 106 | + projection: [SelectItem], |
| 107 | + ... |
| 108 | + |
| 109 | +_(Here the `ast_nodes!` macro converts something that looks like a simplified `struct` declaration to a literal value describing the struct's name and fields.)_ |
| 110 | + |
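As a rough illustration of what such a generator does, the sketch below turns a declarative node description into the source text of the accessor methods shown earlier. It is a simplified assumption of how this could look, not rust-analyzer's actual generator: `NodeSrc`, `FieldSrc`, `Cardinality` and `generate` are hypothetical names, and the output is a plain string rather than a token stream.

```rust
/// How many children of the given type a field may match.
enum Cardinality {
    Optional,
    Many,
}

/// Declarative description of one accessor (hypothetical shape).
struct FieldSrc {
    name: &'static str,
    ty: &'static str,
    cardinality: Cardinality,
}

/// Declarative description of one AST node type.
struct NodeSrc {
    name: &'static str,
    fields: &'static [FieldSrc],
}

/// The `Select` node from the definition above, written out as data.
const SELECT: NodeSrc = NodeSrc {
    name: "Select",
    fields: &[
        FieldSrc { name: "distinct_or_all", ty: "DistinctOrAll", cardinality: Cardinality::Optional },
        FieldSrc { name: "projection", ty: "SelectItem", cardinality: Cardinality::Many },
    ],
};

/// Emits an `impl` block with one accessor per described field.
fn generate(node: &NodeSrc) -> String {
    let mut out = format!("impl {} {{\n", node.name);
    for field in node.fields {
        let (ret, body) = match field.cardinality {
            Cardinality::Optional => (
                format!("Option<{}>", field.ty),
                "AstChildren::new(&self.syntax).next()",
            ),
            Cardinality::Many => (
                format!("AstChildren<{}>", field.ty),
                "AstChildren::new(&self.syntax)",
            ),
        };
        out.push_str(&format!(
            "    pub fn {}(&self) -> {} {{\n        {}\n    }}\n",
            field.name, ret, body
        ));
    }
    out.push_str("}\n");
    out
}
```

Note that each field is identified only by its type, which is exactly the limitation discussed in the bullets that follow: two fields of the same type cannot be told apart by a description of this shape.
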
| 111 | +A similar approach could be tried to eagerly build an AST akin to our current one [[*](#ref-1)]. A quick survey of our AST reveals some incompatibilities between the rust-analyzer-style codegen and our use-case: |
| 112 | + |
| 113 | +* In rust-analyzer all AST enums use fieldless variants (`enum Foo { Bar(Bar), Baz(Baz) }`), making codegen easier. sqlparser uses variants with fields, though there was a request to move to fieldless ([#40](https://github.com/andygrove/sqlparser-rs/issues/40)). |
| 114 | + |
| 115 | + In my view, our motivation here was conciseness and inertia (little reason to rewrite, and much effort needed to update - both the library, including the tests, and the consumers). I think this can change. |
| 116 | + |
| 117 | +* RA's codegen assumes that the _type_ of a node usually determines its relation to its parent: different fields in a code-generated struct have to be of different types, as all children of a given type are available from a single "field". Odd cases like `BinExpr`'s `lhs` and `rhs` (both having the type `Expr`) are [implemented manually](https://github.com/rust-analyzer/rust-analyzer/blob/a0be39296d2925972cacd9fbf8b5fb258fad6947/crates/ra_syntax/src/ast/expr_extensions.rs#L195). |
| 118 | + |
| 119 | + It's clear this does not work as well for an AST like ours. In rare cases there's a clear problem with our AST (e.g. `WhenClause` for `CASE` expressions is introduced on this branch), but consider: |
| 120 | + |
| 121 | + pub struct Select { |
| 122 | + //... |
| 123 | + pub projection: Vec<SelectItem>, |
| 124 | + pub selection: Option<Expr>, // the WHERE clause |
| 125 | + pub group_by: Vec<Expr>, |
| 126 | + pub having: Option<Expr>, |
| 127 | + |
| 128 | + The CST for this should probably have separate branches for `WHERE`, `GROUP BY` and `HAVING` at least. Should we introduce additional types like `Where` or make codegen handle this somehow? |
| 129 | + |
| 130 | +* Large portions of the `struct`s in our AST are allocated at once. We use `Box` only where necessary to break the cycles. RA's codegen doesn't have a way to specify these points. |
| 131 | + |
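One possible answer to the `WHERE`/`HAVING` question above is to give each clause its own wrapper node in the CST, so that same-typed children are distinguished by the kind of their parent. The sketch below assumes hypothetical `WhereClause` and `HavingClause` kinds and a simplified, borrowed `SyntaxNode`; for brevity an expression is stored as a single leaf rather than as a subtree of tokens.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind { Select, WhereClause, HavingClause, Expr, Keyword, Whitespace }

#[derive(Debug, Clone)]
struct SyntaxNode { kind: SyntaxKind, text: String, children: Vec<SyntaxNode> }

impl SyntaxNode {
    fn leaf(kind: SyntaxKind, text: &str) -> Self {
        SyntaxNode { kind, text: text.to_string(), children: vec![] }
    }
    fn inner(kind: SyntaxKind, children: Vec<SyntaxNode>) -> Self {
        SyntaxNode { kind, text: String::new(), children }
    }
    fn child_of_kind(&self, kind: SyntaxKind) -> Option<&SyntaxNode> {
        self.children.iter().find(|c| c.kind == kind)
    }
    fn text(&self) -> String {
        if self.children.is_empty() {
            self.text.clone()
        } else {
            self.children.iter().map(|c| c.text()).collect()
        }
    }
}

/// Typed view over a `Select` CST node.
struct Select<'a> { syntax: &'a SyntaxNode }

impl<'a> Select<'a> {
    fn cast(syntax: &'a SyntaxNode) -> Option<Self> {
        if syntax.kind == SyntaxKind::Select { Some(Select { syntax }) } else { None }
    }
    /// Both accessors yield an `Expr` node; the wrapper kind tells them apart.
    fn selection(&self) -> Option<&'a SyntaxNode> {
        self.clause_expr(SyntaxKind::WhereClause)
    }
    fn having(&self) -> Option<&'a SyntaxNode> {
        self.clause_expr(SyntaxKind::HavingClause)
    }
    fn clause_expr(&self, clause: SyntaxKind) -> Option<&'a SyntaxNode> {
        self.syntax.child_of_kind(clause)?.child_of_kind(SyntaxKind::Expr)
    }
}

/// Hand-built CST for `select x where a having b`.
fn example() -> SyntaxNode {
    use SyntaxKind as K;
    SyntaxNode::inner(K::Select, vec![
        SyntaxNode::leaf(K::Keyword, "select"),
        SyntaxNode::leaf(K::Whitespace, " "),
        SyntaxNode::leaf(K::Expr, "x"),
        SyntaxNode::leaf(K::Whitespace, " "),
        SyntaxNode::inner(K::WhereClause, vec![
            SyntaxNode::leaf(K::Keyword, "where"),
            SyntaxNode::leaf(K::Whitespace, " "),
            SyntaxNode::leaf(K::Expr, "a"),
        ]),
        SyntaxNode::leaf(K::Whitespace, " "),
        SyntaxNode::inner(K::HavingClause, vec![
            SyntaxNode::leaf(K::Keyword, "having"),
            SyntaxNode::leaf(K::Whitespace, " "),
            SyntaxNode::leaf(K::Expr, "b"),
        ]),
    ])
}
```

The trade-off is that the typed AST either grows extra types such as `Where`, or the code generator has to learn to "look through" a wrapper node, which is the choice the question above leaves open.
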
| 132 | +Of course we're not limited to stealing ideas from rust-analyzer, so alternatives can be considered. |
| 133 | + |
| 134 | +* Should we code-gen based on a real AST definition instead of quasi-Rust code inside a macro like `ast_nodes`? |
| 135 | +* Can `serde` be of help? |
| 136 | + |
| 137 | +I think the design of the CST should be informed by the needs of the AST, so **this is the key question for me.** I've extracted the types and the fields of the current AST into a table (see `ast-stats.js` and `ast-fields.tsv` in `util/`) to help come up with a solution. |
| 138 | + |
| 139 | + |
| 140 | +Other tasks |
| 141 | +----------- |
| 142 | + |
| 143 | +Other than coming up with an AST/CST design, there are a number of things to do: |
| 144 | + |
| 145 | +* Upstream the "Support parser backtracking in the GreenNodeBuilder" commit to avoid importing a copy of `GreenNodeBuilder` into sqlparser-rs |
| 146 | +* Set up the testing infrastructure for the CST (rust-analyzer, again, has some good ideas here) |
| 147 | + |
| 148 | +<!-- the following is copied from the |
| 149 | + "Introduce CST infrastructure based on rowan" commit --> |
| 150 | +- Fix `Token`/`SyntaxKind` duplication, changing the former to |
| 151 | + store a slice of the original source code, e.g. |
| 152 | + `(SyntaxKind, SmolStr)` |
| 153 | + |
| 154 | + This should also fix the currently disabled test-cases where `Token`'s |
| 155 | + `to_string()` does not return the original string: |
| 156 | + |
| 157 | + * `parse_escaped_single_quote_string_predicate` |
| 158 | + * `parse_literal_string` (the case of `HexLiteralString`) |
| 159 | + |
| 160 | +- Fix the hack in `parse_keyword()` to remap the token type (RA has |
| 161 | + `bump_remap()` for this) |
| 162 | + |
| 163 | +- Fix the `Parser::pending` hack (needs rethinking parser's API) |
| 164 | + |
| 165 | + Probably related is the issue of handling whitespace and comments: |
| 166 | + the way this prototype handles it looks wrong. |
| 167 | + |
| 168 | +Remarks |
| 169 | +------- |
| 170 | + |
| 171 | +1. <a name="ref-1">[*]</a> During such eager construction of an AST we could also bail on CST nodes that have no place in the typed AST. This seems a part of the possible solution to the dialect problem: this way the parser can recognize a dialect-specific construct, while each consumer can pick which bits they want to support by defining their own typed AST. |
| 172 | + |
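The remark above can be sketched as an eager CST-to-AST conversion that returns an error for any node kind the consumer's typed AST has no place for. Everything here is a hypothetical illustration: the flat `SyntaxNode`, the `TopClause` kind standing in for a dialect-specific construct, and the `build_select` and `ConvertError` names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyntaxKind {
    Select,
    SelectItem,
    /// Keywords, whitespace and comments: no typed counterpart is needed.
    Token,
    /// Stand-in for a dialect-specific construct the parser recognizes.
    TopClause,
}

#[derive(Debug, Clone)]
struct SyntaxNode { kind: SyntaxKind, text: String, children: Vec<SyntaxNode> }

fn node(kind: SyntaxKind, text: &str, children: Vec<SyntaxNode>) -> SyntaxNode {
    SyntaxNode { kind, text: text.to_string(), children }
}

/// A consumer-defined typed AST that deliberately has no field for `TopClause`.
#[derive(Debug, PartialEq)]
struct Select { projection: Vec<String> }

#[derive(Debug, PartialEq)]
enum ConvertError {
    NotASelect(SyntaxKind),
    Unsupported(SyntaxKind),
}

/// Eagerly builds the typed AST, bailing on CST nodes it cannot represent.
fn build_select(cst: &SyntaxNode) -> Result<Select, ConvertError> {
    if cst.kind != SyntaxKind::Select {
        return Err(ConvertError::NotASelect(cst.kind));
    }
    let mut projection = Vec::new();
    for child in &cst.children {
        match child.kind {
            SyntaxKind::SelectItem => projection.push(child.text.clone()),
            SyntaxKind::Token => {}
            other => return Err(ConvertError::Unsupported(other)),
        }
    }
    Ok(Select { projection })
}
```

A consumer that does want to support the construct would define its own `Select` with an extra field and its own conversion, while the parser and the CST stay the same.
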
| 173 | + |
| 174 | +[ra-syntax]: https://github.com/rust-analyzer/rust-analyzer/blob/master/docs/dev/syntax.md |
| 175 | + |
| 176 | + |
| 177 | + |
| 178 | + |
| 179 | + |
| 180 | + |
| 181 | + |
| 182 | + |
| 183 | + |
| 184 | +# Upstream README |
2 | 185 |
|
3 | 186 | [](https://opensource.org/licenses/Apache-2.0)
|
4 | 187 | [](https://crates.io/crates/sqlparser)
|
|