Version 3.0.0 Proposal draft with sqlglot as the parsing library #617

Draft
collerek wants to merge 21 commits into master from feature/v-3-draft

Conversation

collerek (Collaborator) commented Mar 31, 2026

---STILL WORK IN PROGRESS---

Major rewrite of sql-metadata's parsing engine from token-based to sqlglot AST-based architecture (v3).

Refactor:

  • A good starting point is the ARCHITECTURE.md file
  • Replaced token-based parser and sqlparse with sqlglot AST pipeline — raw SQL flows through SqlCleaner → DialectParser → sqlglot AST → specialized extractors, replacing manual tokenization
    and keyword matching
  • Decomposed monolithic parser.py into focused extractor classes — ColumnExtractor, TableExtractor, NestedResolver, QueryTypeExtractor, DialectParser, SqlCleaner each own a
    single concern, composed by a thin Parser facade
  • Added multi-dialect auto-detection — tries multiple sqlglot dialects (MySQL, TSQL, Spark, custom HashVarDialect) and picks the first non-degraded result
  • Single-pass DFS column extraction — walks AST in arg_types key order preserving SQL text order, replacing multi-pass token scanning
  • Removed legacy modules — token.py (token-based extraction), compat.py (v1 compatibility layer), .flake8 config
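
A rough, sqlglot-free sketch of two ideas from the list above: trying dialects in order until one yields a non-degraded tree, and a single pre-order DFS that visits children in declared `arg_types` order. Everything here (`Node`, `Column`, `Where`, `Select`, `Command`, `walk`, `column_names`, `first_non_degraded`, `fake_parse`) is a hypothetical stand-in, not code from this PR or from sqlglot. Treating an opaque `Command` node as "degraded" is an assumption loosely based on the commit message about `exp.Command`.

```python
from typing import Callable, Iterator, List, Optional, Sequence, Tuple


class Node:
    """Toy AST node; `arg_types` lists child slots in SQL text order."""

    arg_types: Tuple[str, ...] = ()

    def __init__(self, **args: object) -> None:
        self.args = args


class Column(Node):
    arg_types = ("name",)


class Where(Node):
    arg_types = ("this",)


class Select(Node):
    arg_types = ("expressions", "where")


class Command(Node):
    """Opaque node: the parser gave up on structured parsing (assumed 'degraded')."""

    arg_types = ("text",)


def walk(node: Node) -> Iterator[Node]:
    """Pre-order DFS that visits children in declared `arg_types` order."""
    yield node
    for key in node.arg_types:
        value = node.args.get(key)
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, Node):
                yield from walk(child)


def column_names(tree: Node) -> List[str]:
    """Collect column names in one pass, in traversal (text) order."""
    return [str(n.args["name"]) for n in walk(tree) if isinstance(n, Column)]


def first_non_degraded(
    sql: str,
    dialects: Sequence[str],
    parse: Callable[[str, str], Node],
) -> Tuple[str, Node]:
    """Try each dialect in order; return the first tree that is not a Command."""
    fallback: Optional[Tuple[str, Node]] = None
    for dialect in dialects:
        try:
            tree = parse(sql, dialect)
        except ValueError:
            continue  # this dialect rejected the query outright
        if not isinstance(tree, Command):
            return dialect, tree
        if fallback is None:
            fallback = (dialect, tree)  # keep the first degraded result as a last resort
    if fallback is None:
        raise ValueError("no dialect could parse the query")
    return fallback


def fake_parse(sql: str, dialect: str) -> Node:
    """Stand-in parser: 'mysql' fails, 'spark' degrades, anything else succeeds."""
    if dialect == "mysql":
        raise ValueError("syntax error")
    if dialect == "spark":
        return Command(text=sql)
    # Keyword arguments are deliberately passed out of text order.
    return Select(
        where=Where(this=Column(name="c")),
        expressions=[Column(name="a"), Column(name="b")],
    )


dialect, tree = first_non_degraded(
    "SELECT a, b FROM t WHERE c = 1", ["mysql", "spark", "tsql"], fake_parse
)
print(dialect, column_names(tree))  # prints: tsql ['a', 'b', 'c']
```

The columns come out in text order even though `Select` was built with `where` first, because the walk follows the declared slot order rather than dictionary insertion order.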

Feature:

  • Added MERGE query type support

Admin:

  • Added mypy with strict settings (disallow_untyped_defs, check_untyped_defs, warn_return_any) and fixed all type errors across 13 source files
  • Added make type_check command and integrated mypy into CI workflow
  • Switched linting/formatting fully to ruff — removed black workflow, black pre-install step, and pylint references from CI
  • Added py.typed PEP 561 marker for downstream type checker support
  • Added ARCHITECTURE.md with Mermaid diagrams, traced walkthroughs, and module deep dives; updated agents.md to reflect the rewrite
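
As a small illustration of what the strict mypy flags named above report, here is a sketch with hypothetical helper functions (`load_untyped`, `load_any`, `load_checked`); none of them come from this PR. With `check_untyped_defs` enabled, the body of the unannotated function is type-checked as well.

```python
import json
from typing import Any, Dict


def load_untyped(raw):  # disallow_untyped_defs reports the missing annotations
    return json.loads(raw)


def load_any(raw: str) -> Dict[str, Any]:
    # warn_return_any reports this line: json.loads is typed as returning Any
    return json.loads(raw)


def load_checked(raw: str) -> Dict[str, Any]:
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object")
    return data  # isinstance narrowed the value away from Any


print(load_checked('{"table": "users"}'))
```

All three functions behave the same at runtime for valid input; only the last one passes the strict settings without a report.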

Resolved issues

Disclaimer: The PR was written with the help of Claude, although it required a lot of manual fixes too ;)

collerek added 15 commits March 25, 2026 16:37
…qlglot parses it, so sqlglot produces a proper exp.Insert AST instead of exp.Command and parses it correctly without falling back to regex
…ts from open issues to verify if it's handling the issues better than the old version, remove internal tokens and produce only list of strings if needed, remove compatibility layer to v1

collerek (Collaborator, Author) commented

@macbre, I started a proof-of-concept rewrite to replace the rather stale and slow sqlparse with sqlglot, following the conversation we had with Toby some (quite long) time ago.

Let me know how you feel about that in general (the idea, not the code details yet).

It seems we can close quite a lot of open issues, but replacing sqlparse was harder than I anticipated, since apparently we do a lot of other things too.

Note that it's still a work in progress. I was also working on it with Claude, but it required quite a few iterations and manual fixes anyway.

macbre (Owner) commented Mar 31, 2026

Sure, go for it 🚀 and welcome back!

…er for now mark nocover as unreachable from parser and this is the only entrypoint we want for majority of the tests

socket-security bot commented Mar 31, 2026

Review the following changes in direct dependencies.

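| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| Added | mypy@1.19.1 | 82 | 100 | 100 | 100 | 100 |
| Added | librt@0.8.1 | 99 | 100 | 100 | 100 | 100 |
| Added | ruff@0.11.13 | 100 | 100 | 100 | 100 | 100 |
| Updated | typing-extensions@4.13.2 ⏵ 4.15.0 | 100 | 100 | 100 | 100 | 100 |
| Added | ruff@0.15.8 | 100 | 100 | 100 | 100 | 100 |
| Added | sqlglot@30.0.3 | 100 | 100 | 100 | 100 | 100 |
| Added | sqlglot@30.1.0 | 100 | 100 | 100 | 100 | 100 |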


collerek added 4 commits April 1, 2026 18:22
…es from queries with subscripts, some additional issues that were already fixed were documented by tests, some cleanup and refactor to decrease unreachable paths
…w star or star with table when prefixed with table name/alias - unreachable code