Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs(blog): add post on SQL understanding and Ibis #10762

Open
wants to merge 15 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
210 changes: 210 additions & 0 deletions docs/posts/does-ibis-understand-sql/index.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,210 @@
---
title: "Does Ibis understand SQL?"
author: "Deepyaman Datta"
date: "2025-02-06"
image: thumbnail.webp
categories:
- blog
- internals
- sql
---

Last month, an [insightful article on the dbt Developer Blog on what SQL comprehension really means](https://docs.getdbt.com/blog/the-levels-of-sql-comprehension)
came across my LinkedIn feed. The big deal about SDF is that it, unlike dbt, actually _understands_
SQL. As an Ibis user and contributor, several of the concepts covered in the post were familiar—in
fact, I first learned about Ibis because the product I was working on required an
[intermediate representation](https://en.wikipedia.org/wiki/Intermediate_representation) that could
be compiled to Flink SQL code. In that case, as a dataframe library that interfaces with databases,
does Ibis also understand SQL?

## Tl;dr

Ibis doesn't understand SQL per se, but it does understand what you're trying to do. Ibis, much like
SQL, defines a standardized interface for working with databases. Because Ibis understands queries
expressed through this user interface, it also provides users with some of the unique capabilities
SDF offers, including the ability to execute said logic on the backend of the user's choice.

## How does Ibis work?

To answer the question of whether Ibis understands SQL, we first need to understand the internals of
Ibis. Specifically, how is the code that users write with Ibis eventually executed on a SQL backend?

### Building an expression

Ibis provides a dataframe API for writing expressions. A
[follow-up article on the dbt Developer Blog on the key technologies behind SQL comprehension](https://docs.getdbt.com/blog/sql-comprehension-technologies)
used the following SQL query in illustrating what the parser and compiler do:

```sql
select x as u from t where x > 0
```

In SQL, the _binder_ adds type information to the syntax tree produced by the _parser_. This order
of operations differs from the way Ibis works; in Ibis, [`Node`](../../concepts/internals.qmd#the-ibis.expr.types.node-class)s—
the core operations that can be applied to expressions, such as `ibis.expr.operations.Add` and
`ibis.expr.operations.WindowFunction`—must be applied to [`Expr`](../../concepts/internals.qmd#the-expr-class)
objects containing data type and shape information.

The [`Table`](../../reference/expression-tables.qmd#ibis.expr.types.relations.Table) is one of the
core Ibis data structures, analogous to a SQL table. It's also an `Expr` subclass. We begin by
manually defining a `Table` with our desired schema here, but one can also construct a table from an
existing database table, file, or in-memory data representation:

```{python}
import ibis

t = ibis.table(dict(x="int32", y="float", z="string"), name="t")
```

Next, we apply a filter and rename the `x` column as in the SQL query above:

```{python}
expr = t.filter(t.x > 0).select(t.x.name("u"))
```

In SQL, the _parser_ translates the query into a syntax tree, but in Ibis, expressions are
inherently represented as a tree of `Expr` and `Node` objects. Ibis enables users to [`visualize()`](../../reference/expression-tables.qmd#ibis.expr.types.relations.Table.visualize)
this intermediate representation for any expression:

```{python}
from ibis.expr.visualize import to_graph

to_graph(expr)
```

In non-interactive mode, pretty-printing the expression yields an equivalent textual representation:

```{python}
expr
```

Look at that! Unsurprisingly, the resulting `repr()` matches the generated logical plan from the SQL
comprehension technologies article:

![](logical_plan.png)

Note that the abstract syntax tree is an artifact of the parsing step and has no direct Ibis analog.

### Compiling the expression

In the past, Ibis would compile expressions to SQL using its own SQL generation logic. However, with
the [completion of "the big refactor" in Ibis 9.0](https://ibis-project.org/posts/ibis-version-9.0.0-release/),
Ibis fully transitioned to producing [SQLGlot](https://github.com/tobymao/sqlglot) expressions under
the hood.

We can see the intermediate SQLGlot representation of our expression using the `to_sqlglot()` method
on an Ibis backend compiler implementation:

```{python}
from ibis.backends.sql.compilers.duckdb import compiler

query = compiler.to_sqlglot(expr)
query
```

Ibis delegates SQL generation to SQLGlot, essentially calling the `sql()` method on the above query:

```{python}
query.sql()
```

Ibis also provides a top-level [`to_sql()`](../../reference/expression-generic.qmd#ibis.to_sql)
method, so most users don't need to be aware of SQLGlot or the inner workings of Ibis expression
compilation—unless they want to:

```{python}
ibis.to_sql(expr)
```

::: {.callout-tip}
## Why SQLGlot?

[SQLGlot is a no-dependency SQL parser, transpiler, optimizer, and engine.](https://sqlglot.com/sqlglot.html)
It's a widely-used open-source project that powers the SQL comprehension and generation capabilities
of tools like [SQLMesh](https://github.com/TobikoData/sqlmesh), [Apache Superset](https://github.com/apache/superset),
and [Dagster](https://github.com/dagster-io/dagster). In fact, SQLGlot is also the engine behind the
column-level lineage feature available in dbt Cloud!

The specifics of why Ibis chose SQLGlot are beyond the scope of this article, but you can learn more
about the reasoning from a [GitHub discussion on moving the SQL backends from SQLAlchemy to SQLGlot](https://github.com/ibis-project/ibis/discussions/7213).
:::

### Executing the compiled expression

Execution is the most straightforward part of the process. For most Ibis-supported SQL backends, the
`execute()` method uses the database connection associated with the backend instance—usually managed
by the underlying Python client library—to submit and fetch results for the compiled query. Last but
not least, Ibis massages the returned data into the desired format (e.g. a pandas DataFrame for easy
consumption). Because this final processing step falls outside the expression-understanding process,
we'll end our journey through the Ibis execution flow here.

Note that the database still performs all of the activities covered by the SQL comprehension levels;
the database is completely oblivious to whatever Ibis did prior to providing the raw SQL to execute.

## So, does Ibis understand SQL?

Ibis's expressive dataframe API lets users avoid writing handcrafted SQL queries in most situations,
but there are still cases where you may need to [use SQL strings in your Ibis code](../../how-to/extending/sql).
[`Table.sql()`](../../reference/expression-tables.qmd#ibis.expr.types.relations.Table.sql) and
[`Backend.sql()`](../../backends/duckdb.qmd#ibis.backends.duckdb.Backend.sql) work very similarly to
each other. To explore both of these options, we'll need to initialize a backend connection (DuckDB
works perfectly for this purpose) and register our table:

```{python}
con = ibis.duckdb.connect()
t = con.create_table("t", t)
```

Now we can use the `Table.sql()` method to handle our SQL query:

```{python}
expr = t.sql("select x as u from t where x > 0")
expr
```

Note that the expression tree doesn't have the same level of detail as the one we built using the
dataframe API in the first section. Instead, a [`SQLStringView`](../../reference/operations.qmd#ibis.expr.operations.relations.SQLStringView)
node encapsulates the query. Ibis only understands the output schema for the operation, which it
uses to validate any downstream operations.

The main difference between `Table.sql()` and `Backend.sql()` is that `Backend.sql()` can only refer
to tables that already exist in the database. We see that reflected in the expression tree `repr()`;
the `DatabaseTable` node is not present in the `Backend.sql()` case, and a [`SQLQueryResult`](../../reference/operations.qmd#ibis.expr.operations.relations.SQLQueryResult)
node (that does not contain a reference to another Ibis relation) replaces the `SQLStringView` node:

```{python}
expr = con.sql("select x as u from t where x > 0")
expr
```

That said, the intermediate SQLGlot representations are identical for both alternatives:

```{python}
con.compiler.to_sqlglot(expr)
```

In fact, the resulting SQLGlot expression is only slightly, nonfunctionally different from the one
we got from the dataframe API. Similarly, the compiled SQL is identical except for the table alias:

```{python}
ibis.to_sql(expr)
```

In short, while Ibis doesn't understand the inner structure of SQL queries passed to `Table.sql()`
or `Backend.sql()`, it still compiles to SQLGlot expressions under the hood, which means SQLGlot's
capabilities (like producing column-level lineage) are still applicable.

The third option for writing SQL strings in Ibis, [`Backend.raw_sql()`](../../backends/duckdb.qmd#ibis.backends.duckdb.Backend.raw_sql),
is opaque and only exists for situations where the user needs to run arbitrary SQL code. Ibis simply
executes the code and returns the associated cursor; it does not attempt to understand the SQL.

## Further reading

In this article, we explored how Ibis constructs expressions and compiles them to SQL using SQLGlot.
Even as somebody who has contributed significantly to Ibis over the years, I learned a lot about the
inner workings of the library. While it was too much to cover in a single blog post, I'll end with a
few resources I found interesting while drafting this article for those interested in diving deeper:

- [Phillip Cloud on why Ibis moved away from SQLAlchemy to SQLGlot](https://github.com/ibis-project/ibis/discussions/7213)
- [Toby Mao on the way SQLGlot computes column-level lineage](https://www.linkedin.com/posts/toby-mao_the-way-sqlglot-computes-column-level-lineage-activity-7166683013311852547-on21/)
- [Krisztián Szűcs's PR that split the relational operations and created a new Ibis expression IR](https://github.com/ibis-project/ibis/pull/7752)
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file not shown.