Commit

chore: copy-edit (dan)
hussainsultan committed Feb 4, 2025
1 parent 1c75a69 commit 5bab0b5
Showing 1 changed file with 29 additions and 29 deletions.
58 changes: 29 additions & 29 deletions docs/posts/udf-rewriting/index.qmd
@@ -56,11 +56,10 @@ these optimizations add up quickly.

## Smart UDFs with Ibis

**Ibis** is known for letting you write queries in Python without losing the
**Ibis** is known for letting you write engine-agnostic deferred expressions in Python without losing the
power of underlying engines like Spark, DuckDB, or BigQuery. Meanwhile,
quickgrove provides a mechanism to prune gradient-boosted decision trees based
on known filter conditions for pruning Gradient Boosted Decision Tree (GBDT)
models.
quickgrove provides a mechanism to prune Gradient Boosted Decision Tree (GBDT) models based
on known filter conditions.

**Key Ideas**:

@@ -86,9 +85,9 @@ approach here for **forests** (GBDTs) using **Ibis.**

---

### Quickgrove: prune-able XGBoost models
### Quickgrove: prune-able GBDT models

Quickgrove is an experimental package that can load XGBoost JSON models and
Quickgrove is an experimental package that can load GBDT JSON models and
provides a `.prune(...)` API to remove unreachable branches. For example:

```python
@@ -99,18 +98,18 @@ model = quickgrove.json_load("diamonds_model.json") # Load an XGBoost model
model.prune([quickgrove.Feature("color_i") < 0.2]) # Prune based on known predicate
```

Once pruned, the model is leaner to evaluate. The results heavily depend on
Once pruned, the model is leaner to evaluate. Note: The results heavily depend on
model splits and interactions with predicate pushdowns.
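
To make the pruning idea concrete, here is a minimal pure-Python sketch of removing branches that a known `feature < value` predicate makes unreachable. It does not use quickgrove and is not its implementation; the `Leaf`, `Split`, `prune`, and `predict` names are hypothetical, and it assumes the convention that rows with `feature < threshold` go left.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float


@dataclass
class Split:
    feature: str
    threshold: float
    left: "Node"   # taken when row[feature] < threshold (assumed convention)
    right: "Node"  # taken when row[feature] >= threshold


Node = Union[Leaf, Split]


def prune(node: Node, feature: str, upper: float) -> Node:
    """Drop branches unreachable when every row satisfies `feature < upper`."""
    if isinstance(node, Leaf):
        return node
    left = prune(node.left, feature, upper)
    right = prune(node.right, feature, upper)
    if node.feature == feature and upper <= node.threshold:
        # Every row has value < upper <= threshold, so only the left
        # branch can ever be reached; the split itself disappears.
        return left
    return Split(node.feature, node.threshold, left, right)


def predict(node: Node, row: dict) -> float:
    """Walk the tree for a single row and return the leaf value."""
    while isinstance(node, Split):
        node = node.left if row[node.feature] < node.threshold else node.right
    return node.value


# Toy tree: the whole right subtree is unreachable once color_i < 0.2 is known.
tree = Split(
    "color_i", 0.5,
    left=Split("carat", 1.0, Leaf(10.0), Leaf(20.0)),
    right=Split("color_i", 0.8, Leaf(30.0), Leaf(40.0)),
)
pruned = prune(tree, "color_i", 0.2)
```

For any row that satisfies the predicate, `predict(pruned, row)` matches `predict(tree, row)` while evaluating fewer splits. How much is saved depends on where the model actually splits on the filtered feature.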

---

## Scalar PyArrow UDFs in Ibis

::: {.column-margin}
Please note that we are using the DataFusion backend. DataFusion backend and
DuckDB backends behave differently in that DuckDB expects a `ChunkedArray`
while DataFusion UDFs expect `ArrayRef`. This case needs to be handled if we
want the same UDF to run in DuckDB backend.
Please note that we are using our own modified DataFusion backend. The
DataFusion backend and DuckDB backend behave differently: DuckDB expects a
`ChunkedArray` while DataFusion UDFs expect `ArrayRef`. We are working on
extending quickgrove to work with the DuckDB backend.
:::

We’ll define a simple Ibis UDF that calls our `model.predict_arrays` under the
@@ -131,13 +130,12 @@ def predict_gbdt(
return model.predict_arrays(array_list)
```

In its default form, `predict_gbdt` is a black box. Now we need Ibis to
“understand” it enough to let us swap it out for a pruned version under the
right conditions.
Currently, UDFs are opaque to Ibis. We need to teach Ibis how to rewrite a
UDF based on predicates it knows about.

---

## Making Ibis predicate-aware
## Making Ibis UDFs predicate-aware

Here’s the general process:

@@ -321,7 +319,8 @@ result = optimized_expr.to_expr().execute()
```
When this is done, the model inside `predict_gbdt` will be **pruned** based on
your filter conditions. On large datasets, this can yield significant speedups.
the expression's filter conditions. On large datasets, this can yield
significant speedups (see @tbl-perf).
---
@@ -346,10 +345,11 @@ Benchmark results:
| 5M | 0.82 ±0.02 | 0.67 ±0.02 | 18.0% |
| 25M | 4.16 ±0.01 | 3.46 ±0.05 | 16.7% |
| 100M | 16.80 ±0.17 | 14.07 ±0.11 | 16.3% |
: Performance improvements {#tbl-perf}
**Key takeaway**: As data volume grows, skipping unneeded tree branches can
translate to real savings in both time and compute cost, albeit heavily
dependent on how pertinent the filter conditions might be.
translate to real compute savings, albeit heavily dependent on how pertinent
the filter conditions might be.
---
@@ -389,7 +389,8 @@ parts of rewriting your query plan.
can extend it to handle `<=`, `>`, `BETWEEN`, or even categorical splits.
- **Quickgrove** only supports a handful of objective functions and most
notably does not have categorical support yet. In theory, categorical variables
make better candidates for pruning based on filter conditions.
make better candidates for pruning based on filter conditions. It only
supports the XGBoost format.
- **Model Format**: XGBoost JSON is straightforward to parse. Other formats
(e.g. LightGBM, scikit-learn trees) require similar logic or conversion steps.
- **Edge Cases**: If the filter references columns not in the model features,
@@ -404,16 +405,15 @@ filters, the overhead of rewriting might outweigh the benefit.
## Conclusion
Combining **Ibis** with a prune-friendly framework like quickgrove lets you
automatically optimize large-scale ML inference inside SQL queries. By
**pushing filter predicates down into your decision trees**, you skip
unnecessary computations and speed up queries significantly.
**And with LetSQL**, you can streamline this entire process—especially if
you’re looking for an out-of-the-box solution that integrates with multiple
engines along with batteries included features like caching and
aggregate/window UDFs. As next steps, consider experimenting with more complex
models, exploring different tree pruning strategies, or even extending this
pattern to other ML models beyond GBDTs.
optimize large-scale ML inference inside ML workflows. By **pushing filter
predicates down into your decision trees**, you speed up queries significantly.
**With LetSQL**, you can streamline this entire process, especially if you’re
looking for an out-of-the-box solution that integrates with multiple engines
along with batteries-included features like caching and aggregate/window UDFs.
As next steps, consider experimenting with more complex models, exploring
different tree pruning strategies, or even extending this pattern to other ML
models beyond GBDTs.
- **Try it out**: Explore the Ibis documentation to learn how to build custom
UDFs.
