From 5bab0b5f6ed81d669f550f85db78720bd536dbd4 Mon Sep 17 00:00:00 2001
From: hussainsultan
Date: Tue, 4 Feb 2025 10:32:14 -0500
Subject: [PATCH] chore: copy-edit (dan)

---
 docs/posts/udf-rewriting/index.qmd | 58 +++++++++++++++---------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/docs/posts/udf-rewriting/index.qmd b/docs/posts/udf-rewriting/index.qmd
index e737064f6741..6c51a8c8d81a 100644
--- a/docs/posts/udf-rewriting/index.qmd
+++ b/docs/posts/udf-rewriting/index.qmd
@@ -56,11 +56,10 @@ these optimizations add up quickly.
 
 ## Smart UDFs with Ibis
 
-**Ibis** is known for letting you write queries in Python without losing the
+**Ibis** is known for letting you write engine-agnostic deferred expressions in Python without losing the
 power of underlying engines like Spark, DuckDB, or BigQuery. Meanwhile,
-quickgrove provides a mechanism to prune gradient-boosted decision trees based
-on known filter conditions for pruning Gradient Boosted Decision Tree (GBDT)
-models.
+quickgrove provides a mechanism to prune Gradient Boosted Decision Tree (GBDT) models based
+on known filter conditions.
 
 **Key Ideas**:
 
@@ -86,9 +85,9 @@ approach here for **forests** (GBDTs) using **Ibis.**
 
 ---
 
-### Quickgrove: prune-able XGBoost models
+### Quickgrove: prune-able GBDT models
 
-Quickgrove is an experimental package that can loads XGBoost JSON models and
+Quickgrove is an experimental package that can load GBDT JSON models and
 provides a `.prune(...)` API to remove unreachable branches. For example:
 
 ```python
@@ -99,7 +98,7 @@ model = quickgrove.json_load("diamonds_model.json") # Load an XGBoost model
 model.prune([quickgrove.Feature("color_i") < 0.2]) # Prune based on known predicate
 ```
 
-Once pruned, the model is leaner to evaluate. The results heavily depend on
+Once pruned, the model is leaner to evaluate. Note: The results heavily depend on
 model splits and interactions with predicate pushdowns.
 
 ---
@@ -107,10 +106,10 @@ model splits and interactions with predicate pushdowns.
 ## Scalar PyArrow UDFs in Ibis
 
 ::: {.column-margin}
-Please note that we are using the DataFusion backend. DataFusion backend and
-DuckDB backends behave differently in that DuckDB expects a `ChunkedArray`
-while DataFusion UDFs expect `ArrayRef`. This case needs to be handled if we
-want the same UDF to run in DuckDB backend.
+Please note that we are using our own modified DataFusion backend. The
+DataFusion backend and DuckDB backend behave differently: DuckDB expects a
+`ChunkedArray` while DataFusion UDFs expect `ArrayRef`. We are working on
+extending quickgrove to work with the DuckDB backend.
 :::
 
 We’ll define a simple Ibis UDF that calls our `model.predict_arrays` under the
@@ -131,13 +130,12 @@ def predict_gbdt(
     return model.predict_arrays(array_list)
 ```
 
-In its default form, `predict_gbdt` is a black box. Now we need Ibis to
-“understand” it enough to let us swap it out for a pruned version under the
-right conditions.
+Currently, UDFs are opaque to Ibis. We need to teach Ibis how to rewrite a
+UDF based on predicates it knows about.
 
 ---
 
-## Making Ibis predicate-aware
+## Making Ibis UDFs predicate-aware
 
 Here’s the general process:
 
@@ -321,7 +319,8 @@ result = optimized_expr.to_expr().execute()
 ```
 
 When this is done, the model inside `predict_gbdt` will be **pruned** based on
-your filter conditions. On large datasets, this can yield significant speedups.
+the expression's filter conditions. On large datasets, this can yield
+significant speedups (see @tbl-perf).
 
 ---
 
@@ -346,10 +345,11 @@ Benchmark results:
 | 5M | 0.82 ±0.02 | 0.67 ±0.02 | 18.0% |
 | 25M | 4.16 ±0.01 | 3.46 ±0.05 | 16.7% |
 | 100M | 16.80 ±0.17 | 14.07 ±0.11 | 16.3% |
+: Performance improvements {#tbl-perf}
 
 **Key takeaway**: As data volume grows, skipping unneeded tree branches can
-translate to real savings in both time and compute cost, albeit heavily
-dependent on how pertinent the filter conditions might be.
+translate to real compute savings, albeit heavily dependent on how pertinent
+the filter conditions might be.
 
 ---
 
@@ -389,7 +389,8 @@ parts of rewriting your query plan.
 can extend it to handle `<=`, `>`, `BETWEEN`, or even categorical splits.
 - **Quickgrove** only supports a handful of objective functions and most
 notably does not have categorical support yet. In theory, categorical variables
-make a better candidates for pruning based on filter conditions.
+make better candidates for pruning based on filter conditions. It only
+supports XGBoost format.
 - **Model Format**: XGBoost JSON is straightforward to parse. Other formats
 (e.g. LightGBM, scikit-learn trees) require similar logic or conversion steps.
 - **Edge Cases**: If the filter references columns not in the model features,
@@ -404,16 +405,15 @@ filters, the overhead of rewriting might outweigh the benefit.
 ## Conclusion
 
 Combining **Ibis** with a prune-friendly framework like quickgrove lets you
-automatically optimize large-scale ML inference inside SQL queries. By
-**pushing filter predicates down into your decision trees**, you skip
-unnecessary computations and speed up queries significantly.
-
-**And with LetSQL**, you can streamline this entire process—especially if
-you’re looking for an out-of-the-box solution that integrates with multiple
-engines along with batteries included features like caching and
-aggregate/window UDFs. As next steps, consider experimenting with more complex
-models, exploring different tree pruning strategies, or even extending this
-pattern to other ML models beyond GBDTs.
+optimize large-scale ML inference inside ML workflows. By **pushing filter
+predicates down into your decision trees**, you speed up queries significantly.
+
+**With LetSQL**, you can streamline this entire process—especially if you’re
+looking for an out-of-the-box solution that integrates with multiple engines
+along with batteries-included features like caching and aggregate/window UDFs.
+As next steps, consider experimenting with more complex models, exploring
+different tree pruning strategies, or even extending this pattern to other ML
+models beyond GBDTs.
 
 - **Try it out**: Explore the Ibis documentation to learn how to build custom UDFs.