From 5bab0b5f6ed81d669f550f85db78720bd536dbd4 Mon Sep 17 00:00:00 2001
From: hussainsultan <hussainz@gmail.com>
Date: Tue, 4 Feb 2025 10:32:14 -0500
Subject: [PATCH] chore: copy-edit (dan)

---
 docs/posts/udf-rewriting/index.qmd | 58 +++++++++++++++---------------
 1 file changed, 29 insertions(+), 29 deletions(-)

diff --git a/docs/posts/udf-rewriting/index.qmd b/docs/posts/udf-rewriting/index.qmd
index e737064f6741..6c51a8c8d81a 100644
--- a/docs/posts/udf-rewriting/index.qmd
+++ b/docs/posts/udf-rewriting/index.qmd
@@ -56,11 +56,10 @@ these optimizations add up quickly.
 
 ## Smart UDFs with Ibis
 
-**Ibis** is known for letting you write queries in Python without losing the
+**Ibis** is known for letting you write engine-agnostic deferred expressions in Python without losing the
 power of underlying engines like Spark, DuckDB, or BigQuery. Meanwhile,
-quickgrove provides a mechanism to prune gradient-boosted decision trees based
-on known filter conditions for pruning Gradient Boosted Decision Tree (GBDT)
-models.
+quickgrove provides a mechanism to prune Gradient Boosted Decision Tree (GBDT) models based
+on known filter conditions.
 
 **Key Ideas**:
 
@@ -86,9 +85,9 @@ approach here for **forests** (GBDTs) using **Ibis.**
 
 ---
 
-### Quickgrove: prune-able XGBoost models
+### Quickgrove: prune-able GBDT models
 
-Quickgrove is an experimental package that can loads XGBoost JSON models and
+Quickgrove is an experimental package that can load GBDT JSON models and
 provides a `.prune(...)` API to remove unreachable branches. For example:
 
 ```python
@@ -99,7 +98,7 @@ model = quickgrove.json_load("diamonds_model.json")  # Load an XGBoost model
 model.prune([quickgrove.Feature("color_i") < 0.2]) # Prune based on known predicate
 ```
 
-Once pruned, the model is leaner to evaluate. The results heavily depend on
+Once pruned, the model is leaner to evaluate. Note: The results heavily depend on
 model splits and interactions with predicate pushdowns.
 
 ---
@@ -107,10 +106,10 @@ model splits and interactions with predicate pushdowns.
 ## Scalar PyArrow UDFs in Ibis
 
 ::: {.column-margin}
-Please note that we are using the DataFusion backend. DataFusion backend and
-DuckDB backends behave differently in that DuckDB expects a `ChunkedArray`
-while DataFusion UDFs expect `ArrayRef`. This case needs to be handled if we
-want the same UDF to run in DuckDB backend.
+Please note that we are using our own modified DataFusion backend. The
+DataFusion backend and DuckDB backend behave differently: DuckDB expects a
+`ChunkedArray` while DataFusion UDFs expect `ArrayRef`. We are working on
+extending quickgrove to work with the DuckDB backend.
 :::
 
 We’ll define a simple Ibis UDF that calls our `model.predict_arrays` under the
@@ -131,13 +130,12 @@ def predict_gbdt(
     return model.predict_arrays(array_list)
 ```
 
-In its default form, `predict_gbdt` is a black box. Now we need Ibis to
-“understand” it enough to let us swap it out for a pruned version under the
-right conditions.
+Currently, UDFs are opaque to Ibis. We need to teach Ibis how to rewrite a
+UDF based on predicates it knows about.
 
 ---
 
-## Making Ibis predicate-aware
+## Making Ibis UDFs predicate-aware
 
 Here’s the general process:
 
@@ -321,7 +319,8 @@ result = optimized_expr.to_expr().execute()
 ```
 
 When this is done, the model inside `predict_gbdt` will be  **pruned** based on
-your filter conditions. On large datasets, this can yield significant speedups.
+the expression's filter conditions. On large datasets, this can yield
+significant speedups (see @tbl-perf).
 
 ---
 
@@ -346,10 +345,11 @@ Benchmark results:
 | 5M | 0.82 ±0.02 | 0.67 ±0.02 | 18.0% |
 | 25M | 4.16 ±0.01 | 3.46 ±0.05 | 16.7% |
 | 100M | 16.80 ±0.17 | 14.07 ±0.11 | 16.3% |
+: Performance improvements {#tbl-perf}
 
 **Key takeaway**: As data volume grows, skipping unneeded tree branches can
-translate to real savings in both time and compute cost, albeit heavily
-dependent on how pertinent the filter conditions might be.
+translate to real compute savings, albeit heavily dependent on how pertinent
+the filter conditions might be.
 
 ---
 
@@ -389,7 +389,8 @@ parts of rewriting your query plan.
 can extend it to handle `<=`, `>`, `BETWEEN`, or even categorical splits.
 - **Quickgrove** only supports a handful of objective functions and most
 notably does not have categorical support yet. In theory, categorical variables
-make a better candidates for pruning based on filter conditions.
+make better candidates for pruning based on filter conditions. Quickgrove also
+supports only the XGBoost format.
 - **Model Format**: XGBoost JSON is straightforward to parse. Other formats
 (e.g. LightGBM, scikit-learn trees) require similar logic or conversion steps.
 - **Edge Cases**: If the filter references columns not in the model features,
@@ -404,16 +405,15 @@ filters, the overhead of rewriting might outweigh the benefit.
 ## Conclusion
 
 Combining **Ibis** with a prune-friendly framework like quickgrove lets you
-automatically optimize large-scale ML inference inside SQL queries. By
-**pushing filter predicates down into your decision trees**, you skip
-unnecessary computations and speed up queries significantly.
-
-**And with LetSQL**, you can streamline this entire process—especially if
-you’re looking for an out-of-the-box solution that integrates with multiple
-engines along with batteries included features like caching and
-aggregate/window UDFs. As next steps, consider experimenting with more complex
-models, exploring different tree pruning strategies, or even extending this
-pattern to other ML models beyond GBDTs.
+optimize large-scale inference inside ML workflows. By **pushing filter
+predicates down into your decision trees**, you speed up queries significantly.
+
+**With LetSQL**, you can streamline this entire process, especially if you’re
+looking for an out-of-the-box solution that integrates with multiple engines
+along with batteries-included features like caching and aggregate/window UDFs.
+As next steps, consider experimenting with more complex models, exploring
+different tree pruning strategies, or even extending this pattern to other ML
+models beyond GBDTs.
 
 - **Try it out**: Explore the Ibis documentation to learn how to build custom
 UDFs.