  1 |   1 |   ---
  2 |   2 |   title: "Using IbisML and DuckDB for a Kaggle competition: credit risk model stability"
  3 |   3 |   author: "Jiting Xu"
  4 |     | - date: "2024-08-15"
    |   4 | + date: "2024-08-21"
  5 |   5 |   categories:
  6 |   6 |   - blog
  7 |     | - - DuckDB
    |   7 | + - duckdb
  8 |   8 |   - machine learning
  9 |   9 |   - feature engineering
 10 |     | - execute:
 11 |     | -   freeze: auto
 12 |  10 |   ---
 13 |  11 |
 14 |  12 |   ## Introduction
 15 |     | - In this post, we'll demonstrate how to use Ibis and IbisML end-to-end for the
    |  13 | + In this post, we'll demonstrate how to use Ibis and [IbisML](https://github.com/ibis-project/ibis-ml)
    |  14 | + end-to-end for the
 16 |  15 |   [credit risk model stability Kaggle competition](https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability).
 17 |  16 |
 18 |  17 |   1. Load data and perform feature engineering on DuckDB backend using IbisML
 19 |  18 |   2. Perform last-mile ML data preprocessing on DuckDB backend using IbisML
 20 |  19 |   3. Train two models using different frameworks:
 21 |  20 |       * An XGBoost model within a scikit-learn pipeline.
 22 |     | -     * A neural network with PyTorch and PyTorch Lightning
    |  21 | +     * A neural network with PyTorch and PyTorch Lightning.
 23 |  22 |
 24 |  23 |   The aim of this competition is to predict which clients are more likely to default on their
 25 |  24 |   loans by using both internal and external data sources.
@@ -93,6 +92,8 @@ ibis.options.interactive = True
 93 |  92 |   Set the backend for computing:
 94 |  93 |   ```{python}
 95 |  94 |   con = ibis.duckdb.connect()
    |  95 | + # remove the black bars from duckdb's progress bar
    |  96 | + con.raw_sql("set enable_progress_bar = false")
 96 |  97 |   # DuckDB is the default backend for Ibis
 97 |  98 |   ibis.set_backend(con)
 98 |  99 |   ```
@@ -612,7 +613,7 @@ Calculate all the days difference between any date columns and the column `date_
612 | 613 |   #| code-summary: "Show code to calculate days difference between date columns and date_decision"
613 | 614 |   date_cols = [col_name for col_name in df_train.columns if col_name[-1] == "D"]
614 | 615 |   days_to_decision_expr = {
615 |     | -     # Difference in days
    | 616 | +     # difference in days
616 | 617 |       f"{col}_date_decision_diff": (
617 | 618 |           _.date_decision.epoch_seconds() - getattr(_, col).epoch_seconds()
618 | 619 |       )
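
Note on the last hunk: the visible lines subtract each date column's epoch seconds from those of `date_decision`, and the hunk ends before any unit conversion is shown. The sketch below illustrates the arithmetic in plain Python rather than the post's Ibis expressions. It assumes the difference in seconds is then divided by the number of seconds in a day, which the comment "difference in days" suggests but the diff does not show. The helpers `epoch_seconds` and `days_between`, and the sample dates, are hypothetical.

```python
from datetime import datetime, timezone

# Assumption: the seconds difference is converted to days by dividing by 86400.
SECONDS_PER_DAY = 60 * 60 * 24


def epoch_seconds(d: datetime) -> int:
    """Return whole seconds since 1970-01-01 UTC for a timezone-aware datetime."""
    return int(d.timestamp())


def days_between(date_decision: datetime, other: datetime) -> float:
    """Days from `other` to `date_decision`, computed via epoch seconds."""
    return (epoch_seconds(date_decision) - epoch_seconds(other)) / SECONDS_PER_DAY


# Hypothetical sample values standing in for `date_decision` and one "*D" column.
date_decision = datetime(2024, 8, 21, tzinfo=timezone.utc)
example_date = datetime(2024, 8, 15, tzinfo=timezone.utc)

print(days_between(date_decision, example_date))  # 6.0
```

A positive result means the date column precedes the decision date; swapping the arguments flips the sign.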