
Commit 2e3a2ed

feat: convert marimo notebooks with nbstripout setup
1 parent: 4e83c38 · commit: 2e3a2ed

23 files changed: +3052 −938 lines

.pre-commit-config.yaml

Lines changed: 5 additions & 0 deletions
@@ -1,4 +1,9 @@
 repos:
+  - repo: https://github.com/kynan/nbstripout
+    rev: 0.8.1
+    hooks:
+      - id: nbstripout
+        args: [--drop-empty-cells]
   - repo: https://github.com/charliermarsh/ruff-pre-commit
     rev: v0.11.6
     hooks:
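
Concretely, this hook rewrites every staged .ipynb file so that only source reaches history: outputs and execution counts are cleared, and with --drop-empty-cells any cell whose source is empty is removed. A minimal Python sketch of that behavior (illustrative only, not nbstripout's actual implementation; strip_notebook is a hypothetical helper):

import json

def strip_notebook(path: str) -> None:
    # Load the notebook JSON, drop empty cells, clear code-cell outputs.
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    kept = []
    for cell in nb["cells"]:
        if not "".join(cell["source"]).strip():
            continue  # --drop-empty-cells: discard cells with no source
        if cell["cell_type"] == "code":
            cell["outputs"] = []  # stored outputs never reach the commit
            cell["execution_count"] = None
        kept.append(cell)
    nb["cells"] = kept
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1)
        f.write("\n")

After `pre-commit install`, the hook runs on every commit; `pre-commit run nbstripout --all-files` applies it to the whole repository once.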

README.md

Lines changed: 7 additions & 7 deletions
@@ -29,16 +29,16 @@ This repository is a curated collection of data science articles from CodeCut, c
 | Python Helper Tools | Introducing FugueSQL — SQL for Pandas, Spark, and Dask DataFrames | [🔗](https://codecut.ai/introducing-fuguesql-sql-for-pandas-spark-and-dask-dataframes-2/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/blob/master/data_science_tools/fugueSQL.ipynb) | |
 | Python Helper Tools | Fugue and DuckDB: Fast SQL Code in Python | [🔗](https://codecut.ai/fugue-and-duckdb-fast-sql-code-in-python-2/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/blob/master/productive_tools/Fugue_and_Duckdb/Fugue_and_Duckdb.ipynb) | |
 | Python Helper Tools | Marimo: A Modern Notebook for Reproducible Data Science | [🔗](https://codecut.ai/marimo-a-modern-notebook-for-reproducible-data-science/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/tree/master/data_science_tools/marimo_examples) | |
-| Feature Engineering | Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames | [🔗](https://codecut.ai/polars-vs-pandas-a-fast-multi-core-alternative-for-dataframes/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://codecuttech.github.io/Data-science/data_science_tools/polars_vs_pandas.html) | |
+| Feature Engineering | Polars vs. Pandas: A Fast, Multi-Core Alternative for DataFrames | [🔗](https://codecut.ai/polars-vs-pandas-a-fast-multi-core-alternative-for-dataframes/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/blob/master/data_science_tools/polars_vs_pandas.ipynb) | |
 | Visualization | Top 6 Python Libraries for Visualization: Which one to Use? | [🔗](https://codecut.ai/top-6-python-libraries-for-visualization-which-one-to-use/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/tree/master/visualization/top_visualization.ipynb) | |
 | Python | Python Clean Code: 6 Best Practices to Make Your Python Functions More Readable | [🔗](https://codecut.ai/python-clean-code-6-best-practices-to-make-your-python-functions-more-readable-2/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/tree/master/python/good_functions) | [🔗](https://youtu.be/IDHD8JYBl5M) |
 | Logging and Debugging | Loguru: Simple as Print, Flexible as Logging | [🔗](https://codecut.ai/simplify-your-python-logging-with-loguru/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/tree/master/productive_tools/logging_tools) | [🔗](https://youtu.be/XY_OrUoR-HU) |
-| LLM | Enforce Structured Outputs from LLMs with PydanticAI | [🔗](https://codecut.ai/enforce-structured-outputs-from-llms-with-pydanticai/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://codecuttech.github.io/Data-science/llm/pydantic_ai_examples.html) | |
-| LLM | Run Private AI Workflows with LangChain and Ollama | [🔗](https://codecut.ai/private-ai-workflows-langchain-ollama/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://codecuttech.github.io/Data-science/llm/lchain_ollama.html) | |
-| Speed-up Tools | Writing Safer PySpark Queries with Parameters | [🔗](https://codecut.ai/pyspark-sql-enhancing-reusability-with-parameterized-queries/) | [🔗](https://codecuttech.github.io/Data-science/data_science_tools/pyspark_parametrize.html) | |
-| Speed-up Tools | Narwhals: Unified DataFrame Functions for pandas, Polars, and PySpark | [🔗](https://codecut.ai/unified-dataframe-functions-pandas-polars-pyspark/) | [🔗](https://codecuttech.github.io/Data-science/data_science_tools/narwhals.html) | |
-| Speed-up Tools | Eager to Lazy DataFrames with Narwhals | [🔗](https://codecut.ai/eager-to-lazy-dataframes-with-narwhals/) | [🔗](https://codecuttech.github.io/Data-science/data_science_tools/narwhals_row_ordering.html) | |
-| Speed-up Tools | Scaling Pandas Workflows with PySpark's Pandas API | [🔗](https://codecut.ai/scaling-pandas-workflows-with-pysparks-pandas-api/) | [🔗](https://codecuttech.github.io/Data-science/data_science_tools/pandas_api_on_spark.html) | |
+| LLM | Enforce Structured Outputs from LLMs with PydanticAI | [🔗](https://codecut.ai/enforce-structured-outputs-from-llms-with-pydanticai/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/blob/master/llm/pydantic_ai_examples.ipynb) | |
+| LLM | Run Private AI Workflows with LangChain and Ollama | [🔗](https://codecut.ai/private-ai-workflows-langchain-ollama/?utm_source=github&utm_medium=data_science_repo&utm_campaign=blog) | [🔗](https://github.com/codecuttech/Data-science/blob/master/llm/langchain_ollama.ipynb) | |
+| Speed-up Tools | Writing Safer PySpark Queries with Parameters | [🔗](https://codecut.ai/pyspark-sql-enhancing-reusability-with-parameterized-queries/) | [🔗](https://github.com/codecuttech/Data-science/blob/master/data_science_tools/pyspark_parametrize.ipynb) | |
+| Speed-up Tools | Narwhals: Unified DataFrame Functions for pandas, Polars, and PySpark | [🔗](https://codecut.ai/unified-dataframe-functions-pandas-polars-pyspark/) | [🔗](https://github.com/codecuttech/Data-science/blob/master/data_science_tools/narwhals.ipynb) | |
+| Speed-up Tools | Eager to Lazy DataFrames with Narwhals | [🔗](https://codecut.ai/eager-to-lazy-dataframes-with-narwhals/) | [🔗](https://github.com/codecuttech/Data-science/blob/master/data_science_tools/narwhals_row_ordering.ipynb) | |
+| Speed-up Tools | Scaling Pandas Workflows with PySpark's Pandas API | [🔗](https://codecut.ai/scaling-pandas-workflows-with-pysparks-pandas-api/) | [🔗](https://github.com/codecuttech/Data-science/blob/master/data_science_tools/pandas_api_on_spark.ipynb) | |
 
 ## Contributing
 
data_science_tools/narwhals.ipynb

Lines changed: 301 additions & 0 deletions
@@ -0,0 +1,301 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0",
+   "metadata": {
+    "marimo": {
+     "config": {
+      "hide_code": true
+     }
+    }
+   },
+   "source": [
+    "# Motivation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import datetime\n",
+    "\n",
+    "import pandas as pd\n",
+    "\n",
+    "df = pd.DataFrame(\n",
+    "    {\n",
+    "        \"date\": [datetime(2020, 1, 1), datetime(2020, 1, 8), datetime(2020, 2, 3)],\n",
+    "        \"price\": [1, 4, 3],\n",
+    "    }\n",
+    ")\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def monthly_aggregate_pandas(user_df):\n",
+    "    return user_df.resample(\"MS\", on=\"date\")[[\"price\"]].mean()\n",
+    "\n",
+    "monthly_aggregate_pandas(df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3",
+   "metadata": {
+    "marimo": {
+     "config": {
+      "hide_code": true
+     }
+    }
+   },
+   "source": [
+    "# Dataframe-agnostic data science\n",
+    "\n",
+    "\n",
+    "## Bad solution: just convert to pandas"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import duckdb\n",
+    "import polars as pl\n",
+    "import pyarrow as pa\n",
+    "import pyspark\n",
+    "import pyspark.sql.functions as F\n",
+    "from pyspark.sql import SparkSession\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def monthly_aggregate_bad(user_df):\n",
+    "    if isinstance(user_df, pd.DataFrame):\n",
+    "        df = user_df\n",
+    "    elif isinstance(user_df, pl.DataFrame):\n",
+    "        df = user_df.to_pandas()\n",
+    "    elif isinstance(user_df, duckdb.DuckDBPyRelation):\n",
+    "        df = user_df.df()\n",
+    "    elif isinstance(user_df, pa.Table):\n",
+    "        df = user_df.to_pandas()\n",
+    "    elif isinstance(user_df, pyspark.sql.dataframe.DataFrame):\n",
+    "        df = user_df.toPandas()\n",
+    "    else:\n",
+    "        raise TypeError(\"Unsupported DataFrame type: cannot convert to pandas\")\n",
+    "\n",
+    "    return df.resample(\"MS\", on=\"date\")[[\"price\"]].mean()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = {\n",
+    "    \"date\": [datetime(2020, 1, 1), datetime(2020, 1, 8), datetime(2020, 2, 3)],\n",
+    "    \"price\": [1, 4, 3],\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pandas\n",
+    "pandas_df = pd.DataFrame(data)\n",
+    "monthly_aggregate_bad(pandas_df)\n",
+    "\n",
+    "# polars\n",
+    "polars_df = pl.DataFrame(data)\n",
+    "monthly_aggregate_bad(polars_df)\n",
+    "\n",
+    "# duckdb\n",
+    "duckdb_df = duckdb.from_df(pandas_df)\n",
+    "monthly_aggregate_bad(duckdb_df)\n",
+    "\n",
+    "# pyspark\n",
+    "spark = SparkSession.builder.getOrCreate()\n",
+    "spark_df = spark.createDataFrame(pandas_df)\n",
+    "monthly_aggregate_bad(spark_df)\n",
+    "\n",
+    "# pyarrow\n",
+    "arrow_table = pa.table(data)\n",
+    "monthly_aggregate_bad(arrow_table)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8",
+   "metadata": {
+    "marimo": {
+     "config": {
+      "hide_code": true
+     }
+    }
+   },
+   "source": [
+    "## Unmaintainable solution: different branches for each library"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def monthly_aggregate_unmaintainable(user_df):\n",
+    "    if isinstance(user_df, pd.DataFrame):\n",
+    "        result = user_df.resample(\"MS\", on=\"date\")[[\"price\"]].mean()\n",
+    "    elif isinstance(user_df, pl.DataFrame):\n",
+    "        result = (\n",
+    "            user_df.group_by(pl.col(\"date\").dt.truncate(\"1mo\"))\n",
+    "            .agg(pl.col(\"price\").mean())\n",
+    "            .sort(\"date\")\n",
+    "        )\n",
+    "    elif isinstance(user_df, pyspark.sql.dataframe.DataFrame):\n",
+    "        result = (\n",
+    "            user_df.withColumn(\"date_month\", F.date_trunc(\"month\", F.col(\"date\")))\n",
+    "            .groupBy(\"date_month\")\n",
+    "            .agg(F.mean(\"price\").alias(\"price_mean\"))\n",
+    "            .orderBy(\"date_month\")\n",
+    "        )\n",
+    "    # TODO: more branches for DuckDB, PyArrow, Dask, etc... :sob:\n",
+    "    return result\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "10",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pandas\n",
+    "monthly_aggregate_unmaintainable(pandas_df)\n",
+    "\n",
+    "# polars\n",
+    "monthly_aggregate_unmaintainable(polars_df)\n",
+    "\n",
+    "# pyspark\n",
+    "monthly_aggregate_unmaintainable(spark_df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "11",
+   "metadata": {
+    "marimo": {
+     "config": {
+      "hide_code": true
+     }
+    }
+   },
+   "source": [
+    "## Best solution: Narwhals as a unified dataframe interface"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "12",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import narwhals as nw\n",
+    "from narwhals.typing import IntoFrameT\n",
+    "\n",
+    "\n",
+    "def monthly_aggregate(user_df: IntoFrameT) -> IntoFrameT:\n",
+    "    return (\n",
+    "        nw.from_native(user_df)\n",
+    "        .group_by(nw.col(\"date\").dt.truncate(\"1mo\"))\n",
+    "        .agg(nw.col(\"price\").mean())\n",
+    "        .sort(\"date\")\n",
+    "        .to_native()\n",
+    "    )\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "13",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# pandas\n",
+    "monthly_aggregate(pandas_df)\n",
+    "\n",
+    "# polars\n",
+    "monthly_aggregate(polars_df)\n",
+    "\n",
+    "# duckdb\n",
+    "monthly_aggregate(duckdb_df)\n",
+    "\n",
+    "# pyarrow\n",
+    "monthly_aggregate(arrow_table)\n",
+    "\n",
+    "# pyspark\n",
+    "monthly_aggregate(spark_df)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14",
+   "metadata": {
+    "marimo": {
+     "config": {
+      "hide_code": true
+     }
+    }
+   },
+   "source": [
+    "## Bonus - can we generate SQL?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "15",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sqlframe.duckdb import DuckDBSession\n",
+    "\n",
+    "sqlframe = DuckDBSession()\n",
+    "sqlframe_df = sqlframe.createDataFrame(pandas_df)\n",
+    "sqlframe_result = monthly_aggregate(sqlframe_df)\n",
+    "print(sqlframe_result.sql(dialect=\"databricks\"))"
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
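
The "marimo": {"config": {"hide_code": true}} metadata blocks above carry marimo cell settings through the conversion to Jupyter format. For the conversion step itself, here is a hypothetical batch helper, assuming a recent marimo release that provides the `marimo export ipynb` subcommand; the directory and glob pattern are illustrative, not taken from this commit:

import subprocess
from pathlib import Path

# Convert each marimo notebook (a plain Python file) into a Jupyter
# notebook alongside it; assumes `marimo` is installed and on PATH.
for py_nb in Path("data_science_tools").rglob("*.py"):
    out = py_nb.with_suffix(".ipynb")
    subprocess.run(
        ["marimo", "export", "ipynb", str(py_nb), "-o", str(out)],
        check=True,
    )

With the nbstripout hook from .pre-commit-config.yaml in place, any outputs produced by running these exported notebooks are stripped before they can be committed.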
