Commit 5dc38d7

add content
1 parent b754a31 commit 5dc38d7


13 files changed (+805 -165 lines)

Chapter1/dictionary.ipynb

Lines changed: 80 additions & 51 deletions
@@ -175,83 +175,112 @@
  "id": "5caf1b3f",
  "metadata": {},
  "source": [
- "### dict.get: Get the Default Value of a Dictionary if a Key Doesn't Exist"
+ "### Stop Writing Nested if-else: Use Python's .get() Instead"
  ]
  },
  {
  "cell_type": "markdown",
- "id": "c2dd1725",
+ "id": "955af7b3-81c9-44d7-8b96-3113d7b3b199",
  "metadata": {},
  "source": [
- "If you want to get the default value when a key doesn't exist in a dictionary, use `dict.get`. In the code below, since there is no key `meeting3`, the default value `online` is returned. "
+ "When working with dictionaries in Python, it's common to encounter situations where you need to access values that may or may not exist. The traditional approach of using multiple nested if-else statements can result in verbose, repetitive code that's harder to maintain and more prone to errors.\n",
+ "\n",
+ "Let's consider an example where we have a dictionary `user_data` with keys \"name\", \"age\", and possibly \"email\". We want to assign default values to these keys if they don't exist."
  ]
  },
  {
  "cell_type": "code",
- "execution_count": 5,
- "id": "e066cd73",
- "metadata": {
- "ExecuteTime": {
- "end_time": "2021-09-30T12:49:56.062960Z",
- "start_time": "2021-09-30T12:49:56.056983Z"
+ "execution_count": 2,
+ "id": "88422abc-4db3-42fb-a655-938c2b1db0c4",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "name='Alice'\n",
+ "age=30\n",
+ "email='[email protected]'\n"
+ ]
  }
- },
- "outputs": [],
+ ],
  "source": [
- "locations = {'meeting1': 'room1', 'meeting2': 'room2'}"
+ "# Checking dictionary values with multiple if-else\n",
+ "user_data = {\"name\": \"Alice\", \"age\": 30}\n",
+ "\n",
+ "# Repetitive code with multiple default values\n",
+ "if \"name\" in user_data:\n",
+ " name = user_data[\"name\"]\n",
+ "else:\n",
+ " name = \"Unknown\"\n",
+ " \n",
+ "if \"age\" in user_data:\n",
+ " age = user_data[\"age\"]\n",
+ "else:\n",
+ " age = 0\n",
+ " \n",
+ "if \"email\" in user_data:\n",
+ " email = user_data[\"email\"]\n",
+ "else:\n",
+ " email = \"[email protected]\"\n",
+ "\n",
+ "\n",
+ "print(f\"{name=}\")\n",
+ "print(f\"{age=}\")\n",
+ "print(f\"{email=}\")"
  ]
  },
  {
- "cell_type": "code",
- "execution_count": 6,
- "id": "d07fcf3e",
- "metadata": {
- "ExecuteTime": {
- "end_time": "2021-09-30T12:49:59.099210Z",
- "start_time": "2021-09-30T12:49:59.090362Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "'room1'"
- ]
- },
- "execution_count": 6,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
+ "cell_type": "markdown",
+ "id": "21613f3c-afca-4d32-a158-8efd70503675",
+ "metadata": {},
+ "source": [
+ "As you can see, this approach is tedious and prone to errors. "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ea9eef2a-cd12-4439-82b7-cbfe84ccc42d",
+ "metadata": {},
  "source": [
- "locations.get('meeting1', 'online')"
+ "With the `.get()` method, we can access dictionary values with default values in a single line of code. This approach is not only more concise but also more readable and maintainable."
  ]
  },
  {
  "cell_type": "code",
  "execution_count": 3,
- "id": "6de353f4",
- "metadata": {
- "ExecuteTime": {
- "end_time": "2021-09-30T12:49:46.738598Z",
- "start_time": "2021-09-30T12:49:46.729582Z"
- }
- },
+ "id": "d97082d7-c441-4b7e-a011-a4f152ac92f0",
+ "metadata": {},
  "outputs": [
  {
- "data": {
- "text/plain": [
- "'online'"
- ]
- },
- "execution_count": 3,
- "metadata": {},
- "output_type": "execute_result"
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "name='Alice'\n",
+ "age=30\n",
+ "email='[email protected]'\n"
+ ]
  }
  ],
  "source": [
- "locations.get('meeting3', 'online')"
+ "# Using .get() method for cleaner code\n",
+ "user_data = {\"name\": \"Alice\", \"age\": 30}\n",
+ "\n",
+ "name = user_data.get(\"name\", \"Unknown\")\n",
+ "age = user_data.get(\"age\", 0)\n",
+ "email = user_data.get(\"email\", \"[email protected]\")\n",
+ "\n",
+ "print(f\"{name=}\")\n",
+ "print(f\"{age=}\")\n",
+ "print(f\"{email=}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c2dd1725",
+ "metadata": {},
+ "source": [
+ "If you want to get the default value when a key doesn't exist in a dictionary, use `dict.get`. In the code below, since there is no key `meeting3`, the default value `online` is returned. "
  ]
  },
  {
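
For readers skimming the diff, the pattern the new dictionary.ipynb cells teach condenses to a few lines of plain Python. The sketch below is illustrative only: the `defaults` mapping, its key names, and the placeholder email address are assumptions added here, not part of the notebook.

# Minimal sketch of the dict.get() pattern added above.
# The `defaults` mapping and placeholder email are illustrative assumptions.
user_data = {"name": "Alice", "age": 30}
defaults = {"name": "Unknown", "age": 0, "email": "unknown@example.com"}

# dict.get(key, default) returns the stored value when the key exists and the
# default otherwise, without raising KeyError or mutating the dictionary.
resolved = {key: user_data.get(key, default) for key, default in defaults.items()}

print(resolved)  # {'name': 'Alice', 'age': 30, 'email': 'unknown@example.com'}

Compared with the if/else version in the diff, this keeps each default next to its key, so adding a new optional field is a one-line change.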

Chapter5/spark.ipynb

Lines changed: 193 additions & 0 deletions
@@ -2000,6 +2000,199 @@
  "SquareNumbers(lit(1), lit(3)).show()"
  ]
  },
+ {
+ "cell_type": "markdown",
+ "id": "216e9fc0-12a9-4f45-85b1-8e791755b1d3",
+ "metadata": {},
+ "source": [
+ "### Best Practices for PySpark DataFrame Comparison Testing"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9badb0ee-16ec-4291-9477-8a38ebd7e876",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "!pip install \"pyspark[sql]\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "d2adcd65-5197-404f-88d6-c368a863cf75",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "from pyspark.sql import SparkSession\n",
+ "\n",
+ "# Create SparkSession\n",
+ "spark = SparkSession.builder.getOrCreate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1002536a",
+ "metadata": {},
+ "source": [
+ "Manually comparing PySpark DataFrame outputs using `collect()` and equality comparison leads to brittle tests due to ordering issues and unclear error messages when data doesn't match expectations.\n",
+ "\n",
+ "For example, the following test will fail due to ordering issues, resulting in an unclear error message.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "id": "e4299f30",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " \r"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "assert [Row(id=1, name='Alice', value=100), Row(id=2, name='Bob', value=200)] == [Row(id=2, name='Bob', value=200), Row(id=1, name='Alice', value=100)]\n",
+ " + where [Row(id=1, name='Alice', value=100), Row(id=2, name='Bob', value=200)] = <bound method DataFrame.collect of DataFrame[id: bigint, name: string, value: bigint]>()\n",
+ " + where <bound method DataFrame.collect of DataFrame[id: bigint, name: string, value: bigint]> = DataFrame[id: bigint, name: string, value: bigint].collect\n",
+ " + and [Row(id=2, name='Bob', value=200), Row(id=1, name='Alice', value=100)] = <bound method DataFrame.collect of DataFrame[id: bigint, name: string, value: bigint]>()\n",
+ " + where <bound method DataFrame.collect of DataFrame[id: bigint, name: string, value: bigint]> = DataFrame[id: bigint, name: string, value: bigint].collect\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Manual DataFrame comparison\n",
+ "result_df = spark.createDataFrame(\n",
+ " [(1, \"Alice\", 100), (2, \"Bob\", 200)], [\"id\", \"name\", \"value\"]\n",
+ ")\n",
+ "\n",
+ "expected_df = spark.createDataFrame(\n",
+ " [(2, \"Bob\", 200), (1, \"Alice\", 100)], [\"id\", \"name\", \"value\"]\n",
+ ")\n",
+ "\n",
+ "try:\n",
+ " assert result_df.collect() == expected_df.collect()\n",
+ "except AssertionError as e:\n",
+ " print(e)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7c4f8fd8-c2c2-4804-8e42-6fd3eb6aec27",
+ "metadata": {},
+ "source": [
+ "`assertDataFrameEqual` provides a robust way to compare DataFrames, allowing for order-independent comparison.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "73b0d483-8b00-44ab-9279-4c7765ca1ff6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Testing with DataFrame equality\n",
+ "from pyspark.testing.utils import assertDataFrameEqual"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "7c46ae8a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "assertDataFrameEqual(result_df, expected_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "085f150d-20ff-4b0a-a4ab-1ee452598e9e",
+ "metadata": {},
+ "source": [
+ "Using `collect()` for comparison cannot detect type mismatch, whereas `assertDataFrameEqual` can.\n",
+ "\n",
+ "For example, the following test will pass, even though there is a type mismatch.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "id": "da7494c0-c05f-4a2f-a411-805c8f2f73ba",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Manual DataFrame comparison\n",
+ "result_df = spark.createDataFrame(\n",
+ " [(1, \"Alice\", 100), (2, \"Bob\", 200)], [\"id\", \"name\", \"value\"]\n",
+ ")\n",
+ "\n",
+ "expected_df = spark.createDataFrame(\n",
+ " [(1, \"Alice\", 100.0), (2, \"Bob\", 200.0)], [\"id\", \"name\", \"value\"]\n",
+ ")\n",
+ "\n",
+ "assert result_df.collect() == expected_df.collect()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "82914b3b-69c0-4c68-9d72-d2ce31417397",
+ "metadata": {},
+ "source": [
+ "The error message produced by `assertDataFrameEqual` is clear and informative, highlighting the difference in schemas."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "id": "3faa1dbc-887a-4c36-ace8-c621411c3fb7",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[DIFFERENT_SCHEMA] Schemas do not match.\n",
+ "--- actual\n",
+ "+++ expected\n",
+ "- StructType([StructField('id', LongType(), True), StructField('name', StringType(), True), StructField('value', LongType(), True)])\n",
+ "? ^ ^^\n",
+ "\n",
+ "+ StructType([StructField('id', LongType(), True), StructField('name', StringType(), True), StructField('value', DoubleType(), True)])\n",
+ "? ^ ^^^^\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "try:\n",
+ " assertDataFrameEqual(result_df, expected_df)\n",
+ "except AssertionError as e:\n",
+ " print(e)"
+ ]
+ },
  {
  "cell_type": "markdown",
  "id": "9da7e800",

docs/Chapter1/Chapter1.html

Lines changed: 18 additions & 1 deletion
@@ -272,7 +272,24 @@
  <li class="toctree-l2"><a class="reference internal" href="../Chapter5/testing.html">6.13. Testing</a></li>
  <li class="toctree-l2"><a class="reference internal" href="../Chapter5/SQL.html">6.14. SQL Libraries</a></li>
  <li class="toctree-l2"><a class="reference internal" href="../Chapter5/spark.html">6.15. 3 Powerful Ways to Create PySpark DataFrames</a></li>
- <li class="toctree-l2"><a class="reference internal" href="../Chapter5/llm.html">6.16. Large Language Model (LLM)</a></li>
+ [content of 17 added lines (new lines 275-291) not shown]
+ <li class="toctree-l2"><a class="reference internal" href="../Chapter5/llm.html">6.33. Large Language Model (LLM)</a></li>
  </ul>
  </li>
  <li class="toctree-l1 has-children"><a class="reference internal" href="../Chapter6/Chapter6.html">7. Cool Tools</a><input class="toctree-checkbox" id="toctree-checkbox-7" name="toctree-checkbox-7" type="checkbox"/><label class="toctree-toggle" for="toctree-checkbox-7"><i class="fa-solid fa-chevron-down"></i></label><ul>
