CodeCutTech
diff --git a/Chapter1/class.ipynb b/Chapter1/class.ipynb
Lines changed: 25 additions & 11 deletions
diff --git a/Chapter5/SQL.ipynb b/Chapter5/SQL.ipynb
Lines changed: 161 additions & 1 deletion
@@ -229,33 +229,47 @@
    ]
   },
   {
-   "attachments": {},
    "cell_type": "markdown",
-   "id": "0fd4cc9e",
+   "id": "37406782-1dc9-4960-b1d5-875113befde7",
    "metadata": {},
    "source": [
-    "An instance-level method requires instantiating a class object to operate, while a class method doesn't.\n",
+    "An instance-level method requires instantiating a class object to operate, while a class method doesn’t.\n",
     "\n",
-    "Class methods can provide alternate ways to construct objects. In the code below, the `from_csv` class method instantiates the class by reading data from a CSV file."
+    "Class methods can provide alternate ways to construct objects. In the code below, the `from_csv` class method instantiates the class by reading data from a CSV file."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 22,
-   "id": "69f4f781",
+   "execution_count": 1,
+   "id": "b341cc6c-a746-486c-8cd6-a7a229466331",
    "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
     "tags": [
-     "remove-cell"
+     "remove-input"
     ]
    },
    "outputs": [],
    "source": [
-    "!echo \"Name,Age,Country\" > data.csv && echo \"Alice,25,USA\" >> data.csv && echo \"Bob,30,Canada\" >> data.csv"
+    "import pandas as pd\n",
+    "\n",
+    "# Create example dataframe\n",
+    "df = pd.DataFrame({\n",
+    "    'age': [25, 34, 45],\n",
+    "    'income': [50000, 75000, 65000],\n",
+    "    'education_years': [16, 18, 14],\n",
+    "    'satisfaction_score': [8, 7, 9]\n",
+    "})\n",
+    "\n",
+    "# Save the dataframe to CSV\n",
+    "df.to_csv('data.csv', index=False)"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 26,
+   "execution_count": 3,
    "id": "53c1614d",
    "metadata": {
     "ExecuteTime": {
@@ -295,7 +309,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": 4,
    "id": "0afa6c77",
    "metadata": {
     "ExecuteTime": {
@@ -308,7 +322,7 @@
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "Shape of data: (2, 3)\n"
+      "Shape of data: (3, 4)\n"
      ]
     }
    ],
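
The notebook cell that defines the class is elided from this hunk, so the following is only a minimal sketch of the pattern the markdown cell describes: a `from_csv` class method used as an alternate constructor. The class name `DataAnalyzer`, its `shape` method, and the use of the standard-library `csv` module (instead of pandas) are assumptions for illustration, not the notebook's actual code. The sample file mirrors the columns written by the setup cell in this diff.

```python
import csv
import os
import tempfile


class DataAnalyzer:  # hypothetical name, not taken from the notebook
    def __init__(self, rows):
        # rows is a list of dicts, one per CSV record
        self.rows = rows

    @classmethod
    def from_csv(cls, path):
        # Alternate constructor: called on the class itself, no instance needed
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        return cls(rows)

    def shape(self):
        # (number of records, number of columns)
        n_cols = len(self.rows[0]) if self.rows else 0
        return (len(self.rows), n_cols)


# Write a small CSV with the same columns as the setup cell, then load it
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["age", "income", "education_years", "satisfaction_score"])
        writer.writerows([[25, 50000, 16, 8], [34, 75000, 18, 7], [45, 65000, 14, 9]])
    analyzer = DataAnalyzer.from_csv(path)

print("Shape of data:", analyzer.shape())
```

Because `from_csv` receives the class as `cls`, a subclass that inherits it would get an instance of the subclass back without overriding anything.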
 
@@ -997,7 +997,167 @@
    "id": "e5b77a38",
    "metadata": {},
    "source": [
-    "[Link to DuckDB](https://bit.ly/4dJxNHV)."
+    "[Link to DuckDB](https://github.com/duckdb/duckdb)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3be9869b-68e7-453d-bdba-6c678d0482d3",
+   "metadata": {},
+   "source": [
+    "### DuckDB: Query Pandas DataFrames Faster with Columnar Storage"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "59a4e2bb-2c3c-4276-a163-a98f1b15625c",
+   "metadata": {
+    "editable": true,
+    "slideshow": {
+     "slide_type": ""
+    },
+    "tags": [
+     "hide-cell"
+    ]
+   },
+   "outputs": [],
+   "source": [
+    "!pip install duckdb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "484e2a53-aae2-4849-9182-80ca95ea6026",
+   "metadata": {},
+   "source": [
+    "Analytical queries such as GROUP BY, SUM, or AVG usually touch only a few columns. Row-based storage keeps all of a row's values together, so the engine must load entire rows even when only a few columns are needed, which reads unnecessary data and wastes memory.\n",
+    "\n",
+    "Example using SQLite (row-based):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "1171bdc1-0009-406a-be97-896eb0ed6ee5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sqlite3\n",
+    "import pandas as pd\n",
+    "\n",
+    "customer = pd.DataFrame({\n",
+    "    \"id\": [1, 2, 3],\n",
+    "    \"name\": [\"Alex\", \"Ben\", \"Chase\"],\n",
+    "    \"age\": [25, 30, 35]\n",
+    "})\n",
+    "\n",
+    "# Load data to SQLite and query\n",
+    "conn = sqlite3.connect(':memory:')\n",
+    "customer.to_sql('customer', conn, index=False)\n",
+    "\n",
+    "# Must read all columns internally even though we only need 'age'\n",
+    "query = \"SELECT age FROM customer\"\n",
+    "result = pd.read_sql(query, conn)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "56d499ac-7323-493f-859e-7921a052fdc7",
+   "metadata": {},
+   "source": [
+    "DuckDB uses columnar storage, allowing you to efficiently read and process only the columns needed for your analysis. This improves both query speed and memory usage:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "ae65be7f-0e4b-485e-b403-4a2943a24578",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>age</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>25</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>30</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>35</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "   age\n",
+       "0   25\n",
+       "1   30\n",
+       "2   35"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "import duckdb\n",
+    "import pandas as pd\n",
+    "\n",
+    "customer = pd.DataFrame({\n",
+    "    \"id\": [1, 2, 3],\n",
+    "    \"name\": [\"Alex\", \"Ben\", \"Chase\"],\n",
+    "    \"age\": [25, 30, 35]\n",
+    "})\n",
+    "\n",
+    "# DuckDB resolves the table name `customer` to the pandas DataFrame in scope\n",
+    "query = \"SELECT age FROM customer\"\n",
+    "result = duckdb.sql(query).df()\n",
+    "result"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ada7a04d-783d-4599-842e-4160b3cdd58d",
+   "metadata": {},
+   "source": [
+    "In this example, DuckDB only needs to scan the 'age' column in memory, while SQLite reads each full row ('id', 'name', 'age') internally even though only 'age' is selected. DuckDB also simplifies the workflow: it queries the pandas DataFrame directly, with no separate step to load the data into a database."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1f895dd7-0a7c-471d-a6b4-94b8e5d18748",
+   "metadata": {},
+   "source": [
+    "[Link to DuckDB](https://github.com/duckdb/duckdb)."
    ]
   },
   {
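
To make the row-versus-column argument in the added cells concrete, here is a toy model in plain Python. It is a sketch under strong simplifying assumptions: real engines such as SQLite and DuckDB work with pages, compression, and vectorized execution, none of which is modeled here. The variable names are hypothetical, and the data mirrors the `customer` table from the diff.

```python
# Row layout: each record keeps all of its values together
rows = [
    (1, "Alex", 25),
    (2, "Ben", 30),
    (3, "Chase", 35),
]

# Column layout: each column is stored as its own contiguous list
columns = {
    "id": [1, 2, 3],
    "name": ["Alex", "Ben", "Chase"],
    "age": [25, 30, 35],
}

# SELECT age on the row layout: every full record is visited
values_touched_row = sum(len(record) for record in rows)
ages_from_rows = [record[2] for record in rows]

# SELECT age on the column layout: only the 'age' list is read
ages_from_columns = columns["age"]
values_touched_col = len(ages_from_columns)

print(ages_from_rows == ages_from_columns)
print(values_touched_row, values_touched_col)
```

Both layouts return the same ages, but the row layout passes over nine stored values while the column layout reads three. That gap widens as a table gains columns the query never uses.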