|
110 | 110 | "id": "573e3156-8039-41bf-a919-295316648c8b",
|
111 | 111 | "metadata": {},
|
112 | 112 | "source": [
|
113 |
| - "on its own, " |
| 113 | + "On its own, this output isn't all that readable - the summary statistics are put into individual columns, which means that the table is very wide. Let's see how we can re-arrange this so that we have a `dict()` of **DataFrames**, one for each station. First, we'll assign the output of `.describe()` to a variable, `group_summary`:" |
114 | 114 | ]
|
115 | 115 | },
|
116 | 116 | {
|
|
123 | 123 | "group_summary = station_data.groupby('station').describe()"
|
124 | 124 | ]
|
125 | 125 | },
|
| 126 | + { |
| 127 | + "cell_type": "markdown", |
| 128 | + "id": "513dceb6-9d93-4f99-bba0-88eb94c4ff6c", |
| 129 | + "metadata": {}, |
| 130 | + "source": [ |
| 131 | + "Next, we'll iterate over the stations to work with each row of the table in turn. First, though, let's look at how the column names are organized:" |
| 132 | + ] |
| 133 | + }, |
126 | 134 | {
|
127 | 135 | "cell_type": "code",
|
128 | 136 | "execution_count": null,
|
129 |
| - "id": "d48684a5-f474-460f-bf06-a50c64c9b20a", |
| 137 | + "id": "785b5b9c-1999-4002-bfce-4ed298dd814a", |
130 | 138 | "metadata": {},
|
131 | 139 | "outputs": [],
|
132 |
| - "source": [] |
| 140 | + "source": [ |
| 141 | + "group_summary.columns # show the column names" |
| 142 | + ] |
| 143 | + }, |
| 144 | + { |
| 145 | + "cell_type": "markdown", |
| 146 | + "id": "6b36eedd-cadb-4fd2-b64e-f97975760769", |
| 147 | + "metadata": {}, |
| 148 | + "source": [ |
| 149 | + "This is an example of a **MultiIndex** ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html)) - a multi-level index object, similar to what we have seen previously for rows. Before beginning the `for` loop below, we use `columns.unique()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.Index.unique.html)) to get the unique first-level names from the columns (i.e., the variable names from the original **DataFrame**). \n", |
| 150 | + "\n", |
| 151 | + "Inside of the `for` loop, we first select the row corresponding to each station using `.loc`. Have a look at this line:\n", |
| 152 | + "\n", |
| 153 | + "```python\n", |
| 154 | + "reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1)\n", |
| 155 | + "\n", |
| 156 | + "```\n", |
| 157 | + "\n", |
| 158 | + "This uses something called [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to quickly create a list of objects. It is effectively the same as writing something like:\n", |
| 159 | + "\n", |
| 160 | + "```python\n", |
| 161 | + "out_list = []\n", |
| 162 | + "for ind in columns:\n", |
| 163 | + " out_list.append(this_summary[ind])\n", |
| 164 | + "\n", |
| 165 | + "reshaped = pd.concat(out_list, axis=1)\n", |
| 166 | + "\n", |
| 167 | + "```\n", |
| 168 | + "\n", |
| 169 | + "Using list comprehension helps make the code more concise and readable - it's a very handy tool for creating lists with iteration. In addition to list comprehension, python also has [dict comprehension](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) - we won't use this here, but it works in a very similar way to list comprehension.\n", |
| 170 | + "\n", |
| 171 | + "Once we have reshaped the row (the **Series**) into a **DataFrame**, we assign the column names, before using `.append()` to add the reshaped table to a **list**:" |
| 172 | + ] |
133 | 173 | },
|
134 | 174 | {
|
135 | 175 | "cell_type": "code",
|
|
139 | 179 | "outputs": [],
|
140 | 180 | "source": [
|
141 | 181 | "stations = group_summary.index.unique() # get the unique values of station from the table\n",
|
| 182 | + "columns = group_summary.columns.unique(level=0) # get the unique names of the columns from the first level (level 0)\n", |
142 | 183 | "\n",
|
143 | 184 | "combined_stats = [] # initialize an empty list\n",
|
144 | 185 | "\n",
|
145 | 186 | "for station in stations:\n",
|
146 | 187 | " this_summary = group_summary.loc[station] # get the row corresponding to this station\n",
|
147 |
| - " columns = this_summary.index.unique(level=0) # get the unique variable names from the multi-index\n", |
148 | 188 | " \n",
|
149 | 189 | " reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1) # use list comprehension to reshape the table\n",
|
150 | 190 | " reshaped.columns = columns # set the column names\n",
|
151 |
| - " combined_stats.append(reshaped) # add the reshaped table to the list\n", |
152 |
| - "\n", |
| 191 | + " combined_stats.append(reshaped) # add the reshaped table to the list" |
| 192 | + ] |
| 193 | + }, |
| 194 | + { |
| 195 | + "cell_type": "markdown", |
| 196 | + "id": "815551ca-e1f6-4981-a473-768cd2abbf96", |
| 197 | + "metadata": {}, |
| 198 | + "source": [ |
| 199 | + "Finally, we'll use the built-in function `zip()` to get pairs of station names (from `station`) and **DataFrame**s (from `combined_stats`), then pass this to `dict()` to create a dictionary of station name/**DataFrame** key/value pairs:" |
| 200 | + ] |
| 201 | + }, |
| 202 | + { |
| 203 | + "cell_type": "code", |
| 204 | + "execution_count": null, |
| 205 | + "id": "77ccd5f4-bc76-4eeb-bb97-7a2e43f28991", |
| 206 | + "metadata": {}, |
| 207 | + "outputs": [], |
| 208 | + "source": [ |
153 | 209 | "summary_dict = dict(zip(stations, combined_stats)) # create a dict of station name, dataframe pairs"
|
154 | 210 | ]
|
155 | 211 | },
|
156 | 212 | {
|
157 | 213 | "cell_type": "markdown",
|
158 | 214 | "id": "3a8c44d7-d040-43c5-83a7-7d4b6e99d014",
|
159 | 215 | "metadata": {},
|
160 |
| - "source": [] |
| 216 | + "source": [ |
| 217 | + "To check that this worked, let's look at the summary data for Oxford:" |
| 218 | + ] |
161 | 219 | },
|
162 | 220 | {
|
163 | 221 | "cell_type": "code",
|
|
176 | 234 | "source": [
|
177 | 235 | "### using built-in functions for descriptive statistics\n",
|
178 | 236 | "\n",
|
179 |
| - "This is helpful, but sometimes we want to calculate other descriptive statistics, or use the values of descriptive statistics in our code. `pandas` has a number of built-in functions for this - we have already seen `.mean()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)), for calculating the arithmetic mean of a **DataFrame**:" |
| 237 | + "This is helpful, but sometimes we want to calculate other descriptive statistics, or use the values of descriptive statistics in our code. `pandas` has a number of built-in functions for this - we have already seen `.mean()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)), for calculating the arithmetic mean of each column of a **DataFrame**:" |
180 | 238 | ]
|
181 | 239 | },
|
182 | 240 | {
|
|
212 | 270 | "id": "e87f5186-0c0e-4949-b88f-065b0a135217",
|
213 | 271 | "metadata": {},
|
214 | 272 | "source": [
|
215 |
| - "`.median()`" |
| 273 | + "We can calculate the median of the columns of a **DataFrame** (or a **Series**) using `.median()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)): " |
216 | 274 | ]
|
217 | 275 | },
|
218 | 276 | {
|
|
222 | 280 | "metadata": {},
|
223 | 281 | "outputs": [],
|
224 | 282 | "source": [
|
225 |
| - "station_data.mean(numeric_only=True)" |
| 283 | + "station_data.median(numeric_only=True) # calculate the median of each numeric column" |
226 | 284 | ]
|
227 | 285 | },
|
228 | 286 | {
|
229 | 287 | "cell_type": "markdown",
|
230 | 288 | "id": "35dbde86-4d81-4634-9a68-44c421eda65e",
|
231 | 289 | "metadata": {},
|
232 | 290 | "source": [
|
233 |
| - "`.var()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)):" |
| 291 | + "To calculate the variance of the columns of a **DataFrame** (or a **Series**), use `.var()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)):" |
234 | 292 | ]
|
235 | 293 | },
|
236 | 294 | {
|
|
248 | 306 | "id": "c97bcbb4-2c2e-4c36-82a0-3541f354132e",
|
249 | 307 | "metadata": {},
|
250 | 308 | "source": [
|
251 |
| - "`.std()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)):" |
| 309 | + "and for the standard deviation, `.std()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)):" |
252 | 310 | ]
|
253 | 311 | },
|
254 | 312 | {
|
|
266 | 324 | "id": "8ff8c916-0857-41b6-be4f-aa8c87b599d4",
|
267 | 325 | "metadata": {},
|
268 | 326 | "source": [
|
269 |
| - "`.quantile()` ([documentation]()):" |
| 327 | + "`pandas` doesn't have a built-in function for the inter-quartile range (IQR), but we can easily calculate it ourselves using `.quantile()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html)) to calculte the 3rd quantile and the 1st quantile and subtracting the outputs:" |
270 | 328 | ]
|
271 | 329 | },
|
272 | 330 | {
|
|
284 | 342 | "id": "4ab70444-f7bd-4651-8a21-51579a7b0195",
|
285 | 343 | "metadata": {},
|
286 | 344 | "source": [
|
287 |
| - "`.sum()` " |
| 345 | + "Finally, we can also calculate the sum of each column of a **DataFrame** (or a **Series**) using `.sum()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html)):" |
288 | 346 | ]
|
289 | 347 | },
|
290 | 348 | {
|
|
302 | 360 | "id": "3da222bf-46aa-478f-af6f-da63128411ee",
|
303 | 361 | "metadata": {},
|
304 | 362 | "source": [
|
305 |
| - "### with .groupby()" |
| 363 | + "These are far from the only methods available, but they are some of the most common. For a full list, check the `pandas` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) under **Methods**.\n", |
| 364 | + "\n", |
| 365 | + "### with .groupby()\n", |
| 366 | + "\n", |
| 367 | + "As we have seen, the output of `.groupby()` is a special type of **DataFrame**, and it inherits almost all of the methods for calculating summary statistics:" |
306 | 368 | ]
|
307 | 369 | },
|
308 | 370 | {
|
|
358 | 420 | "metadata": {},
|
359 | 421 | "outputs": [],
|
360 | 422 | "source": [
|
361 |
| - "armagh_rain = selected.loc[selected['station'] == 'armagh', 'rain'].dropna().sample(n=50) # take a sample of 30 rain observations from armagh\n", |
362 |
| - "stornoway_rain = selected.loc[selected['station'] == 'stornoway', 'rain'].dropna().sample(n=50) # take a sample of 30 rain observations from stornoway\n", |
| 423 | + "armagh_rain = selected.loc[selected['station'] == 'armagh', 'rain'].dropna().sample(n=30) # take a sample of 30 rain observations from armagh\n", |
| 424 | + "stornoway_rain = selected.loc[selected['station'] == 'stornoway', 'rain'].dropna().sample(n=30) # take a sample of 30 rain observations from stornoway\n", |
363 | 425 | "\n",
|
364 | 426 | "# test whether stornoway_rain.mean() > armagh_rain.mean() at the 99% confidence level\n",
|
365 | 427 | "rain_comp = pg.ttest(stornoway_rain, armagh_rain, alternative='greater', confidence=0.99)\n",
|
|