
Commit 4b77b2f

finish stats notebook
1 parent 4d9a2d7 commit 4b77b2f

1 file changed: +79 −17 lines changed


05.stats/stats.ipynb

Lines changed: 79 additions & 17 deletions
@@ -110,7 +110,7 @@
 "id": "573e3156-8039-41bf-a919-295316648c8b",
 "metadata": {},
 "source": [
-"on its own, "
+"On its own, this output isn't all that readable - the summary statistics are put into individual columns, which means that the table is very wide. Let's see how we can re-arrange this so that we have a `dict()` of **DataFrames**, one for each station. First, we'll assign the output of `.describe()` to a variable, `group_summary`:"
 ]
 },
 {
@@ -123,13 +123,53 @@
 "group_summary = station_data.groupby('station').describe()"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "513dceb6-9d93-4f99-bba0-88eb94c4ff6c",
+"metadata": {},
+"source": [
+"Next, we'll iterate over the stations to work with each row of the table in turn. First, though, let's look at how the column names are organized:"
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "d48684a5-f474-460f-bf06-a50c64c9b20a",
+"id": "785b5b9c-1999-4002-bfce-4ed298dd814a",
 "metadata": {},
 "outputs": [],
-"source": []
+"source": [
+"group_summary.columns # show the column names"
+]
+},
+{
+"cell_type": "markdown",
+"id": "6b36eedd-cadb-4fd2-b64e-f97975760769",
+"metadata": {},
+"source": [
+"This is an example of a **MultiIndex** ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html)) - a multi-level index object, similar to what we have seen previously for rows. Before beginning the `for` loop below, we use `columns.unique()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.Index.unique.html)) to get the unique first-level names from the columns (i.e., the variable names from the original **DataFrame**).\n",
+"\n",
+"Inside the `for` loop, we first select the row corresponding to each station using `.loc`. Have a look at this line:\n",
+"\n",
+"```python\n",
+"reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1)\n",
+"```\n",
+"\n",
+"This uses something called [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to quickly create a list of objects. It is effectively the same as writing something like:\n",
+"\n",
+"```python\n",
+"out_list = []\n",
+"for ind in columns:\n",
+"    out_list.append(this_summary[ind])\n",
+"\n",
+"reshaped = pd.concat(out_list, axis=1)\n",
+"```\n",
+"\n",
+"Using list comprehension helps make the code more concise and readable - it's a very handy tool for creating lists with iteration. In addition to list comprehension, Python also has [dict comprehension](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) - we won't use this here, but it works in a very similar way to list comprehension.\n",
+"\n",
+"Once we have reshaped the row (the **Series**) into a **DataFrame**, we assign the column names, before using `.append()` to add the reshaped table to a **list**:"
+]
 },
 {
 "cell_type": "code",
@@ -139,25 +179,43 @@
 "outputs": [],
 "source": [
 "stations = group_summary.index.unique() # get the unique values of station from the table\n",
+"columns = group_summary.columns.unique(level=0) # get the unique names of the columns from the first level (level 0)\n",
 "\n",
 "combined_stats = [] # initialize an empty list\n",
 "\n",
 "for station in stations:\n",
 "    this_summary = group_summary.loc[station] # get the row corresponding to this station\n",
-"    columns = this_summary.index.unique(level=0) # get the unique variable names from the multi-index\n",
 "    \n",
 "    reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1) # use list comprehension to reshape the table\n",
 "    reshaped.columns = columns # set the column names\n",
-"    combined_stats.append(reshaped) # add the reshaped table to the list\n",
-"\n",
+"    combined_stats.append(reshaped) # add the reshaped table to the list"
+]
+},
+{
+"cell_type": "markdown",
+"id": "815551ca-e1f6-4981-a473-768cd2abbf96",
+"metadata": {},
+"source": [
+"Finally, we'll use the built-in function `zip()` to get pairs of station names (from `stations`) and **DataFrame**s (from `combined_stats`), then pass this to `dict()` to create a dictionary of station name/**DataFrame** key/value pairs:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "77ccd5f4-bc76-4eeb-bb97-7a2e43f28991",
+"metadata": {},
+"outputs": [],
+"source": [
 "summary_dict = dict(zip(stations, combined_stats)) # create a dict of station name, dataframe pairs"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "3a8c44d7-d040-43c5-83a7-7d4b6e99d014",
 "metadata": {},
-"source": []
+"source": [
+"To check that this worked, let's look at the summary data for Oxford:"
+]
 },
 {
 "cell_type": "code",
@@ -176,7 +234,7 @@
 "source": [
 "### using built-in functions for descriptive statistics\n",
 "\n",
-"This is helpful, but sometimes we want to calculate other descriptive statistics, or use the values of descriptive statistics in our code. `pandas` has a number of built-in functions for this - we have already seen `.mean()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)), for calculating the arithmetic mean of a **DataFrame**:"
+"This is helpful, but sometimes we want to calculate other descriptive statistics, or use the values of descriptive statistics in our code. `pandas` has a number of built-in functions for this - we have already seen `.mean()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)), for calculating the arithmetic mean of each column of a **DataFrame**:"
 ]
 },
 {
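
A minimal sketch of column-wise `.mean()`, using a small made-up table rather than the notebook's `station_data`:

```python
import pandas as pd

# made-up values for illustration only
df = pd.DataFrame({'rain': [60.0, 80.0, 100.0], 'tmax': [11.0, 13.0, 15.0]})
df.mean(numeric_only=True)  # rain 80.0, tmax 13.0 - one mean per numeric column
```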
@@ -212,7 +270,7 @@
 "id": "e87f5186-0c0e-4949-b88f-065b0a135217",
 "metadata": {},
 "source": [
-"`.median()`"
+"We can calculate the median of the columns of a **DataFrame** (or a **Series**) using `.median()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)):"
 ]
 },
 {
{
@@ -222,15 +280,15 @@
222280
"metadata": {},
223281
"outputs": [],
224282
"source": [
225-
"station_data.mean(numeric_only=True)"
283+
"station_data.median(numeric_only=True) # calculate the median of each numeric column"
226284
]
227285
},
228286
{
229287
"cell_type": "markdown",
230288
"id": "35dbde86-4d81-4634-9a68-44c421eda65e",
231289
"metadata": {},
232290
"source": [
233-
"`.var()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)):"
291+
"To calculate the variance of the columns of a **DataFrame** (or a **Series**), use `.var()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)):"
234292
]
235293
},
236294
{
@@ -248,7 +306,7 @@
 "id": "c97bcbb4-2c2e-4c36-82a0-3541f354132e",
 "metadata": {},
 "source": [
-"`.std()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)):"
+"and for the standard deviation, `.std()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)):"
 ]
 },
 {
{
@@ -266,7 +324,7 @@
266324
"id": "8ff8c916-0857-41b6-be4f-aa8c87b599d4",
267325
"metadata": {},
268326
"source": [
269-
"`.quantile()` ([documentation]()):"
327+
"`pandas` doesn't have a built-in function for the inter-quartile range (IQR), but we can easily calculate it ourselves using `.quantile()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html)) to calculte the 3rd quantile and the 1st quantile and subtracting the outputs:"
270328
]
271329
},
272330
{
@@ -284,7 +342,7 @@
284342
"id": "4ab70444-f7bd-4651-8a21-51579a7b0195",
285343
"metadata": {},
286344
"source": [
287-
"`.sum()` "
345+
"Finally, we can also calculate the sum of each column of a **DataFrame** (or a **Series**) using `.sum()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html)):"
288346
]
289347
},
290348
{
@@ -302,7 +360,11 @@
302360
"id": "3da222bf-46aa-478f-af6f-da63128411ee",
303361
"metadata": {},
304362
"source": [
305-
"### with .groupby()"
363+
"These are far from the only methods available, but they are some of the most common. For a full list, check the `pandas` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) under **Methods**.\n",
364+
"\n",
365+
"### with .groupby()\n",
366+
"\n",
367+
"As we have seen, the output of `.groupby()` is a special type of **DataFrame**, and it inherits almost all of the methods for calculating summary statistics:"
306368
]
307369
},
308370
{
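
A minimal sketch of calling a summary-statistic method on the grouped object, using a made-up two-station table rather than the notebook's data:

```python
import pandas as pd

df = pd.DataFrame({
    'station': ['armagh', 'armagh', 'oxford', 'oxford'],
    'rain': [60.0, 80.0, 40.0, 50.0],
})  # made-up values
df.groupby('station').mean(numeric_only=True)  # one row per station: armagh 70.0, oxford 45.0
```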
@@ -358,8 +420,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"armagh_rain = selected.loc[selected['station'] == 'armagh', 'rain'].dropna().sample(n=50) # take a sample of 30 rain observations from armagh\n",
-"stornoway_rain = selected.loc[selected['station'] == 'stornoway', 'rain'].dropna().sample(n=50) # take a sample of 30 rain observations from stornoway\n",
+"armagh_rain = selected.loc[selected['station'] == 'armagh', 'rain'].dropna().sample(n=30) # take a sample of 30 rain observations from armagh\n",
+"stornoway_rain = selected.loc[selected['station'] == 'stornoway', 'rain'].dropna().sample(n=30) # take a sample of 30 rain observations from stornoway\n",
 "\n",
 "# test whether stornoway_rain.mean() > armagh_rain.mean() at the 99% confidence level\n",
 "rain_comp = pg.ttest(stornoway_rain, armagh_rain, alternative='greater', confidence=0.99)\n",
