|
110 | 110 | "id": "573e3156-8039-41bf-a919-295316648c8b",
|
111 | 111 | "metadata": {},
|
112 | 112 | "source": [
|
113 |
| - "on its own, " |
| 113 | + "On its own, this output isn't all that readable - the summary statistics are put into individual columns, which means that the table is very wide. Let's see how we can re-arrange this so that we have a `dict()` of **DataFrames**, one for each station. First, we'll assign the output of `.describe()` to a variable, `group_summary`:" |
114 | 114 | ]
|
115 | 115 | },
|
116 | 116 | {
|
|
123 | 123 | "group_summary = station_data.groupby('station').describe()"
|
124 | 124 | ]
|
125 | 125 | },
|
| 126 | + { |
| 127 | + "cell_type": "markdown", |
| 128 | + "id": "513dceb6-9d93-4f99-bba0-88eb94c4ff6c", |
| 129 | + "metadata": {}, |
| 130 | + "source": [ |
| 131 | + "Next, we'll iterate over the stations to work with each row of the table in turn. First, though, let's look at how the column names are organized:" |
| 132 | + ] |
| 133 | + }, |
126 | 134 | {
|
127 | 135 | "cell_type": "code",
|
128 | 136 | "execution_count": null,
|
129 |
| - "id": "d48684a5-f474-460f-bf06-a50c64c9b20a", |
| 137 | + "id": "785b5b9c-1999-4002-bfce-4ed298dd814a", |
130 | 138 | "metadata": {},
|
131 | 139 | "outputs": [],
|
132 |
| - "source": [] |
| 140 | + "source": [ |
| 141 | + "group_summary.columns # show the column names" |
| 142 | + ] |
| 143 | + }, |
| 144 | + { |
| 145 | + "cell_type": "markdown", |
| 146 | + "id": "6b36eedd-cadb-4fd2-b64e-f97975760769", |
| 147 | + "metadata": {}, |
| 148 | + "source": [ |
| 149 | + "This is an example of a **MultiIndex** ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html)) - a multi-level index object, similar to what we have seen previously for rows. Before beginning the `for` loop below, we use `columns.unique()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.Index.unique.html)) to get the unique first-level names from the columns (i.e., the variable names from the original **DataFrame**). \n", |
| 150 | + "\n", |
| 151 | + "Inside of the `for` loop, we first select the row corresponding to each station using `.loc`. Have a look at this line:\n", |
| 152 | + "\n", |
| 153 | + "```python\n", |
| 154 | + "reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1)\n", |
| 155 | + "\n", |
| 156 | + "```\n", |
| 157 | + "\n", |
| 158 | + "This uses something called [list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) to quickly create a list of objects. It is effectively the same as writing something like:\n", |
| 159 | + "\n", |
| 160 | + "```python\n", |
| 161 | + "out_list = []\n", |
| 162 | + "for ind in columns:\n", |
| 163 | + " out_list.append(this_summary[ind])\n", |
| 164 | + "\n", |
| 165 | + "reshaped = pd.concat(out_list, axis=1)\n", |
| 166 | + "\n", |
| 167 | + "```\n", |
| 168 | + "\n", |
| 169 | + "Using list comprehension helps make the code more concise and readable - it's a very handy tool for creating lists with iteration. In addition to list comprehension, python also has [dict comprehension](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) - we won't use this here, but it works in a very similar way to list comprehension.\n", |
| 170 | + "\n", |
| 171 | + "Once we have reshaped the row (the **Series**) into a **DataFrame**, we assign the column names, before using `.append()` to add the reshaped table to a **list**:" |
| 172 | + ] |
133 | 173 | },
|
134 | 174 | {
|
135 | 175 | "cell_type": "code",
|
|
139 | 179 | "outputs": [],
|
140 | 180 | "source": [
|
141 | 181 | "stations = group_summary.index.unique() # get the unique values of station from the table\n",
|
| 182 | + "columns = group_summary.columns.unique(level=0) # get the unique names of the columns from the first level (level 0)\n", |
142 | 183 | "\n",
|
143 | 184 | "combined_stats = [] # initialize an empty list\n",
|
144 | 185 | "\n",
|
145 | 186 | "for station in stations:\n",
|
146 | 187 | " this_summary = group_summary.loc[station] # get the row corresponding to this station\n",
|
147 |
| - " columns = this_summary.index.unique(level=0) # get the unique variable names from the multi-index\n", |
148 | 188 | " \n",
|
149 | 189 | " reshaped = pd.concat([this_summary[ind] for ind in columns], axis=1) # use list comprehension to reshape the table\n",
|
150 | 190 | " reshaped.columns = columns # set the column names\n",
|
151 |
| - " combined_stats.append(reshaped) # add the reshaped table to the list\n", |
152 |
| - "\n", |
| 191 | + " combined_stats.append(reshaped) # add the reshaped table to the list" |
| 192 | + ] |
| 193 | + }, |
| 194 | + { |
| 195 | + "cell_type": "markdown", |
| 196 | + "id": "815551ca-e1f6-4981-a473-768cd2abbf96", |
| 197 | + "metadata": {}, |
| 198 | + "source": [ |
| 199 | + "Finally, we'll use the built-in function `zip()` to get pairs of station names (from `station`) and **DataFrame**s (from `combined_stats`), then pass this to `dict()` to create a dictionary of station name/**DataFrame** key/value pairs:" |
| 200 | + ] |
| 201 | + }, |
| 202 | + { |
| 203 | + "cell_type": "code", |
| 204 | + "execution_count": null, |
| 205 | + "id": "77ccd5f4-bc76-4eeb-bb97-7a2e43f28991", |
| 206 | + "metadata": {}, |
| 207 | + "outputs": [], |
| 208 | + "source": [ |
153 | 209 | "summary_dict = dict(zip(stations, combined_stats)) # create a dict of station name, dataframe pairs"
|
154 | 210 | ]
|
155 | 211 | },
|
156 | 212 | {
|
157 | 213 | "cell_type": "markdown",
|
158 | 214 | "id": "3a8c44d7-d040-43c5-83a7-7d4b6e99d014",
|
159 | 215 | "metadata": {},
|
160 |
| - "source": [] |
| 216 | + "source": [ |
| 217 | + "To check that this worked, let's look at the summary data for Oxford:" |
| 218 | + ] |
161 | 219 | },
|
162 | 220 | {
|
163 | 221 | "cell_type": "code",
|
|
176 | 234 | "source": [
|
177 | 235 | "### using built-in functions for descriptive statistics\n",
|
178 | 236 | "\n",
|
179 |
| - "This is helpful, but sometimes we want to calculate other descriptive statistics, or use the values of descriptive statistics in our code. `pandas` has a number of built-in functions for this - we have already seen `.mean()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)), for calculating the arithmetic mean of a **DataFrame**:" |
| 237 | + "This is helpful, but sometimes we want to calculate other descriptive statistics, or use the values of descriptive statistics in our code. `pandas` has a number of built-in functions for this - we have already seen `.mean()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html)), for calculating the arithmetic mean of each column of a **DataFrame**:" |
180 | 238 | ]
|
181 | 239 | },
|
182 | 240 | {
|
|
212 | 270 | "id": "e87f5186-0c0e-4949-b88f-065b0a135217",
|
213 | 271 | "metadata": {},
|
214 | 272 | "source": [
|
215 |
| - "`.median()`" |
| 273 | + "We can calculate the median of the columns of a **DataFrame** (or a **Series**) using `.median()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html)): " |
216 | 274 | ]
|
217 | 275 | },
|
218 | 276 | {
|
|
222 | 280 | "metadata": {},
|
223 | 281 | "outputs": [],
|
224 | 282 | "source": [
|
225 |
| - "station_data.mean(numeric_only=True)" |
| 283 | + "station_data.median(numeric_only=True) # calculate the median of each numeric column" |
226 | 284 | ]
|
227 | 285 | },
|
228 | 286 | {
|
229 | 287 | "cell_type": "markdown",
|
230 | 288 | "id": "35dbde86-4d81-4634-9a68-44c421eda65e",
|
231 | 289 | "metadata": {},
|
232 | 290 | "source": [
|
233 |
| - "`.var()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)):" |
| 291 | + "To calculate the variance of the columns of a **DataFrame** (or a **Series**), use `.var()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.var.html)):" |
234 | 292 | ]
|
235 | 293 | },
|
236 | 294 | {
|
|
248 | 306 | "id": "c97bcbb4-2c2e-4c36-82a0-3541f354132e",
|
249 | 307 | "metadata": {},
|
250 | 308 | "source": [
|
251 |
| - "`.std()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)):" |
| 309 | + "and for the standard deviation, `.std()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html)):" |
252 | 310 | ]
|
253 | 311 | },
|
254 | 312 | {
|
|
266 | 324 | "id": "8ff8c916-0857-41b6-be4f-aa8c87b599d4",
|
267 | 325 | "metadata": {},
|
268 | 326 | "source": [
|
269 |
| - "`.quantile()` ([documentation]()):" |
| 327 | + "`pandas` doesn't have a built-in function for the inter-quartile range (IQR), but we can easily calculate it ourselves using `.quantile()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html)) to calculte the 3rd quantile and the 1st quantile and subtracting the outputs:" |
270 | 328 | ]
|
271 | 329 | },
|
272 | 330 | {
|
|
284 | 342 | "id": "4ab70444-f7bd-4651-8a21-51579a7b0195",
|
285 | 343 | "metadata": {},
|
286 | 344 | "source": [
|
287 |
| - "`.sum()` " |
| 345 | + "Finally, we can also calculate the sum of each column of a **DataFrame** (or a **Series**) using `.sum()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html)):" |
288 | 346 | ]
|
289 | 347 | },
|
290 | 348 | {
|
|
302 | 360 | "id": "3da222bf-46aa-478f-af6f-da63128411ee",
|
303 | 361 | "metadata": {},
|
304 | 362 | "source": [
|
305 |
| - "### with .groupby()" |
| 363 | + "These are far from the only methods available, but they are some of the most common. For a full list, check the `pandas` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) under **Methods**.\n", |
| 364 | + "\n", |
| 365 | + "### with .groupby()\n", |
| 366 | + "\n", |
| 367 | + "As we have seen, the output of `.groupby()` is a special type of **DataFrame**, and it inherits almost all of the methods for calculating summary statistics:" |
306 | 368 | ]
|
307 | 369 | },
|
308 | 370 | {
|
|
358 | 420 | "metadata": {},
|
359 | 421 | "outputs": [],
|
360 | 422 | "source": [
|
361 |
| - "armagh_rain = selected.loc[selected['station'] == 'armagh', 'rain'].dropna().sample(n=50) # take a sample of 30 rain observations from armagh\n", |
362 |
| - "stornoway_rain = selected.loc[selected['station'] == 'stornoway', 'rain'].dropna().sample(n=50) # take a sample of 30 rain observations from stornoway\n", |
| 423 | + "armagh_rain = selected.loc[selected['station'] == 'armagh', 'rain'].dropna().sample(n=30) # take a sample of 30 rain observations from armagh\n", |
| 424 | + "stornoway_rain = selected.loc[selected['station'] == 'stornoway', 'rain'].dropna().sample(n=30) # take a sample of 30 rain observations from stornoway\n", |
363 | 425 | "\n",
|
364 | 426 | "# test whether stornoway_rain.mean() > armagh_rain.mean() at the 99% confidence level\n",
|
365 | 427 | "rain_comp = pg.ttest(stornoway_rain, armagh_rain, alternative='greater', confidence=0.99)\n",
|
|