Commit 194b52a

Merge branch 'main' into inline-instructor-notes
2 parents 8bd0176 + 1021a9c commit 194b52a

2 files changed, +145 -164 lines changed

episodes/05-merging-data.md

+145 -7
Original file line number | Diff line number | Diff line change
@@ -149,11 +149,42 @@ new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])
149149

150150
### Challenge - Combine Data
151151

152-
In the data folder, there are two survey data files: `surveys2001.csv` and
153-
`surveys2002.csv`. Read the data into pandas and combine the files to make one
154-
new DataFrame. Create a plot of average plot weight by year grouped by sex.
152+
In the data folder, there is another folder called `yearly_files`
153+
that contains survey data broken down into individual files by year.
154+
Read the data from two of these files,
155+
`surveys2001.csv` and `surveys2002.csv`,
156+
into pandas and combine the files to make one new DataFrame.
157+
Create a plot of average plot weight by year grouped by sex.
155158
Export your results as a CSV and make sure it reads back into pandas properly.
156159

160+
::::::::::::::::::::::: solution
161+
162+
```python
163+
# read the files:
164+
survey2001 = pd.read_csv("data/yearly_files/surveys2001.csv")
165+
survey2002 = pd.read_csv("data/yearly_files/surveys2002.csv")
166+
# concatenate
167+
survey_all = pd.concat([survey2001, survey2002], axis=0)
168+
# get the weight for each year, grouped by sex:
169+
weight_year = survey_all.groupby(['year', 'sex'])["wgt"].mean().unstack()
170+
# plot:
171+
weight_year.plot(kind="bar")
172+
plt.tight_layout() # tip: use this to improve the plot layout.
173+
# Try running the code without this line to see
174+
# what difference applying plt.tight_layout() makes.
175+
```
176+
177+
![](fig/04_chall_weight_year.png){alt='average weight for each year, grouped by sex'}
178+
179+
```python
180+
# writing to file:
181+
weight_year.to_csv("weight_for_year.csv")
182+
# reading it back in:
183+
pd.read_csv("weight_for_year.csv", index_col=0)
184+
```
185+
186+
::::::::::::::::::::::::::::::::
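
The round trip in the solution can also be checked programmatically rather than by eye. The sketch below is only an illustration: it uses two tiny made-up tables in place of `surveys2001.csv` and `surveys2002.csv`, writes to a temporary directory, and keeps the `wgt` column name used in the solution above. The values, the temporary path, and the `round_trip_ok` name are assumptions, not part of the lesson data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the two yearly files (values are made up).
survey2001 = pd.DataFrame({
    "year": [2001, 2001, 2001],
    "sex": ["F", "M", "M"],
    "wgt": [40.0, 50.0, 60.0],
})
survey2002 = pd.DataFrame({
    "year": [2002, 2002],
    "sex": ["F", "M"],
    "wgt": [42.0, 48.0],
})

# Stack the rows; ignore_index avoids duplicate index labels.
survey_all = pd.concat([survey2001, survey2002], axis=0, ignore_index=True)

# Mean weight per year (rows) and sex (columns).
weight_year = survey_all.groupby(["year", "sex"])["wgt"].mean().unstack()

# Write to a temporary CSV and read it back with the year as the index.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "weight_for_year.csv")
    weight_year.to_csv(csv_path)
    reloaded = pd.read_csv(csv_path, index_col=0)

# Compare the values before and after the round trip.
round_trip_ok = (reloaded.values == weight_year.values).all()
print(round_trip_ok)
```

Only the values and labels are compared here: the name of the column axis (`sex`) is not stored in the CSV, so a stricter whole-frame comparison would need that name restored first.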
187+
157188

158189
::::::::::::::::::::::::::::::::::::::::::::::::::
159190

@@ -425,10 +456,88 @@ Create a new DataFrame by joining the contents of the `surveys.csv` and
425456

426457
1. taxa by plot
427458
2. taxa by sex by plot
459+
460+
::::::::::::::::::::::: solution
461+
462+
```python
463+
merged_left = pd.merge(left=surveys_df,right=species_df, how='left', on="species_id")
464+
```
465+
466+
1. taxa per plot (number of distinct taxa recorded in each plot):
467+
468+
```python
469+
merged_left.groupby(["plot_id"])["taxa"].nunique().plot(kind='bar')
470+
```
471+
472+
![](fig/04_chall_ntaxa_per_site.png){alt='taxa per plot'}
473+
474+
*Suggestion*: It is also possible to plot the number of individuals for each taxa in each plot
475+
(stacked bar chart):
428476

477+
```python
478+
merged_left.groupby(["plot_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
479+
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05)) # stop the legend from overlapping with the bar plot
480+
```
481+
482+
![](fig/04_chall_taxa_per_site.png){alt='taxa per plot'}
483+
484+
2. taxa by sex by plot:
485+
Replace the NaN values in `sex` with `M|F` (they could equally be set to 'x'):
486+
487+
```python
488+
merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'
489+
ntaxa_sex_site = merged_left.groupby(["plot_id", "sex"])["taxa"].nunique().reset_index(level=1)
490+
ntaxa_sex_site = ntaxa_sex_site.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_site.index)
491+
ntaxa_sex_site.plot(kind="bar", legend=False, stacked=True)
492+
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
493+
fontsize='small', frameon=False)
494+
```
495+
496+
![](fig/04_chall_ntaxa_per_site_sex.png){alt='taxa per plot per sex'}
497+
498+
::::::::::::::::::::::::::::::::
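
It may help to see what the left join in this solution keeps. The sketch below uses miniature, made-up versions of `surveys_df` and `species_df` (the rows and the unmatched `"ZZ"` code are assumptions for illustration) and passes `indicator=True` to `pd.merge` so that records without a matching species can be identified.

```python
import pandas as pd

# Hypothetical miniature versions of the surveys and species tables.
surveys_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "plot_id": [1, 1, 2, 2],
    "species_id": ["DM", "PF", "DM", "ZZ"],  # "ZZ" has no match in species_df
})
species_df = pd.DataFrame({
    "species_id": ["DM", "PF"],
    "taxa": ["Rodent", "Rodent"],
})

# Left join: keep every survey record, matched or not.
merged_left = pd.merge(left=surveys_df, right=species_df, how="left",
                       on="species_id", indicator=True)

# Rows found only in the left table carry no species information (taxa is NaN).
unmatched = merged_left[merged_left["_merge"] == "left_only"]
print(len(merged_left), len(unmatched))
```

Because `nunique()` skips missing values by default, such unmatched records do not add to the taxa counts plotted above.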
429499

430500
::::::::::::::::::::::::::::::::::::::::::::::::::
431501

502+
::::::::::::::::::::::: instructor
503+
504+
## Suggestion (for discussion only)
505+
506+
The number of individuals for each taxa in each plot per sex can be derived as well.
507+
508+
```python
509+
sex_taxa_site = merged_left.groupby(["plot_id", "taxa", "sex"]).count()['record_id']
510+
sex_taxa_site.unstack(level=[1, 2]).plot(kind='bar', logy=True)
511+
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15),
512+
fontsize='small', frameon=False)
513+
```
514+
515+
![](fig/04_chall_sex_taxa_site_intro.png){alt='taxa per plot per sex'}
516+
517+
This is not the best plot choice: for one thing, it is not easily readable.
518+
A first option to improve it is to use facets.
519+
However, pandas/matplotlib do not provide this by default.
520+
As a pure matplotlib example (`M|F` is for records with undefined sex):
521+
522+
```python
523+
fig, axs = plt.subplots(3, 1)
524+
for sex, ax in zip(["M", "F", "M|F"], axs):
525+
    sex_taxa_site.xs(sex, level="sex").unstack(level="taxa").plot(kind='bar', ax=ax, legend=False)  # sex is an index level, not a column
526+
    ax.set_ylabel(sex)
527+
    if not ax.get_subplotspec().is_last_row():
528+
        ax.set_xticks([])
529+
        ax.set_xlabel("")
530+
axs[0].legend(loc='upper center', ncol=5, bbox_to_anchor=(0.5, 1.3),
531+
fontsize='small', frameon=False)
532+
```
533+
534+
![](fig/04_chall_sex_taxa_site.png){alt='taxa per plot per sex'}
535+
536+
However, it would be better to link to [Seaborn][seaborn]
537+
and [Altair][altair] for this kind of multivariate visualisation.
538+
539+
::::::::::::::::::::::::::::::::::
540+
432541
::::::::::::::::::::::::::::::::::::::: challenge
433542

434543
### Challenge - Diversity Index
@@ -441,17 +550,46 @@ Create a new DataFrame by joining the contents of the `surveys.csv` and
441550
plots. The index should consider both species abundance and number of
442551
species. You might choose to use the simple [biodiversity index described
443552
here](https://www.amnh.org/explore/curriculum-collections/biodiversity-counts/plant-ecology/how-to-calculate-a-biodiversity-index)
444-
which calculates diversity as:
553+
which calculates diversity as: the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
445554

446-
the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
555+
::::::::::::::::::::::: solution
556+
557+
1.
558+
```python
559+
plot_info = pd.read_csv("data/plots.csv")
560+
plot_info.groupby("plot_type").count()
561+
```
562+
563+
2.
564+
```python
565+
merged_site_type = pd.merge(merged_left, plot_info, on='plot_id')
566+
# For each plot, get the number of species
567+
nspecies_site = merged_site_type.groupby(["plot_id"])["species"].nunique().rename("nspecies")
568+
# For each plot, get the number of individuals
569+
nindividuals_site = merged_site_type.groupby(["plot_id"]).count()['record_id'].rename("nindiv")
570+
# combine the two series
571+
diversity_index = pd.concat([nspecies_site, nindividuals_site], axis=1)
572+
# calculate the diversity index
573+
diversity_index['diversity'] = diversity_index['nspecies']/diversity_index['nindiv']
574+
```
447575

576+
Making a bar chart from this diversity index:
577+
578+
```python
579+
diversity_index['diversity'].plot(kind="barh")
580+
plt.xlabel("Diversity index")
581+
```
448582

449-
::::::::::::::::::::::::::::::::::::::::::::::::::
583+
![](fig/04_chall_diversity_index.png){alt='horizontal bar chart of diversity index by plot'}
450584

585+
::::::::::::::::::::::::::::::::
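
The solution computes the index per plot, while the challenge asks about control versus rodent exclosure plots. One possible way to compare them is to pool the records by plot type, as sketched below. This is only an illustration: the table is a made-up stand-in for `merged_site_type`, the `plot_type` labels are assumed, and pooling all records of a type (rather than averaging the per-plot indices) is a design choice, not the lesson's stated method.

```python
import pandas as pd

# Hypothetical stand-in for merged_site_type (values are made up).
merged_site_type = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "plot_id": [1, 1, 1, 2, 2, 2],
    "species": ["merriami", "merriami", "flavus", "merriami", "flavus", "ordii"],
    "plot_type": ["Control"] * 3 + ["Rodent Exclosure"] * 3,
})

# Per plot type: number of distinct species and number of individuals.
by_type = merged_site_type.groupby("plot_type").agg(
    nspecies=("species", "nunique"),
    nindiv=("record_id", "count"),
)

# Same simple index as in the solution: species count / individual count.
by_type["diversity"] = by_type["nspecies"] / by_type["nindiv"]
print(by_type)
```

Averaging the per-plot `diversity` values within each plot type would be an alternative, and the two approaches can give different answers when plots differ greatly in the number of individuals.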
451586

587+
::::::::::::::::::::::::::::::::::::::::::::::::::
452588

453-
[join-types]: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
454589

590+
[altair]: https://github.com/ellisonbg/altair
591+
[join-types]: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
592+
[seaborn]: https://stanford.edu/~mwaskom/software/seaborn
455593

456594
:::::::::::::::::::::::::::::::::::::::: keypoints
457595

instructors/instructor-notes.md

-157
Original file line number | Diff line number | Diff line change
@@ -2,8 +2,6 @@
22
title: Instructor Notes
33
---
44

5-
# Challenge solutions
6-
75
## Install the required workshop packages
86

97
Please use the instructions in the [Setup][lesson-setup] document to perform installs. If you
@@ -27,159 +25,6 @@ If learners receive an `AssertionError`, it will inform you how to help them cor
2725
installation. Otherwise, it will tell you that the system is good to go and ready for Data
2826
Carpentry!
2927

30-
## 04-data-types-and-format
31-
32-
### Writing Out Data to CSV
33-
34-
If the students have trouble generating the output, or anything happens with that, the folder
35-
[`sample_output`](https://github.com/datacarpentry/python-ecology-lesson/tree/main/sample_output)
36-
in this repository contains the file `surveys_complete.csv` with the data they should generate.
37-
38-
## 05-merging-data
39-
40-
- In the data folder, there are two survey data files: survey2001.csv and survey2002.csv. Read the
41-
data into Python and combine the files to make one new data frame. Create a plot of average plot
42-
weight by year grouped by sex. Export your results as a CSV and make sure it reads back into
43-
Python properly.
44-
45-
```python
46-
# read the files:
47-
survey2001 = pd.read_csv("data/survey2001.csv")
48-
survey2002 = pd.read_csv("data/survey2002.csv")
49-
# concatenate
50-
survey_all = pd.concat([survey2001, survey2002], axis=0)
51-
# get the weight for each year, grouped by sex:
52-
weight_year = survey_all.groupby(['year', 'sex']).mean()["wgt"].unstack()
53-
# plot:
54-
weight_year.plot(kind="bar")
55-
plt.tight_layout() # tip(!)
56-
```
57-
58-
![](fig/04_chall_weight_year.png){alt='average weight for each year, grouped by sex'}
59-
60-
```python
61-
# writing to file:
62-
weight_year.to_csv("weight_for_year.csv")
63-
# reading it back in:
64-
pd.read_csv("weight_for_year.csv", index_col=0)
65-
```
66-
67-
- Create a new DataFrame by joining the contents of the surveys.csv and species.csv tables.
68-
69-
```python
70-
merged_left = pd.merge(left=surveys_df,right=species_df, how='left', on="species_id")
71-
```
72-
73-
Then calculate and plot the distribution of:
74-
75-
**1\. taxa per plot** (number of species of each taxa per plot):
76-
77-
Species distribution (number of taxa for each plot) can be derived as follows:
78-
79-
```python
80-
merged_left.groupby(["plot_id"])["taxa"].nunique().plot(kind='bar')
81-
```
82-
83-
![](fig/04_chall_ntaxa_per_site.png){alt='taxa per plot'}
84-
85-
*Suggestion*: It is also possible to plot the number of individuals for each taxa in each plot
86-
(stacked bar chart):
87-
88-
```python
89-
merged_left.groupby(["plot_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
90-
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05))
91-
```
92-
93-
(the legend otherwise overlaps the bar plot)
94-
95-
![](fig/04_chall_taxa_per_site.png){alt='taxa per plot'}
96-
97-
**2\. taxa by sex by plot**:
98-
Providing the Nan values with the M|F values (can also already be changed to 'x'):
99-
100-
```python
101-
merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'
102-
```
103-
104-
Number of taxa for each plot/sex combination:
105-
106-
```python
107-
ntaxa_sex_site= merged_left.groupby(["plot_id", "sex"])["taxa"].nunique().reset_index(level=1)
108-
ntaxa_sex_site = ntaxa_sex_site.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_site.index)
109-
ntaxa_sex_site.plot(kind="bar", legend=False)
110-
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
111-
fontsize='small', frameon=False)
112-
```
113-
114-
![](fig/04_chall_ntaxa_per_site_sex.png){alt='taxa per plot per sex'}
115-
116-
*Suggestion (for discussion only)*:
117-
118-
The number of individuals for each taxa in each plot per sex can be derived as well.
119-
120-
```python
121-
sex_taxa_site = merged_left.groupby(["plot_id", "taxa", "sex"]).count()['record_id']
122-
sex_taxa_site.unstack(level=[1, 2]).plot(kind='bar', logy=True)
123-
plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15),
124-
fontsize='small', frameon=False)
125-
```
126-
127-
![](fig/04_chall_sex_taxa_site_intro.png){alt='taxa per plot per sex'}
128-
129-
This is not really the best plot choice: not readable,... A first option to make this better, is to
130-
make facets. However, pandas/matplotlib do not provide this by default. Just as a pure matplotlib
131-
example (`M|F` if for not-defined sex records):
132-
133-
```python
134-
fig, axs = plt.subplots(3, 1)
135-
for sex, ax in zip(["M", "F", "M|F"], axs):
136-
sex_taxa_site[sex_taxa_site["sex"] == sex].plot(kind='bar', ax=ax, legend=False)
137-
ax.set_ylabel(sex)
138-
if not ax.is_last_row():
139-
ax.set_xticks([])
140-
ax.set_xlabel("")
141-
axs[0].legend(loc='upper center', ncol=5, bbox_to_anchor=(0.5, 1.3),
142-
fontsize='small', frameon=False)
143-
```
144-
145-
![](fig/04_chall_sex_taxa_site.png){alt='taxa per plot per sex'}
146-
147-
However, it would be better to link to [Seaborn][seaborn] and [Altair][altair] for its kind of
148-
multivariate visualisations.
149-
150-
- In the data folder, there is a plot CSV that contains information about the type associated with
151-
each plot. Use that data to summarize the number of plots by plot type.
152-
153-
```python
154-
plot_info = pd.read_csv("data/plots.csv")
155-
plot_info.groupby("plot_type").count()
156-
```
157-
158-
- Calculate a diversity index of your choice for control vs rodent exclosure plots. The index should
159-
consider both species abundance and number of species. You might choose the simple biodiversity
160-
index described here which calculates diversity as `the number of species in the plot / the total number of individuals in the plot = Biodiversity index.`
161-
162-
```python
163-
merged_site_type = pd.merge(merged_left, plot_info, on='plot_id')
164-
# For each plot, get the number of species for each plot
165-
nspecies_site = merged_site_type.groupby(["plot_id"])["species"].nunique().rename("nspecies")
166-
# For each plot, get the number of individuals
167-
nindividuals_site = merged_site_type.groupby(["plot_id"]).count()['record_id'].rename("nindiv")
168-
# combine the two series
169-
diversity_index = pd.concat([nspecies_site, nindividuals_site], axis=1)
170-
# calculate the diversity index
171-
diversity_index['diversity'] = diversity_index['nspecies']/diversity_index['nindiv']
172-
```
173-
174-
Making a bar chart:
175-
176-
```python
177-
diversity_index['diversity'].plot(kind="barh")
178-
plt.xlabel("Diversity index")
179-
```
180-
181-
![](fig/04_chall_diversity_index.png){alt='taxa per plot per sex'}
182-
18328
## 07-visualization-ggplot-python
18429

18530
iPython notebooks for plotting can be viewed in the `learners` folder.
@@ -219,8 +64,6 @@ plt.show()
21964

22065
[This page][matplotlib-mathtext] contains more information.
22166

222-
[seaborn]: https://stanford.edu/~mwaskom/software/seaborn
223-
[altair]: https://github.com/ellisonbg/altair
22467
[matplotlib-mathtext]: https://matplotlib.org/users/mathtext.html
22568

22669
