@@ -149,11 +149,42 @@ new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])

### Challenge - Combine Data

- In the data folder, there are two survey data files: `surveys2001.csv` and
- `surveys2002.csv`. Read the data into pandas and combine the files to make one
- new DataFrame. Create a plot of average plot weight by year grouped by sex.
+ In the data folder, there is another folder called `yearly_files`
+ that contains survey data broken down into individual files by year.
+ Read the data from two of these files,
+ `surveys2001.csv` and `surveys2002.csv`,
+ into pandas and combine the files to make one new DataFrame.
+ Create a plot of average plot weight by year grouped by sex.
Export your results as a CSV and make sure it reads back into pandas properly.
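
One way to check that an exported table reads back properly is to compare the reloaded DataFrame with the original. The sketch below uses a small made-up table (the values and the `F`/`M` column names are illustrative, not taken from the survey files) and an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

# Hypothetical average weights by year and sex (made-up numbers)
weights = pd.DataFrame(
    {"F": [36.5, 34.0], "M": [38.25, 37.5]},
    index=pd.Index([2001, 2002], name="year"),
)

# Write to an in-memory CSV buffer (stands in for a file on disk)
buffer = io.StringIO()
weights.to_csv(buffer)
buffer.seek(0)

# Read it back, restoring the first column as the index
restored = pd.read_csv(buffer, index_col=0)

# Compare labels and values of the original and the reloaded table
round_trip_ok = weights.equals(restored)
print(round_trip_ok)
```

Without `index_col=0`, the `year` index would come back as an ordinary column, so the reloaded table would no longer match the original.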

+ ::::::::::::::::::::::: solution
+
+ ```python
+ # read the files:
+ survey2001 = pd.read_csv("data/yearly_files/surveys2001.csv")
+ survey2002 = pd.read_csv("data/yearly_files/surveys2002.csv")
+ # concatenate
+ survey_all = pd.concat([survey2001, survey2002], axis=0)
+ # get the weight for each year, grouped by sex
+ # (select "wgt" before averaging so non-numeric columns are not aggregated):
+ weight_year = survey_all.groupby(['year', 'sex'])["wgt"].mean().unstack()
+ # plot:
+ weight_year.plot(kind="bar")
+ plt.tight_layout()  # tip: use this to improve the plot layout.
+ # Try running the code without this line to see
+ # what difference applying plt.tight_layout() makes.
+ ```
+
+ ![](fig/04_chall_weight_year.png){alt='average weight for each year, grouped by sex'}
+
+ ```python
+ # writing to file:
+ weight_year.to_csv("weight_for_year.csv")
+ # reading it back in:
+ pd.read_csv("weight_for_year.csv", index_col=0)
+ ```
+
+ ::::::::::::::::::::::::::::::::
+

::::::::::::::::::::::::::::::::::::::::::::::::::

@@ -425,10 +456,88 @@ Create a new DataFrame by joining the contents of the `surveys.csv` and

1. taxa by plot
2. taxa by sex by plot
+
+ ::::::::::::::::::::::: solution
+
+ ```python
+ merged_left = pd.merge(left=surveys_df, right=species_df, how='left', on="species_id")
+ ```
+
+ 1. taxa per plot (number of species of each taxa per plot):
+
+ ```python
+ merged_left.groupby(["plot_id"])["taxa"].nunique().plot(kind='bar')
+ ```
+
+ ![](fig/04_chall_ntaxa_per_site.png){alt='taxa per plot'}
+
+ *Suggestion*: It is also possible to plot the number of individuals for each taxa in each plot
+ (stacked bar chart):

+ ```python
+ merged_left.groupby(["plot_id", "taxa"]).count()["record_id"].unstack().plot(kind='bar', stacked=True)
+ plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.05))  # stop the legend from overlapping with the bar plot
+ ```
+
+ ![](fig/04_chall_taxa_per_site.png){alt='taxa per plot'}
+
+ 2. taxa by sex by plot:
+ First replace the NaN values in `sex` with `M|F` (these could also be changed to 'x'):
+
+ ```python
+ merged_left.loc[merged_left["sex"].isnull(), "sex"] = 'M|F'
+ ntaxa_sex_site = merged_left.groupby(["plot_id", "sex"])["taxa"].nunique().reset_index(level=1)
+ ntaxa_sex_site = ntaxa_sex_site.pivot_table(values="taxa", columns="sex", index=ntaxa_sex_site.index)
+ ntaxa_sex_site.plot(kind="bar", legend=False, stacked=True)
+ plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.08),
+            fontsize='small', frameon=False)
+ ```
+
+ ![](fig/04_chall_ntaxa_per_site_sex.png){alt='taxa per plot per sex'}
+
+ ::::::::::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::

+ ::::::::::::::::::::::: instructor
+
+ ## Suggestion (for discussion only)
+
+ The number of individuals for each taxa in each plot per sex can be derived as well.
+
+ ```python
+ sex_taxa_site = merged_left.groupby(["plot_id", "taxa", "sex"]).count()['record_id']
+ sex_taxa_site.unstack(level=[1, 2]).plot(kind='bar', logy=True)
+ plt.legend(loc='upper center', ncol=3, bbox_to_anchor=(0.5, 1.15),
+            fontsize='small', frameon=False)
+ ```
+
+ ![](fig/04_chall_sex_taxa_site_intro.png){alt='taxa per plot per sex'}
+
+ This is not the best plot choice, as it is not easily readable.
+ A first option to improve it is to make facets.
+ However, pandas/matplotlib do not provide this by default.
+ As a pure matplotlib example (`M|F` is used for records with no defined sex):
+
+ ```python
+ fig, axs = plt.subplots(3, 1)
+ for sex, ax in zip(["M", "F", "M|F"], axs):
+     # sex_taxa_site is a Series with a (plot_id, taxa, sex) MultiIndex:
+     # select one sex, then move taxa into columns before plotting
+     sex_taxa_site.xs(sex, level="sex").unstack(level="taxa").plot(kind='bar', ax=ax, legend=False)
+     ax.set_ylabel(sex)
+     if ax is not axs[-1]:  # hide x ticks and label on all but the last row
+         ax.set_xticks([])
+         ax.set_xlabel("")
+ axs[0].legend(loc='upper center', ncol=5, bbox_to_anchor=(0.5, 1.3),
+               fontsize='small', frameon=False)
+ ```
+
+ ![](fig/04_chall_sex_taxa_site.png){alt='taxa per plot per sex'}
+
+ However, it would be better to link to [Seaborn][seaborn]
+ and [Altair][altair] for this kind of multivariate visualisation.
+
+ ::::::::::::::::::::::::::::::::::
+
::::::::::::::::::::::::::::::::::::::: challenge

### Challenge - Diversity Index
@@ -441,17 +550,46 @@ Create a new DataFrame by joining the contents of the `surveys.csv` and
plots. The index should consider both species abundance and number of
species. You might choose to use the simple [biodiversity index described
here](https://www.amnh.org/explore/curriculum-collections/biodiversity-counts/plant-ecology/how-to-calculate-a-biodiversity-index)
- which calculates diversity as:
+ which calculates diversity as: the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
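
As a quick illustration of that formula, the sketch below computes the index for a small made-up table. The column names `plot_id` and `species_id` mirror the survey data, but the records themselves are invented for the example:

```python
import pandas as pd

# Made-up records: one row per captured individual
records = pd.DataFrame({
    "plot_id": [1, 1, 1, 1, 2, 2],
    "species_id": ["DM", "DM", "PF", "NL", "DM", "DM"],
})

by_plot = records.groupby("plot_id")["species_id"]
# number of species / number of individuals, per plot
toy_index = by_plot.nunique() / by_plot.count()
print(toy_index)
```

Here plot 1 has 3 species among 4 individuals (0.75), while plot 2 has 1 species among 2 individuals (0.5).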

- the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
+ ::::::::::::::::::::::: solution
+
+ 1.
+ ```python
+ plot_info = pd.read_csv("data/plots.csv")
+ plot_info.groupby("plot_type").count()
+ ```
+
+ 2.
+ ```python
+ merged_site_type = pd.merge(merged_left, plot_info, on='plot_id')
+ # For each plot, get the number of species
+ nspecies_site = merged_site_type.groupby(["plot_id"])["species"].nunique().rename("nspecies")
+ # For each plot, get the number of individuals
+ nindividuals_site = merged_site_type.groupby(["plot_id"]).count()['record_id'].rename("nindiv")
+ # combine the two series
+ diversity_index = pd.concat([nspecies_site, nindividuals_site], axis=1)
+ # calculate the diversity index
+ diversity_index['diversity'] = diversity_index['nspecies'] / diversity_index['nindiv']
+ ```

+ Making a bar chart from this diversity index:
+
+ ```python
+ diversity_index['diversity'].plot(kind="barh")
+ plt.xlabel("Diversity index")
+ ```

- ::::::::::::::::::::::::::::::::::::::::::::::::::
+ ![](fig/04_chall_diversity_index.png){alt='horizontal bar chart of diversity index by plot'}

+ ::::::::::::::::::::::::::::::::

+ ::::::::::::::::::::::::::::::::::::::::::::::::::

- [join-types]: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

+ [altair]: https://github.com/ellisonbg/altair
+ [join-types]: https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
+ [seaborn]: https://stanford.edu/~mwaskom/software/seaborn

:::::::::::::::::::::::::::::::::::::::: keypoints
