@@ -18,7 +18,7 @@ objectives:
  - "Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame."
  - "Create simple plots."
  keypoints:
- - "Libraries enable us to extend the functionality of Python."
+ - "Libraries enable us to extend the functionality of Python."
  - "Pandas is a popular library for working with data."
  - "A Dataframe is a Pandas data structure that allows one to access data by column (name or index) or row."
  - "Aggregating data using the `groupby()` function enables you to generate useful summaries of data quickly."
@@ -37,7 +37,7 @@ and they can replicate the same analysis.

  To help the lesson run smoothly, let's ensure everyone is in the same directory.
  This should help us avoid path and file name issues. At this time please
- navigate to the workshop directory. If you working in IPython Notebook be sure
+ navigate to the workshop directory. If you are working in IPython Notebook be sure
  that you start your notebook in the workshop directory.

  A quick aside that there are Python libraries like [OS Library][os-lib] that can work with our
@@ -93,7 +93,8 @@ record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
  A library in Python contains a set of tools (called functions) that perform
  tasks on our data. Importing a library is like getting a piece of lab equipment
  out of a storage locker and setting it up on the bench for use in a project.
- Once a library is set up, it can be used or called to perform many tasks.
+ Once a library is set up, it can be used or called to perform the task(s)
+ it was built to do.

  ## Pandas in Python
  One of the best options for working with tabular data in Python is to use the
@@ -124,7 +125,7 @@ time we call a Pandas function.
  # Reading CSV Data Using Pandas

  We will begin by locating and reading our survey data which are in CSV format. CSV stands for
- Comma-Separated Values and is a common way store formatted data. Other symbols may also be used, so
+ Comma-Separated Values and is a common way to store formatted data. Other symbols may also be used, so
  you might see tab-separated, colon-separated or space separated files. It is quite easy to replace
  one separator with another, to match your application. The first line in the file often has headers
  to explain what is in each column. CSV (and other separators) make it easy to share data, and can be
@@ -486,8 +487,8 @@ summary stats.
  >
  > 1. How many recorded individuals are female `F` and how many male `M`?
  > 2. What happens when you group by two columns using the following syntax and
- > then grab mean values?
- > - `grouped_data2 = surveys_df.groupby(['plot_id','sex'])`
+ > then calculate mean values?
+ > - `grouped_data2 = surveys_df.groupby(['plot_id', 'sex'])`
  > - `grouped_data2.mean()`
  > 3. Summarize weight values for each site in your data. HINT: you can use the
  > following syntax to only create summary statistics for one column in your data.
@@ -536,7 +537,7 @@ surveys_df.groupby('species_id')['record_id'].count()['DO']
  > ## Challenge - Make a list
  >
  > What's another way to create a list of species and associated `count` of the
- > records in the data? Hint: you can perform `count`, `min`, etc functions on
+ > records in the data? Hint: you can perform `count`, `min`, etc. functions on
  > groupby DataFrames in the same way you can perform them on regular DataFrames.
  {: .challenge}

@@ -589,13 +590,13 @@ total_count.plot(kind='bar');
  > being sex. The plot should show total weight by sex for each site. Some
  > tips are below to help you solve this challenge:
  >
- > * For more on Pandas plots, visit this [link][pandas-plot].
+ > * For more information on pandas plots, see [pandas' documentation page on visualization][pandas-plot].
  > * You can use the code that follows to create a stacked bar plot but the data to stack
  > need to be in individual columns. Here's a simple example with some data where
  > 'a', 'b', and 'c' are the groups, and 'one' and 'two' are the subgroups.
  >
  > ~~~
- > d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
+ > d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
  > pd.DataFrame(d)
  > ~~~
  > {: .language-python}
@@ -616,7 +617,7 @@ total_count.plot(kind='bar');
  > ~~~
  > # Plot stacked data so columns 'one' and 'two' are stacked
  > my_df = pd.DataFrame(d)
- > my_df.plot(kind='bar',stacked=True,title="The title of my graph")
+ > my_df.plot(kind='bar', stacked=True, title="The title of my graph")
  > ~~~
  > {: .language-python}
  >
@@ -635,7 +636,7 @@ total_count.plot(kind='bar');
  >> First we group data by site and by sex, and then calculate a total for each site.
  >>
  >> ~~~
- >> by_site_sex = surveys_df.groupby(['plot_id','sex'])
+ >> by_site_sex = surveys_df.groupby(['plot_id', 'sex'])
  >> site_sex_count = by_site_sex['weight'].sum()
  >> ~~~
  >> {: .language-python}
@@ -660,7 +661,7 @@ total_count.plot(kind='bar');
  >> Below we'll use `.unstack()` on our grouped data to figure out the total weight that each sex contributed to each site.
  >>
  >> ~~~
- >> by_site_sex = surveys_df.groupby(['plot_id','sex'])
+ >> by_site_sex = surveys_df.groupby(['plot_id', 'sex'])
  >> site_sex_count = by_site_sex['weight'].sum()
  >> site_sex_count.unstack()
  >> ~~~
@@ -684,10 +685,10 @@ total_count.plot(kind='bar');
  >> Rather than display it as a table, we can plot the above data by stacking the values of each sex as follows:
  >>
  >> ~~~
- >> by_site_sex = surveys_df.groupby(['plot_id','sex'])
+ >> by_site_sex = surveys_df.groupby(['plot_id', 'sex'])
  >> site_sex_count = by_site_sex['weight'].sum()
  >> spc = site_sex_count.unstack()
- >> s_plot = spc.plot(kind='bar',stacked=True,title="Total weight by site and sex")
+ >> s_plot = spc.plot(kind='bar', stacked=True, title="Total weight by site and sex")
  >> s_plot.set_ylabel("Weight")
  >> s_plot.set_xlabel("Plot")
  >> ~~~