Commit 784512d

Assorted fixes and improvements (datacarpentry#435)
* Remove duplicated words
* Insert missing particle
* Specify tasks of a library
* Fix typos
* Prettify some code towards PEP8
* Emphasize site of vertical stacking as it already is for horizontal stacking ("RIGHT")
* Remove console indicators
* Fix typos (again)
* Use StackExchange's full tag syntax
* Update pandas docu links
* Deduplicate some docu links
* Improve accessibility of link text
* Estimate 01-short-intro... with 30min
* Fix typo
* Prettify some code towards PEP8
* Revert "Deduplicate some docu links"
* Apply docstring conventions (https://www.python.org/dev/peps/pep-0257/)
* Fix typo
* Update & rephrase numpy.random link
* Prettify some code towards PEP8
* Update PyPI link
* Don't apply all docstring conventions
* Order keywords in docstring as in function signature
* Prettify some code towards PEP8
* Partially revert "Update PyPI link"
* Prettify some code towards PEP8
* Avoid "easily" wording
* Avoid "this link" wording
* Deduplicate link & avoid "this page" in its link text
1 parent 6f47959 commit 784512d

11 files changed (+78 −76 lines)

_episodes/00-before-we-start.md

+7 −5
@@ -44,7 +44,7 @@ mean it is easier for new members of the community to get up to speed.
 Reproducibility is the ability to obtain the same results using the same dataset(s) and analysis.
 
 Data analysis written as a Python script can be reproduced on any platform. Moreover, if you
-collect more or correct existing data, you can quickly and easily re-run your analysis!
+collect more or correct existing data, you can quickly re-run your analysis!
 
 An increasing number of journals and funding agencies expect analyses to be reproducible,
 so knowing Python will give you an edge with these requirements.
@@ -77,7 +77,7 @@ such as the IPython console, Jupyter Notebook, and Spyder IDE.
 Have a quick look around the Anaconda Navigator. You can launch programs from the Navigator or use the command line.
 
 The [Jupyter Notebook](https://jupyter.org) is an open-source web application that allows you to create
-and share documents that allow one to easilty create documents that combine code, graphs, and narrative text.
+and share documents that allow one to create documents that combine code, graphs, and narrative text.
 [Spyder][spyder-ide] is an **Integrated Development Environment** that
 allows one to write Python scripts and interact with the Python software from within a single interface.
 
@@ -147,7 +147,7 @@ default.
 
 Since we want our code and workflow to be reproducible, it is better to type the commands in
 the script editor, and save them as a script. This way, there is a complete record of what we did,
-and anyone (including our future selves!) can easily reproduce the results on their computer.
+and anyone (including our future selves!) has an easier time reproducing the results on their computer.
 
 Spyder allows you to execute commands directly from the script editor by using the run buttons on
 top. To run the entire script click _Run file_ or press <kbd>F5</kbd>, to run the current line
@@ -189,6 +189,7 @@ code to suit your purpose might make it easier for you to get started.
 * type `help()`
 * type `?object` or `help(object)` to get information about an object
 * [Python documentation][python-docs]
+* [Pandas documentation][pandas-docs]
 
 Finally, a generic Google or internet search "Python task" will often either send you to the
 appropriate module documentation or a helpful forum where someone else has already asked your
@@ -201,7 +202,7 @@ messages that might not be very helpful to diagnose a problem (e.g. "subscript o
 the message is very generic, you might also include the name of the function or package you’re using
 in your query.
 
-However, you should check Stack Overflow. Search using the `python` tag. Most questions have already
+However, you should check Stack Overflow. Search using the `[python]` tag. Most questions have already
 been answered, but the challenge is to use the right words in the search to find the answers:
 <https://stackoverflow.com/questions/tagged/python?tab=Votes>
 
@@ -245,7 +246,8 @@ ask a good question.
 [anaconda]: https://www.anaconda.com
 [anaconda-community]: https://www.anaconda.com/community
 [dive-into-python3]: https://finderiko.com/python-book
-[pypi]: https://pypi.python.org/pypi
+[pandas-docs]: https://pandas.pydata.org/pandas-docs/stable/
+[pypi]: https://pypi.org/
 [python-docs]: https://www.python.org/doc
 [python-guide]: https://docs.python-guide.org
 [python-mailing-lists]: https://www.python.org/community/lists

_episodes/01-short-introduction-to-Python.md

+3 −4
@@ -1,6 +1,6 @@
 ---
 title: Short Introduction to Programming in Python
-teaching: 0
+teaching: 30
 exercises: 0
 questions:
 - "What is Python?"
@@ -290,9 +290,8 @@ A `for` loop can be used to access the elements in a list or other Python data
 structure one at a time:
 
 ~~~
->>> for num in numbers:
-...     print(num)
-...
+for num in numbers:
+    print(num)
 ~~~
 {: .language-python}
 
_episodes/02-starting-with-data.md

+15 −14
@@ -18,7 +18,7 @@ objectives:
 - "Perform basic mathematical operations and summary statistics on data in a Pandas DataFrame."
 - "Create simple plots."
 keypoints:
-- "Libraries enable us to extend the functionality of Python."
+- "Libraries enable us to extend the functionality of Python."
 - "Pandas is a popular library for working with data."
 - "A Dataframe is a Pandas data structure that allows one to access data by column (name or index) or row."
 - "Aggregating data using the `groupby()` function enables you to generate useful summaries of data quickly."
@@ -37,7 +37,7 @@ and they can replicate the same analysis.
 
 To help the lesson run smoothly, let's ensure everyone is in the same directory.
 This should help us avoid path and file name issues. At this time please
-navigate to the workshop directory. If you working in IPython Notebook be sure
+navigate to the workshop directory. If you are working in IPython Notebook be sure
 that you start your notebook in the workshop directory.
 
 A quick aside that there are Python libraries like [OS Library][os-lib] that can work with our
@@ -93,7 +93,8 @@ record_id,month,day,year,plot_id,species_id,sex,hindfoot_length,weight
 A library in Python contains a set of tools (called functions) that perform
 tasks on our data. Importing a library is like getting a piece of lab equipment
 out of a storage locker and setting it up on the bench for use in a project.
-Once a library is set up, it can be used or called to perform many tasks.
+Once a library is set up, it can be used or called to perform the task(s)
+it was built to do.
 
 ## Pandas in Python
 One of the best options for working with tabular data in Python is to use the
@@ -124,7 +125,7 @@ time we call a Pandas function.
 # Reading CSV Data Using Pandas
 
 We will begin by locating and reading our survey data which are in CSV format. CSV stands for
-Comma-Separated Values and is a common way store formatted data. Other symbols may also be used, so
+Comma-Separated Values and is a common way to store formatted data. Other symbols may also be used, so
 you might see tab-separated, colon-separated or space separated files. It is quite easy to replace
 one separator with another, to match your application. The first line in the file often has headers
 to explain what is in each column. CSV (and other separators) make it easy to share data, and can be
@@ -486,8 +487,8 @@ summary stats.
 >
 > 1. How many recorded individuals are female `F` and how many male `M`?
 > 2. What happens when you group by two columns using the following syntax and
-> then grab mean values?
-> - `grouped_data2 = surveys_df.groupby(['plot_id','sex'])`
+> then calculate mean values?
+> - `grouped_data2 = surveys_df.groupby(['plot_id', 'sex'])`
 > - `grouped_data2.mean()`
 > 3. Summarize weight values for each site in your data. HINT: you can use the
 > following syntax to only create summary statistics for one column in your data.
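The two-column `groupby` asked about in the challenge above can be sketched on a tiny, invented DataFrame (the column names mirror the lesson's survey data; the values and `surveys_df` contents here are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the lesson's surveys_df
surveys_df = pd.DataFrame({
    'plot_id': [1, 1, 2, 2],
    'sex': ['F', 'M', 'F', 'M'],
    'weight': [10.0, 20.0, 30.0, 40.0],
})

# Grouping by two columns yields one group per (plot_id, sex) pair;
# the result of mean() is indexed by that pair
grouped_data2 = surveys_df.groupby(['plot_id', 'sex'])
mean_weights = grouped_data2['weight'].mean()
```

Selecting `['weight']` before `mean()` restricts the summary to one column; calling `grouped_data2.mean()` directly would average every numeric column instead.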
@@ -536,7 +537,7 @@ surveys_df.groupby('species_id')['record_id'].count()['DO']
 > ## Challenge - Make a list
 >
 > What's another way to create a list of species and associated `count` of the
-> records in the data? Hint: you can perform `count`, `min`, etc functions on
+> records in the data? Hint: you can perform `count`, `min`, etc. functions on
 > groupby DataFrames in the same way you can perform them on regular DataFrames.
 {: .challenge}
 
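One possible answer to the challenge in this hunk is `value_counts()`, sketched here on an invented miniature of the survey data (the `surveys_df` contents are fabricated; only the column names follow the lesson):

```python
import pandas as pd

# Invented stand-in for the lesson's survey data
surveys_df = pd.DataFrame({'record_id': [1, 2, 3],
                           'species_id': ['DO', 'DO', 'NL']})

# value_counts() counts rows per species directly,
# equivalent to groupby('species_id')['record_id'].count()
species_counts = surveys_df['species_id'].value_counts()
```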
@@ -589,13 +590,13 @@ total_count.plot(kind='bar');
 > being sex. The plot should show total weight by sex for each site. Some
 > tips are below to help you solve this challenge:
 >
-> * For more on Pandas plots, visit this [link][pandas-plot].
+> * For more information on pandas plots, see [pandas' documentation page on visualization][pandas-plot].
 > * You can use the code that follows to create a stacked bar plot but the data to stack
 > need to be in individual columns. Here's a simple example with some data where
 > 'a', 'b', and 'c' are the groups, and 'one' and 'two' are the subgroups.
 >
 > ~~~
-> d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
+> d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
 > pd.DataFrame(d)
 > ~~~
 > {: .language-python }
@@ -616,7 +617,7 @@ total_count.plot(kind='bar');
 > ~~~
 > # Plot stacked data so columns 'one' and 'two' are stacked
 > my_df = pd.DataFrame(d)
-> my_df.plot(kind='bar',stacked=True,title="The title of my graph")
+> my_df.plot(kind='bar', stacked=True, title="The title of my graph")
 > ~~~
 > {: .language-python }
 >
@@ -635,7 +636,7 @@ total_count.plot(kind='bar');
 >> First we group data by site and by sex, and then calculate a total for each site.
 >>
 >> ~~~
->> by_site_sex = surveys_df.groupby(['plot_id','sex'])
+>> by_site_sex = surveys_df.groupby(['plot_id', 'sex'])
 >> site_sex_count = by_site_sex['weight'].sum()
 >> ~~~
 >> {: .language-python}
@@ -660,7 +661,7 @@ total_count.plot(kind='bar');
 >> Below we'll use `.unstack()` on our grouped data to figure out the total weight that each sex contributed to each site.
 >>
 >> ~~~
->> by_site_sex = surveys_df.groupby(['plot_id','sex'])
+>> by_site_sex = surveys_df.groupby(['plot_id', 'sex'])
 >> site_sex_count = by_site_sex['weight'].sum()
 >> site_sex_count.unstack()
 >> ~~~
@@ -684,10 +685,10 @@ total_count.plot(kind='bar');
 >> Rather than display it as a table, we can plot the above data by stacking the values of each sex as follows:
 >>
 >> ~~~
->> by_site_sex = surveys_df.groupby(['plot_id','sex'])
+>> by_site_sex = surveys_df.groupby(['plot_id', 'sex'])
 >> site_sex_count = by_site_sex['weight'].sum()
 >> spc = site_sex_count.unstack()
->> s_plot = spc.plot(kind='bar',stacked=True,title="Total weight by site and sex")
+>> s_plot = spc.plot(kind='bar', stacked=True, title="Total weight by site and sex")
 >> s_plot.set_ylabel("Weight")
 >> s_plot.set_xlabel("Plot")
 >> ~~~
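The `.unstack()` step that this solution relies on can be checked on a small, invented dataset (the column names follow the lesson; the numbers are fabricated). The plotting call is left out so the sketch depends only on pandas:

```python
import pandas as pd

# Made-up miniature of the survey data
surveys_df = pd.DataFrame({
    'plot_id': [1, 1, 1, 2, 2],
    'sex': ['F', 'F', 'M', 'F', 'M'],
    'weight': [5.0, 7.0, 4.0, 6.0, 9.0],
})

by_site_sex = surveys_df.groupby(['plot_id', 'sex'])
site_sex_count = by_site_sex['weight'].sum()

# unstack() pivots the inner index level ('sex') into columns,
# giving one row per plot_id and one column per sex --
# exactly the shape a stacked bar plot needs
spc = site_sex_count.unstack()
```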

_episodes/04-data-types-and-format.md

+1 −1
@@ -275,7 +275,7 @@ with weight values > 0 (i.e., select meaningful weight values):
 ~~~
 len(surveys_df[pd.isnull(surveys_df.weight)])
 # How many rows have weight values?
-len(surveys_df[surveys_df.weight> 0])
+len(surveys_df[surveys_df.weight > 0])
 ~~~
 {: .language-python}
 
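The two counts contrasted in the hunk above (missing weights vs. meaningful positive weights) can be sketched on a toy column with one `NaN` (values invented):

```python
import numpy as np
import pandas as pd

# Hypothetical weight column: one missing value, one zero
surveys_df = pd.DataFrame({'weight': [8.0, np.nan, 0.0, 12.0]})

# Rows where weight is missing
n_null = len(surveys_df[pd.isnull(surveys_df.weight)])

# Rows with a meaningful (positive) weight; NaN comparisons are False,
# so missing values are excluded automatically
n_positive = len(surveys_df[surveys_df.weight > 0])
```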
_episodes/05-merging-data.md

+11 −10
@@ -20,7 +20,7 @@ keypoints:
 In many "real world" situations, the data that we want to use come in multiple
 files. We often need to combine these files into a single DataFrame to analyze
 the data. The pandas package provides [various methods for combining
-DataFrames](http://pandas.pydata.org/pandas-docs/stable/merging.html) including
+DataFrames](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) including
 `merge` and `concat`.
 
 To work through the examples below, we first need to load the species and
@@ -71,7 +71,7 @@ Take note that the `read_csv` method we used can take some additional options wh
 we didn't use previously. Many functions in Python have a set of options that
 can be set by the user if needed. In this case, we have told pandas to assign
 empty values in our CSV to NaN `keep_default_na=False, na_values=[""]`.
-[More about all of the read_csv options here.](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_csv.html)
+[More about all of the read_csv options here.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv)
 
 # Concatenating DataFrames
 
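The `keep_default_na=False, na_values=[""]` behavior described in the hunk above can be demonstrated with a hypothetical inline CSV (the contents are invented; `io.StringIO` stands in for a file on disk):

```python
import io

import pandas as pd

# A made-up CSV: the second row has an empty sex field, and its
# species_id happens to be the literal string "NA"
csv_text = "record_id,species_id,sex\n1,NL,M\n2,NA,\n"

# Only the empty string is treated as missing; pandas' default NA
# markers (like the string "NA") are switched off
df = pd.read_csv(io.StringIO(csv_text),
                 keep_default_na=False, na_values=[""])
```

With pandas' defaults, `"NA"` would also have been converted to `NaN`, silently destroying a real species code.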
@@ -85,18 +85,18 @@ survey_sub = surveys_df.head(10)
 # Grab the last 10 rows
 survey_sub_last10 = surveys_df.tail(10)
 # Reset the index values to the second dataframe appends properly
-survey_sub_last10=survey_sub_last10.reset_index(drop=True)
+survey_sub_last10 = survey_sub_last10.reset_index(drop=True)
 # drop=True option avoids adding new index column with old index values
 ~~~
 {: .language-python}
 
 When we concatenate DataFrames, we need to specify the axis. `axis=0` tells
-pandas to stack the second DataFrame under the first one. It will automatically
+pandas to stack the second DataFrame UNDER the first one. It will automatically
 detect whether the column names are the same and will stack accordingly.
 `axis=1` will stack the columns in the second DataFrame to the RIGHT of the
 first DataFrame. To stack the data vertically, we need to make sure we have the
 same columns and associated column format in both datasets. When we stack
-horizonally, we want to make sure what we are doing makes sense (ie the data are
+horizontally, we want to make sure what we are doing makes sense (i.e. the data are
 related in some way).
 
 ~~~
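The UNDER/RIGHT distinction emphasized in this hunk can be sketched with two tiny, made-up DataFrames (names `top`, `bottom`, and `right` are invented for illustration):

```python
import pandas as pd

# Two small, made-up DataFrames with the same columns
top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# axis=0 stacks `bottom` UNDER `top`; ignore_index renumbers rows 0..3
# (the same role reset_index(drop=True) plays in the lesson)
vertical = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 stacks the second frame to the RIGHT of the first
right = pd.DataFrame({'c': [9, 10]})
horizontal = pd.concat([top, right], axis=1)
```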
@@ -225,7 +225,7 @@ identifier, which is called `species_id`.
 
 Now that we know the fields with the common species ID attributes in each
 DataFrame, we are almost ready to join our data. However, since there are
-[different types of joins](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/), we
+[different types of joins][join-types], we
 also need to decide which type of join makes sense for our analysis.
 
 ## Inner joins
@@ -236,16 +236,15 @@ two DataFrames based on a join key and returns a new DataFrame that contains
 DataFrames.
 
 Inner joins yield a DataFrame that contains only rows where the value being
-joins exists in BOTH tables. An example of an inner join, adapted from [this
-page](http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/) is below:
+joined exists in BOTH tables. An example of an inner join, adapted from [Jeff Atwood's blogpost about SQL joins][join-types] is below:
 
 ![Inner join -- courtesy of codinghorror.com](../fig/inner-join.png)
 
 The pandas function for performing joins is called `merge` and an Inner join is
 the default option:
 
 ~~~
-merged_inner = pd.merge(left=survey_sub,right=species_sub, left_on='species_id', right_on='species_id')
+merged_inner = pd.merge(left=survey_sub, right=species_sub, left_on='species_id', right_on='species_id')
 # In this case `species_id` is the only column name in both dataframes, so if we skipped `left_on`
 # And `right_on` arguments we would still get the same result
 
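The "only rows in BOTH tables" behavior of the inner join in this hunk can be verified with invented stand-ins for `survey_sub` and `species_sub` (all values fabricated; the species code `'XX'` is a deliberately unmatched key):

```python
import pandas as pd

# Invented stand-ins for survey_sub and species_sub
survey_sub = pd.DataFrame({'record_id': [1, 2, 3],
                           'species_id': ['NL', 'DM', 'XX']})
species_sub = pd.DataFrame({'species_id': ['NL', 'DM'],
                            'taxa': ['Rodent', 'Rodent']})

# Inner join keeps only rows whose species_id exists in BOTH frames,
# so record 3 ('XX') drops out of the result
merged_inner = pd.merge(left=survey_sub, right=species_sub,
                        left_on='species_id', right_on='species_id')
```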
@@ -326,7 +325,7 @@ A left join is performed in pandas by calling the same `merge` function used for
 inner join, but using the `how='left'` argument:
 
 ~~~
-merged_left = pd.merge(left=survey_sub,right=species_sub, how='left', left_on='species_id', right_on='species_id')
+merged_left = pd.merge(left=survey_sub, right=species_sub, how='left', left_on='species_id', right_on='species_id')
 merged_left
 ~~~
 {: .language-python}
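The contrast with the inner join above can be sketched on the same kind of invented data: `how='left'` keeps every left-hand row, filling unmatched keys with `NaN` (all values below are fabricated for illustration):

```python
import pandas as pd

# Invented stand-ins; 'XX' has no match in species_sub
survey_sub = pd.DataFrame({'record_id': [1, 2, 3],
                           'species_id': ['NL', 'DM', 'XX']})
species_sub = pd.DataFrame({'species_id': ['NL', 'DM'],
                            'taxa': ['Rodent', 'Rodent']})

# A left join keeps all 3 survey rows; the unmatched 'XX' row
# gets NaN in the columns that came from species_sub
merged_left = pd.merge(left=survey_sub, right=species_sub, how='left',
                       left_on='species_id', right_on='species_id')
```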
@@ -421,4 +420,6 @@ The pandas `merge` function supports two other join types:
 > the number of species in the plot / the total number of individuals in the plot = Biodiversity index.
 {: .challenge}
 
+[join-types]: http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
+
 {% include links.md %}
