|
24 | 24 | - <a href="#sorting-and-grouping" id="toc-sorting-and-grouping"><span class="toc-section-number">2.11</span> Sorting and grouping</a>
|
25 | 25 | - <a href="#write-output" id="toc-write-output"><span class="toc-section-number">2.12</span> Write output</a>
|
26 | 26 | - <a href="#working-with-multiple-tables" id="toc-working-with-multiple-tables"><span class="toc-section-number">2.13</span> Working with multiple tables</a>
|
27 |
| - - <a href="#optional-adding-rows-to-dataframes" id="toc-optional-adding-rows-to-dataframes"><span class="toc-section-number">2.14</span> (Optional) Adding rows to DataFrames</a> |
28 |
| - - <a href="#optional-scientific-computing-libraries" id="toc-optional-scientific-computing-libraries"><span class="toc-section-number">2.15</span> (Optional) Scientific Computing Libraries</a> |
29 |
| - - <a href="#optional-things-we-didnt-talk-about" id="toc-optional-things-we-didnt-talk-about"><span class="toc-section-number">2.16</span> (Optional) Things we didn't talk about</a> |
30 |
| - - <a href="#optional-pandas-method-chaining-in-the-wild" id="toc-optional-pandas-method-chaining-in-the-wild"><span class="toc-section-number">2.17</span> (Optional) Pandas method chaining in the wild</a> |
31 |
| - - <a href="#optional-introspecting-on-the-dataframe-object" id="toc-optional-introspecting-on-the-dataframe-object"><span class="toc-section-number">2.18</span> (Optional) Introspecting on the DataFrame object</a> |
32 |
| - - <a href="#carpentries-version-group-by-split-apply-combine" id="toc-carpentries-version-group-by-split-apply-combine"><span class="toc-section-number">2.19</span> (Carpentries version) Group By: split-apply-combine</a> |
| 27 | + - <a href="#optional-text-processing-in-pandas" id="toc-optional-text-processing-in-pandas"><span class="toc-section-number">2.14</span> (Optional) Text processing in Pandas</a> |
| 28 | + - <a href="#optional-adding-rows-to-dataframes" id="toc-optional-adding-rows-to-dataframes"><span class="toc-section-number">2.15</span> (Optional) Adding rows to DataFrames</a> |
| 29 | + - <a href="#optional-scientific-computing-libraries" id="toc-optional-scientific-computing-libraries"><span class="toc-section-number">2.16</span> (Optional) Scientific Computing Libraries</a> |
| 30 | + - <a href="#optional-things-we-didnt-talk-about" id="toc-optional-things-we-didnt-talk-about"><span class="toc-section-number">2.17</span> (Optional) Things we didn't talk about</a> |
| 31 | + - <a href="#optional-pandas-method-chaining-in-the-wild" id="toc-optional-pandas-method-chaining-in-the-wild"><span class="toc-section-number">2.18</span> (Optional) Pandas method chaining in the wild</a> |
| 32 | + - <a href="#optional-introspecting-on-the-dataframe-object" id="toc-optional-introspecting-on-the-dataframe-object"><span class="toc-section-number">2.19</span> (Optional) Introspecting on the DataFrame object</a> |
| 33 | + - <a href="#carpentries-version-group-by-split-apply-combine" id="toc-carpentries-version-group-by-split-apply-combine"><span class="toc-section-number">2.20</span> (Carpentries version) Group By: split-apply-combine</a> |
33 | 34 | - <a href="#building-programs-week-3" id="toc-building-programs-week-3"><span class="toc-section-number">3</span> Building Programs (Week 3)</a>
|
34 | 35 | - <a href="#notebooks-vs-python-scripts" id="toc-notebooks-vs-python-scripts"><span class="toc-section-number">3.1</span> Notebooks vs Python scripts</a>
|
35 | 36 | - <a href="#python-from-the-terminal" id="toc-python-from-the-terminal"><span class="toc-section-number">3.2</span> Python from the terminal</a>
|
@@ -1666,6 +1667,59 @@ print(df3.shape)
|
1666 | 1667 | print(df_birds.shape)
|
1667 | 1668 | ```
|
1668 | 1669 |
|
| 1670 | +## (Optional) Text processing in Pandas |
| 1671 | +
|
| 1672 | +See <https://pandas.pydata.org/docs/user_guide/text.html> |
| 1673 | +
|
| 1674 | +1. Import tabular data that contains strings |
| 1675 | +
|
| 1676 | + ``` python |
| 1677 | + species = pd.read_csv('data/species.csv', index_col='species_id') |
| 1678 | +
|
| 1679 | + # You can explicitly set all of the columns to type string |
| 1680 | + # species = pd.read_csv('data/species.csv', index_col='species_id', dtype='string') |
| 1681 | +
|
| 1682 | + # ...or specify the type of individual columns |
| 1683 | + # species = pd.read_csv('data/species.csv', index_col='species_id', |
| 1684 | + # dtype = {"genus": "string", |
| 1685 | + # "species": "string", |
| 1686 | + # "taxa": "string"}) |
| 1687 | +
|
| 1688 | + print(species.head()) |
| 1689 | + species.info()  # info() prints its summary itself and returns None |
| 1690 | + print(species.describe()) |
| 1691 | + ``` |
| 1692 | +
|
| 1693 | +2. A Pandas Series exposes string methods through its `.str` accessor; they operate on the entire Series at once |
| 1694 | +
|
| 1695 | + ``` python |
| 1696 | + # Two ways of getting an individual column |
| 1697 | + print(type(species.genus)) |
| 1698 | + print(type(species["genus"])) |
| 1699 | +
|
| 1700 | + # Inspect the available string methods |
| 1701 | + print(dir(species["genus"].str)) |
| 1702 | + ``` |
| 1703 | +
|
| 1704 | +3. Use string methods for filtering |
| 1705 | +
|
| 1706 | + ``` python |
| 1707 | + # Which species are in the taxon "Bird"? |
| 1708 | + print(species["taxa"].str.startswith("Bird")) |
| 1709 | +
|
| 1710 | + # Filter the dataset to only look at Birds |
| 1711 | + print(species[species["taxa"].str.startswith("Bird")]) |
| 1712 | + ``` |
| 1713 | +
|
| 1714 | +4. Use string methods to transform and combine data |
| 1715 | +
|
| 1716 | + ``` python |
| 1717 | + binomial_name = species["genus"].str.cat(species["species"].str.title(), sep=" ") |
| 1718 | + species["binomial"] = binomial_name |
| 1719 | +
|
| 1720 | + print(species.head()) |
| 1721 | + ``` |
| 1722 | +
|
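The filtering and combining steps above can be tried without the CSV file. The following is a minimal sketch, assuming a small hand-made DataFrame in place of `data/species.csv`; the rows, the `XX` identifier, and the `"unknown"` placeholder are hypothetical and only illustrate how the string methods treat missing values.

```python
import pandas as pd

# Hypothetical stand-in for data/species.csv; the rows are illustrative only
species = pd.DataFrame(
    {
        "genus": ["Amphispiza", "Dipodomys", None],
        "species": ["bilineata", "merriami", "sp."],
        "taxa": ["Bird", "Rodent", None],
    },
    index=pd.Index(["AB", "DM", "XX"], name="species_id"),
)

# na=False treats missing values as "no match", so the mask is purely boolean
is_bird = species["taxa"].str.startswith("Bird", na=False)
birds = species[is_bird]

# sep is the separator; na_rep replaces missing values instead of propagating them
binomial = species["genus"].str.cat(species["species"], sep=" ", na_rep="unknown")

print(birds)
print(binomial)
```

Passing `na=False` is one way to keep the boolean mask usable for indexing when a column contains missing values; whether the real dataset needs it depends on its contents.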
1669 | 1723 | ## (Optional) Adding rows to DataFrames
|
1670 | 1724 |
|
1671 | 1725 | A row is a view onto the *nth* item of each of the column Series. Appending rows is a performance bottleneck because it requires a separate append operation for each Series. You should concatenate DataFrames instead.
|
|
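As a rough illustration of concatenating instead of appending row by row, the sketch below collects several small DataFrames in a list and combines them with a single `pd.concat` call. The `records` list and its values are hypothetical and stand in for rows produced elsewhere, for example inside a loop.

```python
import pandas as pd

# Hypothetical rows, e.g. produced one at a time by a loop
records = [
    {"species_id": "AB", "taxa": "Bird"},
    {"species_id": "DM", "taxa": "Rodent"},
    {"species_id": "NL", "taxa": "Rodent"},
]

# Build one small DataFrame per record, then combine them in a single call
frames = [pd.DataFrame([record]) for record in records]
combined = pd.concat(frames, ignore_index=True)

print(combined)
print(combined.shape)
```

With `ignore_index=True` the result gets a fresh 0..n-1 index rather than repeating the index of each piece.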