Commit 45450b9

Added section on text processing in Pandas
1 parent 3805555 commit 45450b9

2 files changed

+108
-6
lines changed

README.md

Lines changed: 60 additions & 6 deletions
@@ -24,12 +24,13 @@
 - <a href="#sorting-and-grouping" id="toc-sorting-and-grouping"><span class="toc-section-number">2.11</span> Sorting and grouping</a>
 - <a href="#write-output" id="toc-write-output"><span class="toc-section-number">2.12</span> Write output</a>
 - <a href="#working-with-multiple-tables" id="toc-working-with-multiple-tables"><span class="toc-section-number">2.13</span> Working with multiple tables</a>
-- <a href="#optional-adding-rows-to-dataframes" id="toc-optional-adding-rows-to-dataframes"><span class="toc-section-number">2.14</span> (Optional) Adding rows to DataFrames</a>
-- <a href="#optional-scientific-computing-libraries" id="toc-optional-scientific-computing-libraries"><span class="toc-section-number">2.15</span> (Optional) Scientific Computing Libraries</a>
-- <a href="#optional-things-we-didnt-talk-about" id="toc-optional-things-we-didnt-talk-about"><span class="toc-section-number">2.16</span> (Optional) Things we didn't talk about</a>
-- <a href="#optional-pandas-method-chaining-in-the-wild" id="toc-optional-pandas-method-chaining-in-the-wild"><span class="toc-section-number">2.17</span> (Optional) Pandas method chaining in the wild</a>
-- <a href="#optional-introspecting-on-the-dataframe-object" id="toc-optional-introspecting-on-the-dataframe-object"><span class="toc-section-number">2.18</span> (Optional) Introspecting on the DataFrame object</a>
-- <a href="#carpentries-version-group-by-split-apply-combine" id="toc-carpentries-version-group-by-split-apply-combine"><span class="toc-section-number">2.19</span> (Carpentries version) Group By: split-apply-combine</a>
+- <a href="#optional-text-processing-in-pandas" id="toc-optional-text-processing-in-pandas"><span class="toc-section-number">2.14</span> (Optional) Text processing in Pandas</a>
+- <a href="#optional-adding-rows-to-dataframes" id="toc-optional-adding-rows-to-dataframes"><span class="toc-section-number">2.15</span> (Optional) Adding rows to DataFrames</a>
+- <a href="#optional-scientific-computing-libraries" id="toc-optional-scientific-computing-libraries"><span class="toc-section-number">2.16</span> (Optional) Scientific Computing Libraries</a>
+- <a href="#optional-things-we-didnt-talk-about" id="toc-optional-things-we-didnt-talk-about"><span class="toc-section-number">2.17</span> (Optional) Things we didn't talk about</a>
+- <a href="#optional-pandas-method-chaining-in-the-wild" id="toc-optional-pandas-method-chaining-in-the-wild"><span class="toc-section-number">2.18</span> (Optional) Pandas method chaining in the wild</a>
+- <a href="#optional-introspecting-on-the-dataframe-object" id="toc-optional-introspecting-on-the-dataframe-object"><span class="toc-section-number">2.19</span> (Optional) Introspecting on the DataFrame object</a>
+- <a href="#carpentries-version-group-by-split-apply-combine" id="toc-carpentries-version-group-by-split-apply-combine"><span class="toc-section-number">2.20</span> (Carpentries version) Group By: split-apply-combine</a>
 - <a href="#building-programs-week-3" id="toc-building-programs-week-3"><span class="toc-section-number">3</span> Building Programs (Week 3)</a>
 - <a href="#notebooks-vs-python-scripts" id="toc-notebooks-vs-python-scripts"><span class="toc-section-number">3.1</span> Notebooks vs Python scripts</a>
 - <a href="#python-from-the-terminal" id="toc-python-from-the-terminal"><span class="toc-section-number">3.2</span> Python from the terminal</a>
@@ -1666,6 +1667,59 @@ print(df3.shape)
 print(df_birds.shape)
 ```
 
+## (Optional) Text processing in Pandas
+
+cf. <https://pandas.pydata.org/docs/user_guide/text.html>
+
+1. Import tabular data that contains strings
+
+``` python
+species = pd.read_csv('data/species.csv', index_col='species_id')
+
+# You can explicitly set all of the columns to type string
+# species = pd.read_csv('data/species.csv', index_col='species_id', dtype='string')
+
+# ...or specify the type of individual columns
+# species = pd.read_csv('data/species.csv', index_col='species_id',
+# dtype = {"genus": "string",
+# "species": "string",
+# "taxa": "string"})
+
+print(species.head())
+print(species.info())
+print(species.describe())
+```
+
+2. A Pandas Series has string methods that operate on the entire Series at once
+
+``` python
+# Two ways of getting an individual column
+print(type(species.genus))
+print(type(species["genus"]))
+
+# Inspect the available string methods
+print(dir(species["genus"].str))
+```
+
+3. Use string methods for filtering
+
+``` python
+# Which species are in the taxa "Bird"?
+print(species["taxa"].str.startswith("Bird"))
+
+# Filter the dataset to only look at Birds
+print(species[species["taxa"].str.startswith("Bird")])
+```
+
+4. Use string methods to transform and combine data
+
+``` python
+binomial_name = species["genus"].str.cat(species["species"].str.title(), " ")
+species["binomial"] = binomial_name
+
+print(species.head())
+```
+
 ## (Optional) Adding rows to DataFrames
 
 A row is a view onto the *nth* item of each of the column Series. Appending rows is a performance bottleneck because it requires a separate append operation for each Series. You should concatenate data frames instead.
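
The advice in the unchanged context paragraph above (concatenate data frames rather than append row by row) can be illustrated with a minimal sketch. The frames and column names below are made up for illustration and are not part of the commit; the only pandas call involved is `pd.concat` with `ignore_index=True`.

```python
import pandas as pd

# Hypothetical per-batch results; in practice these might come from a loop
# that reads or computes one small DataFrame at a time.
chunks = [
    pd.DataFrame({"species_id": ["AB"], "count": [3]}),
    pd.DataFrame({"species_id": ["DM"], "count": [7]}),
    pd.DataFrame({"species_id": ["CT"], "count": [1]}),
]

# Collect the pieces in a list and concatenate once at the end, instead of
# growing a DataFrame one row at a time.
combined = pd.concat(chunks, ignore_index=True)

print(combined)
print(combined.shape)
```

Collecting the pieces in a plain list keeps each step cheap, and the single `pd.concat` call builds every column once.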

README.org

Lines changed: 48 additions & 0 deletions
@@ -1364,6 +1364,54 @@ print(df3.shape)
 print(df_birds.shape)
 #+END_SRC
 
+** (Optional) Text processing in Pandas
+cf. https://pandas.pydata.org/docs/user_guide/text.html
+
+1. Import tabular data that contains strings
+#+BEGIN_SRC python
+species = pd.read_csv('data/species.csv', index_col='species_id')
+
+# You can explicitly set all of the columns to type string
+# species = pd.read_csv('data/species.csv', index_col='species_id', dtype='string')
+
+# ...or specify the type of individual columns
+# species = pd.read_csv('data/species.csv', index_col='species_id',
+# dtype = {"genus": "string",
+# "species": "string",
+# "taxa": "string"})
+
+print(species.head())
+print(species.info())
+print(species.describe())
+#+END_SRC
+
+2. A Pandas Series has string methods that operate on the entire Series at once
+#+BEGIN_SRC python
+# Two ways of getting an individual column
+print(type(species.genus))
+print(type(species["genus"]))
+
+# Inspect the available string methods
+print(dir(species["genus"].str))
+#+END_SRC
+
+3. Use string methods for filtering
+#+BEGIN_SRC python
+# Which species are in the taxa "Bird"?
+print(species["taxa"].str.startswith("Bird"))
+
+# Filter the dataset to only look at Birds
+print(species[species["taxa"].str.startswith("Bird")])
+#+END_SRC
+
+4. Use string methods to transform and combine data
+#+BEGIN_SRC python
+binomial_name = species["genus"].str.cat(species["species"].str.title(), " ")
+species["binomial"] = binomial_name
+
+print(species.head())
+#+END_SRC
+
 ** (Optional) Adding rows to DataFrames
 A row is a view onto the /nth/ item of each of the column Series. Appending rows is a performance bottleneck because it requires a separate append operation for each Series. You should concatenate data frames instead.
 
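
The filtering and combining steps added in this commit can be tried without `data/species.csv`. The sketch below assumes a small hand-made stand-in table (the rows are illustrative, not the real file) and shows two variations that are not in the commit: passing `na=False` to `str.startswith` so that missing values count as non-matches, and lowercasing the species epithet, as binomial names are conventionally written, instead of title-casing it.

```python
import pandas as pd

# Hypothetical stand-in for data/species.csv; the rows are for illustration only.
species = pd.DataFrame(
    {
        "genus": ["Amphispiza", "Dipodomys", "Cnemidophorus"],
        "species": ["bilineata", "merriami", "tigris"],
        "taxa": ["Bird", "Rodent", None],  # one missing value on purpose
    },
    index=pd.Index(["AB", "DM", "CT"], name="species_id"),
)

# With missing values in the column, the mask from str.startswith can contain
# NaN, which boolean indexing may reject. na=False treats those rows as
# non-matches, so the mask is strictly True/False.
is_bird = species["taxa"].str.startswith("Bird", na=False)
birds = species[is_bird]

# Join genus and species with a single space, keeping the epithet lowercase.
binomial = species["genus"].str.cat(species["species"].str.lower(), sep=" ")

print(birds)
print(binomial)
```

Passing `sep` by keyword makes the separator explicit; the commit passes the same value positionally.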
