
Commit 137e79f

Updated Pandas lesson
1 parent f893773 commit 137e79f

2 files changed (+184, -88 lines)


README.md

+95, -44
@@ -1309,7 +1309,7 @@ type(data.columns)
- You can convert data between NumPy arrays, Series, and DataFrames
- You can read data into any of the data structures from files or from standard Python containers

### **(Optional) Beginner Challenge**

1. Read the data in `gapminder_gdp_americas.csv` into a variable called `americas` and display its summary statistics.
2. After reading the data for the Americas, use `help(americas.head)` and `help(americas.tail)` to find out what `DataFrame.head` and `DataFrame.tail` do.
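A possible solution sketch (assuming the file lives in the same `data/` folder used elsewhere in this lesson and that pandas is already imported as `pd`; the variable names are illustrative):

``` python
americas = pd.read_csv('data/gapminder_gdp_americas.csv', index_col='country')

# Summary statistics for the numeric columns
print(americas.describe())

# Read the built-in documentation, then peek at the first and last rows
help(americas.head)
help(americas.tail)
print(americas.head(3))
print(americas.tail(3))
```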
@@ -1334,7 +1334,7 @@ americas.to_csv('processed.csv')
Use `DataFrame.iloc[..., ...]` to select values by their (entry) position. The `i` in `iloc` stands for "index".

``` python
#import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

data.iloc[0,0]
```
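For comparison, the same cell can be selected by its labels with `.loc` (a sketch; it assumes the standard gapminder Europe file, whose first row label is `Albania` and whose first column is `gdpPercap_1952`):

``` python
# Label-based selection of the same value
data.loc['Albania', 'gdpPercap_1952']
```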
@@ -1353,13 +1353,30 @@ This is the most common way to get data

### Shorten the column names using vectorized string methods

1. Standard Python has string methods

``` python
big_hello = "hello".title()
print(big_hello)

help("hello".title)
print(dir("hello"))
```

2. Pandas data frames are complex objects

``` python
print(data.columns)
print(dir(data.columns.str))
```

3. Use built-in methods to transform the entire data frame

``` python
# The columns index can update all of its values in a single operation
data.columns = data.columns.str.strip("gdpPercap_")
print(data.columns)
```
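Note that `.str.strip()` removes any of the listed characters from either end, not the literal prefix; it works here only because the year digits are not in that character set. On pandas 1.4 and newer, a more explicit alternative (a sketch) is:

``` python
# Remove the exact prefix rather than a set of characters
data.columns = data.columns.str.removeprefix("gdpPercap_")
print(data.columns)
```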

### Use list slicing notation to get subsets of the data frame

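The worked examples for this heading fall outside this excerpt; a minimal sketch of label-based slicing (assuming the shortened year column names from the previous step) looks like this, and the later steps in this section operate on a `subset` of this general shape:

``` python
# Slice rows and columns by label; both endpoints are included
subset = data.loc['Italy':'Poland', '1962':'1972']
print(subset)
```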
@@ -1383,7 +1400,7 @@ print(data.columns)
``` python
data.loc[['Italy','Poland'], :]
```

4. (Optional) `.iloc` follows list index conventions ("up to, but not including"), but `.loc` does the intuitive right thing ("A through B")

``` python
index_subset = data.iloc[0:2, 0:2]
```
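To see the difference concretely, compare the shapes (a sketch assuming the standard gapminder Europe file, whose first three row labels are Albania, Austria, and Belgium):

``` python
# Positional slice 0:2 stops before position 2: 2 rows, 2 columns
print(data.iloc[0:2, 0:2].shape)

# Label slice includes both endpoints: 3 rows, 2 columns
print(data.loc['Albania':'Belgium', '1952':'1957'].shape)
```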
@@ -1402,7 +1419,7 @@ print(data.columns)
``` python
print(subset.max())
```

6. (Optional) Insert new values using `.at` (for label indexing) or `.iat` (for numerical indexing)

``` python
subset.at["Italy", "1962"] = 2000
```
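`.iat` works the same way with integer positions (a sketch; position `(0, 0)` is the first row and first column of `subset`):

``` python
# Positional counterpart of .at
subset.iat[0, 0] = 2000
print(subset.iloc[0, 0])
```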
@@ -1428,6 +1445,9 @@ print(data.columns)

``` python
print(subset.max().max())

# Alternatively
print(subset.max(axis=None))
```

### (Optional) Filter on label properties
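The examples under this heading fall outside this excerpt; one illustration of the idea (a sketch, assuming the row index holds country names) is to filter rows on a property of their labels:

``` python
# Keep only the countries whose name ends in "land"
lands = data[data.index.str.endswith("land")]
print(lands.index)
```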
@@ -1462,42 +1482,46 @@ print(data.columns)
2. Use the criterion match to filter the data frame's contents. This uses index notation:

``` python
df = subset[subset > 10000]
print(df)
```

1. `subset > 10000` returns a data frame of True/False values
2. `subset[subset > 10000]` filters its contents based on that True/False data frame: values where the mask is `True` are kept, element-wise, and everything else comes back as `NaN`.
3. This section is more properly called "Masking Data," because these operations overlay a data frame's values without changing its shape. We don't drop anything from the data frame; we just replace the filtered-out values with `NaN`.

3. (Optional) Use the `.where()` method to find elements that match the criterion:

``` python
df = subset.where(subset > 10000)
print(df)
```

### You can filter using any method that returns a data frame

For example, keep only the GDP values that are greater than the overall median.

``` python
# Get the overall median
subset.median()          # Returns a Series (one median per column)
subset.median(axis=None) # Returns a single value

# Which data points are above the median
subset > subset.median(axis=None)

# Return the masked data set
subset[subset > subset.median(axis=None)]
```

### Use method chaining to create final output without creating intermediate variables

``` python
# The .rank() method turns numerical scores into ranks
data.rank()

# Get mean rank over time and sort the output
mean_rank = data.rank().mean(axis=1).sort_values()
print(mean_rank)
```
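For comparison, the same result written with intermediate variables (a sketch):

``` python
ranked = data.rank()               # rank countries within each year (column)
yearly_mean = ranked.mean(axis=1)  # average rank per country across the years
mean_rank = yearly_mean.sort_values()
print(mean_rank)
```
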
## Working with missing data
@@ -1510,20 +1534,20 @@ Examples include min, max, mean, std, etc.

``` python
print("Column means")
print(df.mean())

print("Row means")
print(df.mean(axis=1))
```

2. Force inclusion of missing values with the `skipna` argument

``` python
print("Column means")
print(df.mean(skipna=False))

print("Row means")
print(df.mean(axis=1, skipna=False))
```

### Check for missing values
@@ -1532,41 +1556,41 @@ Examples include min, max, mean, std, etc.

``` python
# Show which items are NA
df.isna()
```

2. Count missing values

``` python
# Missing values in each column
print(df.isna().sum())

# Missing values in each row
print(df.isna().sum(axis=1))

# Total number of missing values
df.isna().sum().sum()
```

3. Are any values missing?

``` python
df.isna().any(axis=None)
```

4. (Optional) Are all of the values missing?

``` python
df.isna().all(axis=None)
```

### Replace missing values

1. Replace with a fixed value

``` python
df_fixed = df.fillna(99)
print(df_fixed)
```

2. Replace values that don't meet a criterion with an alternate value
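The worked example for this step falls outside this excerpt; one way to do it (a sketch) is `.where()` with a replacement value:

``` python
# Keep values above 10000 and replace everything else with 0 instead of NaN
df_replaced = subset.where(subset > 10000, other=0)
print(df_replaced)
```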
@@ -1579,27 +1603,51 @@ Examples include min, max, mean, std, etc.

3. (Optional) Impute missing values. Read the docs; this may or may not be sufficient for your needs.

``` python
df_imputed = df.interpolate()
```

### Drop missing values

Drop all rows with missing values

``` python
df_drop = df.dropna()
```

### **Challenge: The perils of missing data**

1. Create an array of random numbers matching the shape of the `data` data frame

``` python
# 30 rows x 12 columns matches the gapminder Europe data; assumes numpy is imported as np
random_filter = np.random.rand(30, 12) * data.max(axis=None)
```

2. Create a new data frame that filters out all numbers lower than the random numbers

3. Interpolate new values for the missing values in the new data frame. How accurate do you think they are?

#### Solution

``` python
new_data = data[data > random_filter]

# Data is not missing randomly
print(new_data)

new_data.interpolate()
new_data.interpolate().mean(axis=None)
```
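To judge how accurate the interpolated values are, it helps to compare against the original, complete data (a sketch):

``` python
# Overall mean of the complete data vs. the mean after masking and interpolation
print(data.mean(axis=None))
print(new_data.interpolate().mean(axis=None))
```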

### **(Optional) Challenge: Filter and trim with a boolean vector**

A DataFrame behaves like a dictionary of Series columns. With this in mind, experiment with the following code and try to explain what each line is doing. What operation is it performing, and what is being returned?

Feel free to use `print()`, `help()`, `type()`, etc. as you investigate.

``` python
df["1962"]
df["1962"].notna()
df[df["1962"].notna()]
```

#### Solution
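The author's solution text falls outside this excerpt. In brief (an editor's sketch): the first line selects the `1962` column as a Series, the second builds a boolean Series that is `True` wherever that column has data, and the third uses that boolean vector to keep only the rows with a 1962 value, trimming rows rather than masking values:

``` python
# Row counts before and after trimming on the 1962 column
print(len(df), len(df[df["1962"].notna()]))
```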
@@ -1614,7 +1662,10 @@ fs[fs["1962"].notna()]

``` python
# Calculate z scores for all elements
# z = (data - data.mean(axis=None))/data.std()
# As of July 2024, pandas DataFrame.std(axis=None) doesn't work. We are dropping down to
# NumPy to use the .std() method on the underlying values array (ddof=1 matches pandas' default).
z = (data - data.mean(axis=None))/data.values.std(ddof=1)

# Get the mean z score for each country (i.e. across all columns)
mean_z = z.mean(axis=1)
```
