README.md
+95-44
@@ -1309,7 +1309,7 @@ type(data.columns)
1309
1309
- You can convert data between NumPy arrays, Series, and DataFrames
1310
1310
- You can read data into any of the data structures from files or from standard Python containers
1311
1311
1312
-
### **Beginner Challenge**
1312
+
### **(Optional) Beginner Challenge**
1313
1313
1314
1314
1. Read the data in `gapminder_gdp_americas.csv` into a variable called `americas` and display its summary statistics.
1315
1315
2. After reading the data for the Americas, use `help(americas.head)` and `help(americas.tail)` to find out what `DataFrame.head` and `DataFrame.tail` do.
### Use list slicing notation to get subsets of the data frame
1365
1382
@@ -1383,7 +1400,7 @@ print(data.columns)
1383
1400
data.loc[['Italy','Poland'], :]
1384
1401
```
1385
1402
1386
-
4. `.iloc` follows list index conventions ("up to, but not including"), but `.loc` does the intuitive right thing ("A through B")
1403
+
4. (Optional) `.iloc` follows list index conventions ("up to, but not including"), but `.loc` does the intuitive right thing ("A through B")
1387
1404
1388
1405
``` python
1389
1406
index_subset = data.iloc[0:2, 0:2]
@@ -1402,7 +1419,7 @@ print(data.columns)
1402
1419
print(subset.max())
1403
1420
```
1404
1421
1405
-
6. Insert new values using `.at` (for label indexing) or `.iat` (for numerical indexing)
1422
+
6. (Optional) Insert new values using `.at` (for label indexing) or `.iat` (for numerical indexing)
1406
1423
1407
1424
``` python
1408
1425
subset.at["Italy", "1962"] = 2000
@@ -1428,6 +1445,9 @@ print(data.columns)
1428
1445
1429
1446
``` python
1430
1447
print(subset.max().max())
1448
+
1449
+
# Alternatively
1450
+
print(subset.max(axis=None))
1431
1451
```
1432
1452
1433
1453
### (Optional) Filter on label properties
@@ -1462,42 +1482,46 @@ print(data.columns)
1462
1482
2. Use the criterion match to filter the data frame's contents. This uses index notation:
1463
1483
1464
1484
``` python
1465
-
fs = subset[subset > 10000]
1466
-
print(fs)
1485
+
df = subset[subset > 10000]
1486
+
print(df)
1467
1487
```
1468
1488
1469
1489
1. `subset > 10000` returns a data frame of True/False values
1470
-
2. `subset[subset > 10000]` filters its contents based on that True/False data frame
1490
+
2. `subset[subset > 10000]` filters its contents based on that True/False data frame. Values are kept element-wise wherever the mask is `True`.
1471
1491
3. This section is more properly called "Masking Data," because it involves operations for overlaying a data frame's values without changing the data frame's shape. We don't drop anything from the data frame; we just replace the values that fail the criterion with `NaN`.
1472
1492
1473
1493
3. (Optional) Use the `.where()` method to find elements that match the criterion:
1474
1494
1475
1495
``` python
1476
-
fs = subset.where(subset > 10000)
1477
-
print(fs)
1496
+
df = subset.where(subset > 10000)
1497
+
print(df)
1478
1498
```
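
The two masking approaches above can be compared side by side. This is a minimal sketch: the small `subset` data frame and its values are made up for illustration and are not the Gapminder data used in the lesson.

``` python
import pandas as pd

# Hypothetical stand-in for `subset`; the numbers are invented for illustration.
subset = pd.DataFrame(
    {"1962": [8243.58, 5338.75], "1967": [10022.40, 6557.15]},
    index=["Italy", "Poland"],
)

mask = subset > 10000      # data frame of True/False values
masked = subset[mask]      # values where the mask is False become NaN
same = subset.where(mask)  # .where() with the same mask

print(masked)
print(masked.shape == subset.shape)  # the shape is unchanged
print(masked.equals(same))           # both approaches give the same result
```

Nothing is dropped in either case, which is why "masking" describes the operation better than "filtering."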
1479
1499
1480
1500
### You can filter using any method that returns a data frame
1481
1501
1502
+
For example, get every GDP value that is greater than the overall median.
1503
+
1482
1504
``` python
1483
-
# GDP for all countries greater than the median
1484
-
subset[subset >subset.median()]
1505
+
# Get the overall median
1506
+
subset.median() # Returns Series
1507
+
subset.median(axis=None) # Returns single value
1485
1508
1486
-
# OR: subset.where(subset > subset.median())
1509
+
# Which data points are above the median
1510
+
subset > subset.median(axis=None)
1511
+
1512
+
# Return the masked data set
1513
+
subset[subset > subset.median(axis=None)]
1487
1514
```
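
Comparing against `subset.median()` and comparing against a single overall median produce different masks, because a Series of medians is matched to the columns one by one. The sketch below uses made-up numbers, and it uses `np.median` on the underlying array as a stand-in for `median(axis=None)`, which may not be available in older pandas releases.

``` python
import numpy as np
import pandas as pd

# Hypothetical data; the values are invented to make the difference visible.
subset = pd.DataFrame(
    {"1962": [1.0, 2.0, 3.0], "1967": [4.0, 5.0, 60.0]},
    index=["A", "B", "C"],
)

per_column = subset.median()            # Series: one median per column
overall = np.median(subset.to_numpy())  # single value over every element

# Each column is compared with its own median
above_column = subset[subset > per_column]

# Every element is compared with the same overall median
above_overall = subset[subset > overall]

print(above_column)
print(above_overall)
```

With these numbers, the per-column mask keeps only row `C`, while the overall mask keeps the whole `1967` column.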
1488
1515
1489
1516
### Use method chaining to create final output without creating intermediate variables
1490
1517
1491
1518
``` python
1492
1519
# The .rank() method turns numerical scores into ranks
1493
-
subset.rank()
1494
-
```
1520
+
data.rank()
1495
1521
1496
-
``` python
1497
-
# GDP ranking for all countries greater than the median
2. Create a new data frame that filters out all numbers lower than the random numbers
1626
+
1627
+
3. Interpolate new values for the missing values in the new data frame. How accurate do you think they are?
1628
+
1629
+
#### Solution
1630
+
1631
+
``` python
1632
+
new_data = data[data > random_filter]
1633
+
1634
+
# Data is not missing randomly
1635
+
print(new_data)
1636
+
1637
+
new_data.interpolate()
1638
+
new_data.interpolate().mean(axis=None)
1591
1639
```
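
To see what `.interpolate()` does to masked values, it can help to try it on a data frame small enough to check by hand. This is only a sketch: the single-column `data` and the `random_filter` values below are hypothetical and stand in for the ones built earlier in the challenge.

``` python
import numpy as np
import pandas as pd

# Hypothetical one-column data frame and filter; the values are invented.
data = pd.DataFrame({"gdp": [10.0, 20.0, 30.0, 40.0, 50.0]})
random_filter = pd.DataFrame({"gdp": [15.0, 5.0, 35.0, 5.0, 5.0]})

# Values not greater than the filter are masked with NaN
new_data = data[data > random_filter]
print(new_data)

# The default is linear interpolation down each column
filled = new_data.interpolate()
print(filled)
```

The gap between 20 and 40 is filled with 30, but the leading `NaN` stays missing because there is no earlier value to interpolate from. In the lesson's data, rows are countries, so interpolating down a column blends values from neighboring countries, which is worth keeping in mind when judging accuracy.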
1592
1640
1593
-
### **Challenge: Filter and trim with a boolean vector**
1641
+
### **(Optional) Challenge: Filter and trim with a boolean vector**
1594
1642
1595
1643
A DataFrame is a dictionary of Series columns. With this in mind, experiment with the following code and try to explain what each line is doing. What operation is it performing, and what is being returned?
1596
1644
1597
1645
Feel free to use `print()`, `help()`, `type()`, etc. as you investigate.
1598
1646
1599
1647
``` python
1600
-
fs["1962"]
1601
-
fs["1962"].notna()
1602
-
fs[fs["1962"].notna()]
1648
+
df["1962"]
1649
+
df["1962"].notna()
1650
+
df[df["1962"].notna()]
1603
1651
```
1604
1652
1605
1653
#### Solution
@@ -1614,7 +1662,10 @@ fs[fs["1962"].notna()]
1614
1662
1615
1663
``` python
1616
1664
# Calculate z scores for all elements
1617
-
z = (data - data.mean())/data.std()
1665
+
# z = (data - data.mean(axis=None))/data.std()
1666
+
# As of July 2024, pandas DataFrame.std(axis=None) doesn't work. We are dropping down to
1667
+
# NumPy to use the .std() method on the underlying values array.
1668
+
z = (data - data.mean(axis=None))/data.values.std(ddof=1)
1618
1669
1619
1670
# Get the mean z score for each country (i.e. across all columns)