Commit b27584a

Revised exercises for Programming section

1 parent 45450b9 commit b27584a
File tree

2 files changed: +248 −135 lines


README.md

Lines changed: 122 additions & 70 deletions
@@ -33,11 +33,11 @@
 - <a href="#carpentries-version-group-by-split-apply-combine" id="toc-carpentries-version-group-by-split-apply-combine"><span class="toc-section-number">2.20</span> (Carpentries version) Group By: split-apply-combine</a>
 - <a href="#building-programs-week-3" id="toc-building-programs-week-3"><span class="toc-section-number">3</span> Building Programs (Week 3)</a>
 - <a href="#notebooks-vs-python-scripts" id="toc-notebooks-vs-python-scripts"><span class="toc-section-number">3.1</span> Notebooks vs Python scripts</a>
-- <a href="#python-from-the-terminal" id="toc-python-from-the-terminal"><span class="toc-section-number">3.2</span> Python from the terminal</a>
+- <a href="#optional-python-from-the-terminal" id="toc-optional-python-from-the-terminal"><span class="toc-section-number">3.2</span> (Optional) Python from the terminal</a>
 - <a href="#looping-over-data-sets" id="toc-looping-over-data-sets"><span class="toc-section-number">3.3</span> Looping Over Data Sets</a>
 - <a href="#conditionals" id="toc-conditionals"><span class="toc-section-number">3.4</span> Conditionals</a>
-- <a href="#generic-file-handling" id="toc-generic-file-handling"><span class="toc-section-number">3.5</span> Generic file handling</a>
-- <a href="#text-processing" id="toc-text-processing"><span class="toc-section-number">3.6</span> Text processing</a>
+- <a href="#optional-generic-file-handling" id="toc-optional-generic-file-handling"><span class="toc-section-number">3.5</span> (Optional) Generic file handling</a>
+- <a href="#optional-text-processing-and-data-cleanup" id="toc-optional-text-processing-and-data-cleanup"><span class="toc-section-number">3.6</span> (Optional) Text processing and data cleanup</a>
 - <a href="#writing-functions" id="toc-writing-functions"><span class="toc-section-number">3.7</span> Writing Functions</a>
 - <a href="#carpentries-version-conditionals" id="toc-carpentries-version-conditionals"><span class="toc-section-number">3.8</span> (Carpentries version) Conditionals</a>
 - <a href="#optional-variable-scope" id="toc-optional-variable-scope"><span class="toc-section-number">3.9</span> (Optional) Variable Scope</a>
@@ -1057,22 +1057,22 @@ Introductory documentation: <https://numpy.org/doc/stable/user/quickstart.html>
 import numpy as np

 # Create an array of random numbers
-rand = np.random.rand(3, 4)
-print(rand)
+m_rand = np.random.rand(3, 4)
+print(m_rand)
 ```

 2. Arrays are indexed like lists

 ``` python
-print(rand[0,0])
+print(m_rand[0,0])
 ```

 3. Arrays have attributes

 ``` python
-print(rand.shape)
-print(rand.size)
-print(rand.ndim)
+print(m_rand.shape)
+print(m_rand.size)
+print(m_rand.ndim)
 ```

 4. Arrays are fast but inflexible - the entire array must be of a single type.
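Point 4 above is worth seeing in action. A minimal sketch (not part of the commit) of what "a single type" means in practice:

``` python
import numpy as np

# Mixing ints and floats upcasts the whole array to one dtype
a = np.array([1, 2, 3.5])
print(a.dtype)  # float64

# Assigning a float into an integer array silently truncates it
b = np.array([1, 2, 3])
b[0] = 9.7
print(b[0])  # 9
```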
@@ -1362,6 +1362,13 @@ print(data.columns)
 print(subset.max())
 ```

+6. Insert new values using `.at` (for label indexing) or `.iat` (for numerical indexing)
+
+``` python
+subset.at["Italy", "1962"] = 2000
+print(subset)
+```
+
 ### **Challenge**: Collection types

 1. Calculate `subset.max()` and assign the result to a variable. What kind of thing is it? What are its properties?
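The new `.at`/`.iat` step can be illustrated on a toy frame (hypothetical values standing in for the Gapminder subset, not part of the commit):

``` python
import pandas as pd

# Toy stand-in for the Gapminder subset (hypothetical values)
df = pd.DataFrame({"1952": [100.0, 200.0], "1962": [110.0, 210.0]},
                  index=["Italy", "France"])

# Label-based single-value assignment with .at
df.at["Italy", "1962"] = 2000

# Position-based equivalent with .iat: row 0, column 1
print(df.iat[0, 1])  # 2000.0
```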
@@ -1766,45 +1773,60 @@ Scikit-Learn documentation: <https://scikit-learn.org/stable/>

 ``` python
 from sklearn import linear_model
-from sklearn.metrics import mean_squared_error, r2_score

 # Create some random data
-x = np.random.rand(10)
-y = np.random.rand(10)
+x_train = np.random.rand(20)
+y = np.random.rand(20)

 # Fit a linear model
 reg = linear_model.LinearRegression()
-reg.fit(x.reshape(-1,1), y)
+reg.fit(x_train.reshape(-1,1), y)
 print("Regression slope:", reg.coef_)
 ```

 2. Estimate model fit

 ``` python
-# Generate prediction data. This should properly be generated from hold-out X data.
-y_prediction = reg.predict(x.reshape(-1,1))
+from sklearn.metrics import mean_squared_error, r2_score
+
+# Test model fit with new data
+x_test = np.random.rand(20)
+y_prediction = reg.predict(x_test.reshape(-1,1))

+# Get model stats
 mse = mean_squared_error(y, y_prediction)
 r2 = r2_score(y, y_prediction)

 print("Mean squared error:", "{:.3f}".format(mse))
 print("R squared:", "{:.3f}".format(r2))
 ```

-3. Inspect our prediction
+3. (Optional) Inspect our prediction

 ``` python
 import matplotlib.pyplot as plt

 fig, ax = plt.subplots()
-ax.scatter(x, y, color="black")
-ax.plot(x, y_prediction, color="blue")
+ax.scatter(x_train, y, color="black")
+ax.plot(x_test, y_prediction, color="blue")

 # `fig` in Jupyter Lab
 fig.show()
 ```

-### (Optional) Statsmodels regression example
+4. (Optional) Compare with Statsmodels
+
+``` python
+# Load modules and data
+import statsmodels.api as sm
+
+# Fit and summarize OLS model (center the data to get an accurate model fit)
+mod = sm.OLS(y - y.mean(), x_train - x_train.mean())
+res = mod.fit()
+
+print(res.summary())
+```
+
+### (Optional) Statsmodels regression example with applied data

 1. Import data
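The fit statistics above compare predictions on `x_test` against the training targets `y`. A more conventional hold-out evaluation splits a single dataset; a sketch using `train_test_split` (assumed helper from `sklearn.model_selection`, not part of the commit, with hypothetical linear data so the fit is meaningful):

``` python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
x = rng.random(100)
y = 3 * x + rng.normal(scale=0.1, size=100)

# Hold out a test set so the statistics describe unseen data
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

reg = linear_model.LinearRegression()
reg.fit(x_train.reshape(-1, 1), y_train)
y_pred = reg.predict(x_test.reshape(-1, 1))

print("Mean squared error:", "{:.3f}".format(mean_squared_error(y_test, y_pred)))
print("R squared:", "{:.3f}".format(r2_score(y_test, y_pred)))
```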
@@ -1946,7 +1968,7 @@ Broadly, a trade-off between managing big code bases and making it easy to experiment.
 3. Version control
 4. Remote scripts

-## Python from the terminal
+## (Optional) Python from the terminal

 1. Python is an interactive interpreter (REPL)
@@ -2114,24 +2136,7 @@ else:
 - Always associated with an `if`.
 - Must come before the `else` (which is the “catch all”).

-### (Optional) Conditionals are often used inside loops
-
-Not much point using a conditional when we know the value (as above), but useful when we have a collection to process.
-
-``` python
-masses = [3.54, 2.07, 9.22, 1.86, 1.71]
-for m in masses:
-    if m > 9.0:
-        print(m, 'is HUGE')
-    elif m > 3.0:
-        print(m, 'is large')
-    else:
-        print(m, 'is small')
-```
-
-### <span class="todo TODO">TODO</span> Use enumeration to print occasional status messages for long-running processes
-
-### (Optional) Conditions are tested once, in order
+### Conditions are tested once, in order

 Python steps through the branches of the conditional in order, testing each in turn. Order matters! The following is wrong:
@@ -2145,21 +2150,20 @@ elif grade >= 90:
     print('grade is A')
 ```

-### (Optional) Compound Relations Using `and`, `or`, and Parentheses
+### Compound Relations Using `and`, `or`, and Parentheses

 Often, you want some combination of things to be true. You can combine relations within a conditional using `and` and `or`. Continuing the example above, suppose you have:

 ``` python
 mass = [ 3.54, 2.07, 9.22, 1.86, 1.71]
 velocity = [10.00, 20.00, 30.00, 25.00, 20.00]

-i = 0
-for i in range(5):
-    if mass[i] > 5 and velocity[i] > 20:
+for m, v in zip(mass, velocity):
+    if m > 5 and v > 20:
         print("Fast heavy object. Duck!")
-    elif mass[i] > 2 and mass[i] <= 5 and velocity[i] <= 20:
+    elif m > 2 and m <= 5 and v <= 20:
         print("Normal traffic")
-    elif mass[i] <= 2 and velocity[i] <= 20:
+    elif m <= 2 and v <= 20:
         print("Slow light object. Ignore it")
     else:
         print("Whoa! Something is up with the data. Check it")
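The notes also suggest `()` for grouping subsets of conditions. A minimal sketch with hypothetical values (not part of the commit): without parentheses, `and` binds more tightly than `or`, so explicit grouping avoids surprises.

``` python
mass = 4.0
velocity = 30.0

# Parentheses make the intended grouping of `and`/`or` explicit
if (mass > 5 and velocity > 20) or (mass > 2 and velocity > 25):
    print("Fast object. Duck!")
```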
@@ -2168,32 +2172,74 @@ for i in range(5):
 - Use () to group subsets of conditions
 - Aside: For a more natural way of working with many lists, look at `zip()`

+### Use the modulus to print occasional status messages
+
+Conditionals are often used inside loops.
+
+``` python
+data_frames = []
+for count, filename in enumerate(glob.glob('data/gapminder_[!all]*.csv')):
+    # Print every other filename
+    if count % 2 == 0:
+        print(count, filename)
+    data = pd.read_csv(filename)
+    data_frames.append(data)
+
+all_data = pd.concat(data_frames)
+print(all_data.shape)
+```
+
+### **Challenge**: Process small files
+
+Iterate through all of the CSV files in the data directory. Print the file name and file length for any file that is less than 30 lines long.
+
+#### Solution
+
+``` python
+for filename in glob.glob('data/*.csv'):
+    # len(data) counts data rows; the header line is not included
+    data = pd.read_csv(filename)
+    if len(data) < 30:
+        print(filename, len(data))
+```
+
 ### (Optional) Use pathlib to write code that works across operating systems

 1. Pathlib provides cross-platform path objects

 ``` python
 from pathlib import Path

-relative_path = Path("data") # data subdirectory
-# relative_path = Path() # current directory
-print("Relative path:", relative_path)
-print("Absolute path:", relative_path.absolute())
+# Create Path objects
+raw_path = Path("data")
+processed_path = Path("data/processed")
+
+print("Relative path:", raw_path)
+print("Absolute path:", raw_path.absolute())
 ```

 2. The file objects have methods that provide much better information about files and directories.

 ``` python
 # Note the careful testing at each level of the code.
-if relative_path.exists():
-    for filename in relative_path.glob('gapminder_*.csv'):
+data_frames = []
+
+if raw_path.exists():
+    for filename in raw_path.glob('gapminder_[!all]*.csv'):
         if filename.is_file():
             data = pd.read_csv(filename)
             print(filename)
-            print(data.head(1))
+            data_frames.append(data)
+
+all_data = pd.concat(data_frames)
+
+# Check for destination folder and create if it doesn't exist
+if not processed_path.exists():
+    processed_path.mkdir()
+
+all_data.to_csv(processed_path.joinpath("combined_data.csv"))
 ```

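One caveat in the pathlib block above: a bare `mkdir()` raises if parent folders are missing or if the directory already exists. A sketch (not part of the commit, using a temporary directory so it runs anywhere) of the more defensive flags:

``` python
import tempfile
from pathlib import Path

# Work in a throwaway directory so the example is side-effect free
with tempfile.TemporaryDirectory() as tmp:
    processed_path = Path(tmp) / "data" / "processed"

    # parents=True creates missing intermediate folders;
    # exist_ok=True makes repeated calls a no-op instead of an error
    processed_path.mkdir(parents=True, exist_ok=True)
    processed_path.mkdir(parents=True, exist_ok=True)  # safe to repeat

    print(processed_path.exists())  # True
```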
-## Generic file handling
+## (Optional) Generic file handling

 Pandas understands specific file types, but what if you need to work with a generic file?
@@ -2235,7 +2281,7 @@ print(lines[0])
 lines[0]
 ```

-## Text processing
+## (Optional) Text processing and data cleanup

 ### Use string methods to determine which lines to keep
@@ -2400,26 +2446,20 @@ print_greeting()
 2. At the very end, with a final result
 2. Docstring provides function help. Use triple quotes if you need the docstring to span multiple lines.

-### **Challenge (option 1): Encapsulate text processing in a function**
+### **Challenge (text processing)**: Encapsulate text processing in a function

 Write a function that takes `line` as an input and returns the information required by `writer.writerow()`.

-### **Challenge (option 2): Encapsulate data processing in a function**
-
-Write a function that encapsulates the data normalization from the Pandas workshop into a function. The function should:
-
-1. Take a data frame as its input
-2. Calculate the mean Z score for each country
-3. Divide countries into "wealthy" and "non-wealthy" categories
-4. Add this information to the data frame as new columns
-5. Return the modified data frame
+### **Challenge (data normalization)**: Encapsulate Z score calculations in a function
+
+1. Write a function that encapsulates the Z-score calculations from the Pandas workshop. The function should return two Series:
+    1. The mean Z score for each country over time
+    2. A categorical variable that identifies countries as "wealthy" or "non-wealthy"
+2. Use the function to inspect one of the Gapminder continental datasets.

 #### Solution

 ``` python
-import pandas as pd
-import glob
-
 def norm_data(data):
-    """Add a Z score column to each data set."""
+    """Calculate mean Z scores and a wealthy/non-wealthy flag for each country."""
@@ -2432,17 +2472,29 @@ def norm_data(data):
     # Group countries into "wealthy" (z > 0) and "not wealthy" (z <= 0)
     z_bool = mean_z > 0

-    # Append to DataFrame
-    data["mean_z"] = mean_z
-    data["wealthy"] = z_bool
+    return mean_z, z_bool
+
+data = pd.read_csv("data/gapminder_gdp_europe.csv", index_col = "country")
+mean_z, z_bool = norm_data(data)

+# If you need to drop the continent column
+# mean_z, z_bool = norm_data(data.drop("continent", axis=1))
+```
+
+#### (Optional) Use the function to process all files
+
+``` python
 for filename in glob.glob('data/gapminder_*.csv'):
     # Print a status message
     print("Current file:", filename)

     # Read the data into a DataFrame and modify it
-    data = pd.read_csv(filename)
-    norm_data(data)
+    data = pd.read_csv(filename, index_col = "country")
+    mean_z, z_bool = norm_data(data)
+
+    # Append to DataFrame
+    data["mean_z"] = mean_z
+    data["wealthy"] = z_bool

     # Generate an output file name
     parts = filename.split(".csv")
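The challenge asks for a categorical wealthy/non-wealthy variable, while `z_bool` above is a plain boolean mask. A sketch of making the labels explicit (hypothetical Z scores; the `np.where` approach is not from the commit):

``` python
import numpy as np
import pandas as pd

# Hypothetical mean Z scores for three countries
mean_z = pd.Series([0.8, -0.2, 1.5], index=["Italy", "France", "Germany"])

# Map the boolean mask onto explicit category labels
wealthy = pd.Series(np.where(mean_z > 0, "wealthy", "non-wealthy"),
                    index=mean_z.index, dtype="category")
print(wealthy["France"])  # non-wealthy
```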
