 - <a href="#carpentries-version-group-by-split-apply-combine" id="toc-carpentries-version-group-by-split-apply-combine"><span class="toc-section-number">2.20</span> (Carpentries version) Group By: split-apply-combine</a>
 - <a href="#building-programs-week-3" id="toc-building-programs-week-3"><span class="toc-section-number">3</span> Building Programs (Week 3)</a>
 - <a href="#notebooks-vs-python-scripts" id="toc-notebooks-vs-python-scripts"><span class="toc-section-number">3.1</span> Notebooks vs Python scripts</a>
-- <a href="#python-from-the-terminal" id="toc-python-from-the-terminal"><span class="toc-section-number">3.2</span> Python from the terminal</a>
+- <a href="#optional-python-from-the-terminal" id="toc-optional-python-from-the-terminal"><span class="toc-section-number">3.2</span> (Optional) Python from the terminal</a>
 - <a href="#looping-over-data-sets" id="toc-looping-over-data-sets"><span class="toc-section-number">3.3</span> Looping Over Data Sets</a>
 - <a href="#optional-text-processing-and-data-cleanup" id="toc-optional-text-processing-and-data-cleanup"><span class="toc-section-number">3.6</span> (Optional) Text processing and data cleanup</a>
+# Fit and summarize OLS model (center data to get accurate model fit
+mod = sm.OLS(y - y.mean(), x_train - x_train.mean())
+res = mod.fit()
+
+print(res.summary())
+```
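`sm.OLS` does not add an intercept term unless one is supplied (for example with `sm.add_constant`), which is presumably why the snippet above centers both variables before fitting. The sketch below illustrates the effect in plain Python, without statsmodels. The data are made up and `slope_through_origin` is a hypothetical helper, not part of the lesson.

```python
# Made-up data that follow y = 2*x + 10 exactly
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi + 10.0 for xi in x]


def slope_through_origin(xs, ys):
    """Least-squares slope for the model y = b*x (no intercept term)."""
    return sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)


# Center both variables by subtracting their means
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)
x_centered = [xi - x_mean for xi in x]
y_centered = [yi - y_mean for yi in y]

# Without an intercept, the raw fit is distorted by the offset of 10;
# the centered fit recovers the true slope.
raw_slope = slope_through_origin(x, y)
centered_slope = slope_through_origin(x_centered, y_centered)
print("raw:", raw_slope, "centered:", centered_slope)
```

Centering is one option; adding an explicit constant column to the predictors is another, and it also reports the intercept.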
+
+### (Optional) Statsmodels regression example with applied data
 
 1. Import data
 
@@ -1946,7 +1968,7 @@ Broadly, a trade-off between managing big code bases and making it easy to exper
 3. Version control
 4. Remote scripts
 
-## Python from the terminal
+## (Optional) Python from the terminal
 
 1. Python is an interactive interpreter (REPL)
 
@@ -2114,24 +2136,7 @@ else:
 - Always associated with an `if`.
 - Must come before the `else` (which is the “catch all”).
 
-### (Optional) Conditionals are often used inside loops
-
-Not much point using a conditional when we know the value (as above), but useful when we have a collection to process.
-
-``` python
-masses = [3.54, 2.07, 9.22, 1.86, 1.71]
-for m in masses:
-    if m > 9.0:
-        print(m, 'is HUGE')
-    elif m > 3.0:
-        print(m, 'is large')
-    else:
-        print(m, 'is small')
-```
-
-### <span class="todo TODO">TODO</span> Use enumeration to print occasional status messages for long-running processes
-
-### (Optional) Conditions are tested once, in order
+### Conditions are tested once, in order
 
 Python steps through the branches of the conditional in order, testing each in turn. Order matters! The following is wrong:
 
@@ -2145,21 +2150,20 @@ elif grade >= 90:
     print('grade is A')
 ```
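For contrast with the wrong ordering above, a corrected version tests the strictest condition first, so each grade falls into the first branch it truly belongs to. This is a minimal sketch: `letter_grade` is a hypothetical helper, and the thresholds for B and C are assumptions (only the `>= 90` test for A appears in the diff).

```python
def letter_grade(grade):
    """Return a letter for a numeric grade, testing the strictest condition first."""
    if grade >= 90:
        return 'A'
    elif grade >= 80:  # assumed threshold for B
        return 'B'
    elif grade >= 70:  # assumed threshold for C
        return 'C'
    else:
        return 'below C'


# Each grade stops at the first branch whose condition is true
for g in [95, 85, 72, 40]:
    print(g, 'is', letter_grade(g))
```

If the `>= 70` test came first, every grade of 70 or more would stop there and the later branches would never run.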
 
-### (Optional) Compound Relations Using `and`, `or`, and Parentheses
+### Compound Relations Using `and`, `or`, and Parentheses
 
 Often, you want some combination of things to be true. You can combine relations within a conditional using `and` and `or`. Continuing the example above, suppose you have:
 
 ``` python
 mass = [ 3.54, 2.07, 9.22, 1.86, 1.71]
 velocity = [10.00, 20.00, 30.00, 25.00, 20.00]
 
-i = 0
-for i in range(5):
-    if mass[i] > 5 and velocity[i] > 20:
+for m, v in zip(mass, velocity):
+    if m > 5 and v > 20:
         print("Fast heavy object. Duck!")
-    elif mass[i] > 2 and mass[i] <= 5 and velocity[i] <= 20:
+    elif m > 2 and m <= 5 and v <= 20:
         print("Normal traffic")
-    elif mass[i] <= 2 and velocity[i] <= 20:
+    elif m <= 2 and v <= 20:
         print("Slow light object. Ignore it")
     else:
         print("Whoa! Something is up with the data. Check it")
@@ -2168,32 +2172,74 @@ for i in range(5):
 - Use () to group subsets of conditions
 - Aside: For a more natural way of working with many lists, look at `zip()`
 
+### Use the modulus to print occasional status messages
+
+Conditionals are often used inside loops.
+
+``` python
+data_frames = []
+for count, filename in enumerate(glob.glob('data/gapminder_[!all]*.csv')):
+    # Print every other filename
+    if count % 2 == 0:
+        print(count, filename)
+    data = pd.read_csv(filename)
+    data_frames.append(data)
+
+all_data = pd.concat(data_frames)
+print(all_data.shape)
+```
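The same pattern can be tried without the Gapminder files on disk. The sketch below is a stand-in under stated assumptions: the file names are made up, `fnmatch.filter` takes the place of `glob.glob` so that no directory is needed, and `report_every` is a hypothetical name. It also shows that `[!all]` is a character class rather than the word "all": it rejects any name whose first character after `gapminder_` is `a` or `l`.

```python
import fnmatch

# Hypothetical file names standing in for the contents of data/
filenames = [
    "gapminder_all.csv",
    "gapminder_gdp_africa.csv",
    "gapminder_gdp_americas.csv",
    "gapminder_gdp_asia.csv",
    "gapminder_gdp_europe.csv",
    "gapminder_gdp_oceania.csv",
    "gapminder_lifeexp.csv",
]

# [!all] matches one character that is neither 'a' nor 'l', so both
# gapminder_all.csv and gapminder_lifeexp.csv are filtered out.
selected = fnmatch.filter(filenames, "gapminder_[!all]*.csv")

report_every = 2  # hypothetical setting: print a status line every 2nd file
reported = []
for count, filename in enumerate(selected):
    # count % report_every is 0 for count = 0, 2, 4, ...
    if count % report_every == 0:
        print(count, filename)
        reported.append(filename)
```

For a long-running job, a larger `report_every` (say 100) keeps the output readable while still showing progress.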
+
+### **Challenge**: Process small files
+
+Iterate through all of the CSV files in the data directory. Print the file name and file length for any file that is less than 30 lines long.
+
+#### Solution
+
+``` python
+for filename in glob.glob('data/*.csv'):
+    data = pd.read_csv(filename)
+    if len(data) < 30:
+        print(filename, len(data))
+```
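One caveat about the solution above: with the default settings of `pd.read_csv`, `len(data)` counts data rows and leaves out the header row, so it is one less than the number of lines in a typical CSV file. If the literal line count matters, the lines can be counted directly. The sketch below is an illustration only: `count_lines` is a hypothetical helper, and the CSV file is a made-up temporary file.

```python
import tempfile
from pathlib import Path


def count_lines(path):
    """Count the lines in a text file, header included."""
    with open(path) as handle:
        return sum(1 for _ in handle)


with tempfile.TemporaryDirectory() as tmp:
    # Made-up file: one header line plus two data rows
    small = Path(tmp) / "small.csv"
    small.write_text("country,gdp\nA,1\nB,2\n")

    n_lines = count_lines(small)
    if n_lines < 30:
        print(small.name, n_lines)
```

For the purposes of the challenge either measure works, as long as the threshold is applied consistently.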
+
 ### (Optional) Use pathlib to write code that works across operating systems
 
 1. Pathlib provides cross-platform path objects
 
 ``` python
 from pathlib import Path
 
-relative_path = Path("data") # data subdirectory
-# relative_path = Path() # current directory
-print("Relative path:", relative_path)
-print("Absolute path:", relative_path.absolute())
+# Create Path objects
+raw_path = Path("data")
+processed_path = Path("data/processed")
+
+print("Relative path:", raw_path)
+print("Absolute path:", raw_path.absolute())
 ```
 
 2. The file objects have methods that provide much better information about files and directories.
 
 ``` python
 #Note the careful testing at each level of the code.
+data_frames = []
+
 if relative_path.exists():
-    for filename in relative_path.glob('gapminder_*.csv'):
+    for filename in raw_path.glob('gapminder_[!all]*.csv'):
         if filename.is_file():
             data = pd.read_csv(filename)
             print(filename)
-            print(data.head(1))
+            data_frames.append(data)
+
+all_data = pd.concat(data_frames)
+
+# Check for destination folder and create if it doesn't exist