Commit 8de68fe

Small amends on the masked arrays tutorial (numpy#115)
* tutorial-ma: Update data source
* tutorial-ma: Update date columns range
* tutorial-ma: Update number of rows to skip
* tutorial-ma: Ignore row with totals
* tutorial-ma: Spell-check
1 parent e0a1237 commit 8de68fe

2 files changed: +283 -124 lines changed

content/tutorial-ma.md

+13 -9
@@ -38,7 +38,7 @@ Use the masked arrays module from NumPy to analyze COVID-19 data and deal with m
 
 ## What are masked arrays?
 
-Consider the following problem. You have a dataset with missing or invalid entries. If you're doing any kind of processing on this data, and want to *skip* or flag these unwanted entries without just deleting them, you may have to use conditionals or filter your data somehow. The [numpy.ma](https://numpy.org/devdocs/reference/maskedarray.generic.html#module-numpy.ma) module provides some of the same funcionality of [NumPy ndarrays](https://numpy.org/devdocs/reference/generated/numpy.ndarray.html#numpy.ndarray) with added structure to ensure invalid entries are not used in computation.
+Consider the following problem. You have a dataset with missing or invalid entries. If you're doing any kind of processing on this data, and want to *skip* or flag these unwanted entries without just deleting them, you may have to use conditionals or filter your data somehow. The [numpy.ma](https://numpy.org/devdocs/reference/maskedarray.generic.html#module-numpy.ma) module provides some of the same functionality of [NumPy ndarrays](https://numpy.org/devdocs/reference/generated/numpy.ndarray.html#numpy.ndarray) with added structure to ensure invalid entries are not used in computation.
 
 From the [Reference Guide](https://numpy.org/devdocs/reference/maskedarray.generic.html#module-numpy.ma):
 
@@ -83,28 +83,28 @@ The data file contains data of different types and is organized as follows:
 - The second through seventh row contain summary data that is of a different type than that which we are going to examine, so we will need to exclude that from the data with which we will work.
 - The numerical data we wish to work with begins at column 4, row 8, and extends from there to the rightmost column and the lowermost row.
 
-Let's explore the data inside this file for the first 14 days of records. To gather data from the `.csv` file, we will use the [numpy.genfromtxt](https://numpy.org/devdocs/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt) function, making sure we select only the columns with actual numbers instead of the first three columns which contain location data. We also skip the first 7
+Let's explore the data inside this file for the first 14 days of records. To gather data from the `.csv` file, we will use the [numpy.genfromtxt](https://numpy.org/devdocs/reference/generated/numpy.genfromtxt.html#numpy.genfromtxt) function, making sure we select only the columns with actual numbers instead of the first four columns which contain location data. We also skip the first 6
 rows of this file, since they contain other data we are not interested in. Separately, we will extract the information about dates and location for this data.
 
 ```{code-cell}
 # Note we are using skip_header and usecols to read only portions of the
 # data file into each variable.
-# Read just the dates for columns 3-7 from the first row
+# Read just the dates for columns 4-18 from the first row
 dates = np.genfromtxt(
     filename,
     dtype=np.unicode_,
     delimiter=",",
     max_rows=1,
-    usecols=range(3, 17),
+    usecols=range(4, 18),
     encoding="utf-8-sig",
 )
 # Read the names of the geographic locations from the first two
-# columns, skipping the first seven rows
+# columns, skipping the first six rows
 locations = np.genfromtxt(
     filename,
     dtype=np.unicode_,
     delimiter=",",
-    skip_header=7,
+    skip_header=6,
     usecols=(0, 1),
     encoding="utf-8-sig",
 )
@@ -113,8 +113,8 @@ nbcases = np.genfromtxt(
     filename,
     dtype=np.int_,
     delimiter=",",
-    skip_header=7,
-    usecols=range(3, 17),
+    skip_header=6,
+    usecols=range(4, 18),
     encoding="utf-8-sig",
 )
 ```
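
Note on the hunks above: the combined effect of `skip_header`, `max_rows`, and `usecols` can be checked on a small in-memory table. The sketch below is only an illustration under stated assumptions: the CSV text, the two summary rows, and the column layout are invented and much smaller than the real data file, and `dtype=str` is used in place of `np.unicode_`, an alias that was removed in NumPy 2.0.

```python
from io import StringIO

import numpy as np

# Hypothetical miniature of the data file: one header row, two summary
# rows to skip, then the per-location rows. Columns 0-3 hold location
# data and columns 4-6 hold the daily counts.
csv_text = (
    "Province,Country,Lat,Long,1/21/20,1/22/20,1/23/20\n"
    "summary,,,,0,0,0\n"
    "summary,,,,0,0,0\n"
    "Hubei,China,30.9,112.2,270,444,444\n"
    "Guangdong,China,23.3,113.4,17,26,32\n"
    ",Japan,36.0,138.0,1,1,2\n"
)

# Dates: first row only, count columns only.
dates = np.genfromtxt(
    StringIO(csv_text), dtype=str, delimiter=",", max_rows=1, usecols=range(4, 7)
)

# Locations: skip the header and summary rows, keep the first two columns.
locations = np.genfromtxt(
    StringIO(csv_text), dtype=str, delimiter=",", skip_header=3, usecols=(0, 1)
)

# Counts: same rows as the locations, count columns only.
nbcases = np.genfromtxt(
    StringIO(csv_text), dtype=np.int_, delimiter=",", skip_header=3, usecols=range(4, 7)
)

print(dates)
print(nbcases)
```

Because `locations` and `nbcases` use the same `skip_header` value, row `i` of one array describes row `i` of the other, which is why the commit changes both values together.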
@@ -136,9 +136,13 @@ plt.xticks(selected_dates, dates[selected_dates])
 plt.title("COVID-19 cumulative cases from Jan 21 to Feb 3 2020")
 ```
 
-The graph has a strange shape from January 24th to February 1st. It would be interesing to know where this data comes from. If we look at the `locations` array we extracted from the `.csv` file, we can see that we have two columns, where the first would contain regions and the second would contain the name of the country. However, only the first few rows contain data for the the first column (province names in China). Following that, we only have country names. So it would make sense to group all the data from China into a single row. For this, we'll select from the `nbcases` array only the rows for which the second entry of the `locations` array corresponds to China. Next, we'll use the [numpy.sum](https://numpy.org/devdocs/reference/generated/numpy.sum.html#numpy.sum) function to sum all the selected rows (`axis=0`):
+The graph has a strange shape from January 24th to February 1st. It would be interesting to know where this data comes from. If we look at the `locations` array we extracted from the `.csv` file, we can see that we have two columns, where the first would contain regions and the second would contain the name of the country. However, only the first few rows contain data for the the first column (province names in China). Following that, we only have country names. So it would make sense to group all the data from China into a single row. For this, we'll select from the `nbcases` array only the rows for which the second entry of the `locations` array corresponds to China. Next, we'll use the [numpy.sum](https://numpy.org/devdocs/reference/generated/numpy.sum.html#numpy.sum) function to sum all the selected rows (`axis=0`). Note also that row 35 corresponds to the total counts for the whole country for each date. Since we want to calculate the sum ourselves from the provinces data, we have to remove that row first from both `locations` and `nbcases`:
 
 ```{code-cell}
+totals_row = 35
+locations = np.delete(locations, (totals_row), axis=0)
+nbcases = np.delete(nbcases, (totals_row), axis=0)
+
 china_total = nbcases[locations[:, 1] == "China"].sum(axis=0)
 china_total
 ```
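
Note on the hunk above: removing the totals row before summing matters because that row also matches the `"China"` filter, so leaving it in would count every case twice. A minimal sketch with invented data follows; the array contents and the row index `2` are assumptions for illustration, not values from the real file.

```python
import numpy as np

# Hypothetical stand-ins for the arrays read from the data file.
locations = np.array(
    [
        ["Hubei", "China"],
        ["Guangdong", "China"],
        ["", "China"],  # country-wide totals row (assumed position)
        ["", "Japan"],
    ]
)
nbcases = np.array(
    [
        [270, 444],
        [17, 26],
        [287, 470],  # equals the sum of the two province rows above
        [1, 1],
    ]
)

# Drop the totals row from both arrays so their rows stay aligned.
totals_row = 2
locations = np.delete(locations, totals_row, axis=0)
nbcases = np.delete(nbcases, totals_row, axis=0)

# Boolean mask on the country column, then sum over rows (axis=0).
china_total = nbcases[locations[:, 1] == "China"].sum(axis=0)
print(china_total)
```

`np.delete` returns a new array rather than modifying its input, which is why the commit reassigns both `locations` and `nbcases`.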
