Commit d86f86a

Updating instructions
1 parent 3f7fc48 commit d86f86a

5 files changed: +88 -80 lines

README.md (+66 -79)

@@ -203,6 +203,10 @@ The pipeline is defined in the ``main.py`` file in the root of the starter kit.
contains some boilerplate code as well as the download step. Your task will be to develop the
needed additional steps, and then add them to the ``main.py`` file.

+__*NOTE*__: the modeling in this exercise should be considered a baseline. We kept the data cleaning and the modeling
+simple because we want to focus on the MLOps aspects of the analysis. With a little more effort it is possible to get
+a significantly better model for this dataset.
+
### Exploratory Data Analysis (EDA)
The scope of this section is to get an idea of how the process of an EDA works in the context of
pipelines, during the data exploration phase. In a real scenario you would spend a lot more time
@@ -215,78 +219,69 @@ notebook can be understood by other people like your colleagues
get a sample of the data. The pipeline will also upload it to Weights & Biases:

```bash
-> mlflow run .
+> mlflow run . -P steps=download
```

You will see a message similar to:

```
2021-03-12 15:44:39,840 Uploading sample.csv to Weights & Biases
```
-This tells you that the data have been uploaded to W&B as the artifact named ``sample.csv``.
+This tells you that the data is going to be stored in W&B as the artifact named ``sample.csv``.

-2. Go in the ``notebook`` directory and start Jupyter, and then create a notebook called ``EDA``.
+2. Now execute the `eda` step:
+```bash
+> mlflow run src/eda
+```
+This will install Jupyter and all the dependencies for `pandas-profiling`, and open a Jupyter notebook instance.
+Click on New -> Python 3 and create a new notebook. Rename it `EDA` by clicking on `Untitled` at the top, beside the
+Jupyter logo.
3. Within the notebook, fetch the artifact we just created (``sample.csv``) from W&B and read
it with pandas:

```python
import wandb
import pandas as pd

-wandb.init(project="nyc_airbnb", group="eda", save_code=True)
+run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)
```
Note the use of ``save_code=True`` in the call to ``wandb.init``. This makes sure that the
code of the notebook is uploaded to W&B for versioning, so that each run of the notebook
will be tied to the specific version of the code that ran in that step.

-4. Print a summary (df.info()) and note the null values. Also, note how there are clearly
-outliers in the ``price`` column:
-
+4. Using `pandas-profiling`, create a profile:
```python
-df['price'].describe(percentiles=[0.01, 0.05, 0.50, 0.95, 0.99])
-count    20000.000000
-mean       153.269050
-std        243.325609
-min          0.000000
-1%          30.000000
-5%          40.000000
-50%        105.000000
-95%        350.000000
-99%        800.000000
-max      10000.000000
-Name: price, dtype: float64
+import pandas_profiling
+
+profile = pandas_profiling.ProfileReport(df)
+profile.to_widgets()
```
-After talking to your stakeholders, you decide to consider from a minimum of $ 10 to a maximum
-of $ 350 per night.
+What do you notice? Look around and see what you can find.

-5. Fix the little problems we have found in the data with the following code:
-
-```python
-# Drop outliers
-min_price = 10
-max_price = 350
-idx = df['price'].between(min_price, max_price)
-df = df[idx]
+For example, there are missing values in a few columns and the column `last_review` is a
+date but it is in string format. Look also at the `price` column, and note the outliers. There are some zeros and
+some very high prices. After talking to your stakeholders, you decide to consider from a minimum of $ 10 to a
+maximum of $ 350 per night.

-# Convert last_review to datetime
-df['last_review'] = pd.to_datetime(df['last_review'])
-
-# Fill the null dates with an old date
-df['last_review'].fillna(pd.to_datetime("2010-01-01"), inplace=True)
-
-# If the reviews_per_month is nan it means that there is no review
-df['reviews_per_month'].fillna(0, inplace=True)
-
-# We can fill the names with a short string.
-# DO NOT use empty strings here
-df['name'].fillna('-', inplace=True)
-df['host_name'].fillna('-', inplace=True)
-```
-6. Check with ``df.info()`` that all obvious problems have been solved
-7. Save and close the notebook, shut down Jupyter, then go back to the root directory.
-8. Commit the code to GitHub
+5. Fix some of the little problems we have found in the data with the following code:
+
+```python
+# Drop outliers
+min_price = 10
+max_price = 350
+idx = df['price'].between(min_price, max_price)
+df = df[idx].copy()
+# Convert last_review to datetime
+df['last_review'] = pd.to_datetime(df['last_review'])
+```
+Note how we did not impute missing values. We will do that in the inference pipeline, so we will be able to handle
+missing values also in production.
+6. Create a new profile or check with ``df.info()`` that all obvious problems have been solved
+7. Terminate the run by running `run.finish()`
+8. Save the notebook, then close it (File -> Close and Halt). In the main Jupyter notebook page, click Quit in the
+upper right corner to stop Jupyter. This will also terminate the mlflow run. DO NOT USE CTRL-C.
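
Step 5 above deliberately leaves the missing values in place so that they can be handled later, inside the inference pipeline. As an illustration only (this commit does not show the inference pipeline, and the use of scikit-learn below is an assumption), deferring imputation to the model pipeline could look roughly like this:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Illustrative sketch: because the imputer lives inside the pipeline, the same
# missing-value handling runs at training time and again in production at
# inference time, so rows with missing values do not need special treatment upstream.
model = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    RandomForestRegressor(random_state=42),
)
```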

## Data cleaning

@@ -296,7 +291,7 @@ with the cleaned data:

1. Make sure you are in the root directory of the starter kit, then create a stub
for the new step. The new step should accept the parameters ``input_artifact``
-(the input artifact), ``output_name`` (the name for the output artifact),
+(the input artifact), ``output_artifact`` (the name for the output artifact),
``output_type`` (the type for the output artifact), ``output_description``
(a description for the output artifact), ``min_price`` (the minimum price to consider)
and ``max_price`` (the maximum price to consider):
@@ -308,7 +303,7 @@ with the cleaned data:
job_type [my_step]: basic_cleaning
short_description [My step]: A very basic data cleaning
long_description [An example of a step using MLflow and Weights & Biases]: Download from W&B the raw dataset and apply some basic data cleaning, exporting the result to a new artifact
-arguments [default]: input_artifact,output_name,output_type,output_description,min_price,max_price
+parameters [parameter1,parameter2]: input_artifact,output_artifact,output_type,output_description,min_price,max_price
```
This will create a directory ``src/basic_cleaning`` containing the basic files required
for an MLflow step: ``conda.yml``, ``MLproject`` and the script (which we named ``run.py``).
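
Each parameter listed in the prompt above becomes a command-line argument of the generated ``run.py``. A minimal sketch of what the filled-in argument parsing could look like (the exact template output and help texts will differ; ``min_price`` and ``max_price`` are parsed as ``float``, the rest as ``str``):

```python
import argparse

# Hypothetical illustration of the filled-in parameter stubs in src/basic_cleaning/run.py
parser = argparse.ArgumentParser(description="A very basic data cleaning")

parser.add_argument("--input_artifact", type=str, help="Input artifact in W&B", required=True)
parser.add_argument("--output_artifact", type=str, help="Name for the output artifact", required=True)
parser.add_argument("--output_type", type=str, help="Type for the output artifact", required=True)
parser.add_argument("--output_description", type=str, help="Description for the output artifact", required=True)
parser.add_argument("--min_price", type=float, help="Minimum price to consider", required=True)
parser.add_argument("--max_price", type=float, help="Maximum price to consider", required=True)

args = parser.parse_args()
```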
@@ -318,48 +313,39 @@ with the cleaned data:
comments like ``INSERT TYPE HERE`` and ``INSERT DESCRIPTION HERE``). All parameters should be
of type ``str`` except ``min_price`` and ``max_price``, which should be ``float``.

-3. Implement in the section marked ```# YOUR CODE HERE #``` the data cleaning steps we
-have implemented in the notebook. Remember to use the ``logger`` instance already provided
-to print meaningful messages to screen. For example, let's start by downloading the input
-artifact and read it with pandas:
+3. Implement in the section marked ```# YOUR CODE HERE #``` the steps we
+have implemented in the notebook, including downloading the data from W&B.
+Remember to use the ``logger`` instance already provided to print meaningful messages to screen.

-```python
-logger.info(f"Fetching {args.input_artifact} from W&B...")
-artifact_local_path = run.use_artifact(args.input_artifact).file()
-
-logger.info("Reading with pandas")
-df = pd.read_csv(artifact_local_path)
-```
-
-**_REMEMBER_**: Whenever you are using a library (like pandas here), you MUST add it as a
-dependency in the ``conda.yml`` file. For example, here we are using pandas
-so we must add it to the ``conda.yml`` file, including a version:
-```yaml
-dependencies:
-  - pip=20.3.3
-  - pandas=1.2.3
-  - pip:
-    - wandb==0.10.21
-```
-
-Then implement the cleaning code we have used in the notebook, making sure to use
-``args.min_price`` and ``args.max_price`` when dropping the outliers
+Make sure to use ``args.min_price`` and ``args.max_price`` when dropping the outliers
(instead of hard-coding the values like we did in the notebook).
Save the results to a CSV file called ``clean_sample.csv``
-(``df.to_csv("clean_sample.csv", index=False)``) then upload it to W&B using:
+(``df.to_csv("clean_sample.csv", index=False)``).
+**_NOTE_**: Remember to use ``index=False`` when saving to CSV, otherwise the data checks in
+the next step might fail because there will be an extra ``index`` column.
+
+Then upload it to W&B using:

```python
artifact = wandb.Artifact(
-    args.output_name,
+    args.output_artifact,
    type=args.output_type,
    description=args.output_description,
)
artifact.add_file("clean_sample.csv")
run.log_artifact(artifact)
```

-**_NOTE_**: Remember to use ``index=False`` when saving to CSV, otherwise the data checks in
-the next step will fail!
+**_REMEMBER_**: Whenever you are using a library (like pandas), you MUST add it as a
+dependency in the ``conda.yml`` file. For example, here we are using pandas
+so we must add it to the ``conda.yml`` file, including a version:
+```yaml
+dependencies:
+  - pip=20.3.3
+  - pandas=1.2.3
+  - pip:
+    - wandb==0.10.31
+```

4. Add the ``basic_cleaning`` step to the pipeline (the ``main.py`` file):

@@ -382,15 +368,16 @@ with the cleaned data:
    "main",
    parameters={
        "input_artifact": "sample.csv:latest",
-        "output_name": "clean_sample.csv",
+        "output_artifact": "clean_sample.csv",
        "output_type": "clean_sample",
        "output_description": "Data with outliers and null values removed",
        "min_price": config['etl']['min_price'],
        "max_price": config['etl']['max_price']
    },
)
```
-5. Run the pipeline
+5. Run the pipeline. If you go to W&B, you will see the new artifact type `clean_sample` and within it the
+`clean_sample.csv` artifact.
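
Running the full pipeline from the root of the starter kit is the same ``mlflow run .`` used earlier. If you only want to execute the new step, and assuming ``basic_cleaning`` is a value accepted by the ``steps`` parameter in ``main.py`` (this commit does not show that part of ``main.py``), the call would look like:

```bash
> mlflow run . -P steps=basic_cleaning
```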

### Data testing
After the cleaning, it is a good practice to put some tests that verify that the data does not

cookie-mlflow-step/{{cookiecutter.step_name}}/MLproject (+2 -1)

@@ -10,4 +10,5 @@ entry_points:
type: string
{% endfor %}

-command: "python {{cookiecutter.script_name}} {% for n in cookiecutter.parameters.split(",") %} --{{n}} {{"{"}}{{n}}{{"}"}} {% endfor %}"
+command: >-
+  python {{cookiecutter.script_name}} {% for n in cookiecutter.parameters.split(",") %} --{{n}} {{"{"}}{{n}}{{"}"}} {% endfor %}
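
The ``>-`` introduced above is a YAML folded block scalar: the indented lines that follow are joined with single spaces and the trailing newline is stripped, so the multi-line form parses to the same single command string as the quoted one-liner it replaces. A small standalone example (``run.py`` and the flag below are placeholder values, not from this template):

```yaml
# The folded scalar below parses to exactly:
# "python run.py --input_artifact {input_artifact}"
command: >-
  python run.py
  --input_artifact {input_artifact}
```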

notebooks/README.md (+1)

@@ -0,0 +1 @@
+Save your EDA here

src/eda/MLproject (+6)

@@ -0,0 +1,6 @@
+name: eda
+conda_env: conda.yml
+
+entry_points:
+  main:
+    command: jupyter notebook

src/eda/conda.yml (+13)

@@ -0,0 +1,13 @@
+name: eda
+channels:
+  - conda-forge
+  - defaults
+dependencies:
+  - jupyterlab=3.0.12
+  - seaborn=0.11.1
+  - pandas=1.2.3
+  - pip=20.3.3
+  - pandas-profiling=2.11.0
+  - pyarrow=2.0
+  - pip:
+    - wandb==0.10.31
