@@ -203,6 +203,10 @@ The pipeline is defined in the ``main.py`` file in the root of the starter kit.
contains some boilerplate code as well as the download step. Your task will be to develop the
needed additional steps, and then add them to the ``main.py`` file.

+ __*NOTE*__: the modeling in this exercise should be considered a baseline. We kept the data cleaning and the modeling
+ simple because we want to focus on the MLops aspect of the analysis. It is possible with a little more effort to get
+ a significantly better model for this dataset.
+
### Exploratory Data Analysis (EDA)
The scope of this section is to get an idea of how the process of an EDA works in the context of
pipelines, during the data exploration phase. In a real scenario you would spend a lot more time
@@ -215,78 +219,69 @@ notebook can be understood by other people like your colleagues
get a sample of the data. The pipeline will also upload it to Weights & Biases:

```bash
- > mlflow run .
+ > mlflow run . -P steps=download
```

You will see a message similar to:

```
2021-03-12 15:44:39,840 Uploading sample.csv to Weights & Biases
```
- This tells you that the data have been uploaded to W&B as the artifact named ``sample.csv``.
+ This tells you that the data is going to be stored in W&B as the artifact named ``sample.csv``.

- 2. Go in the ``notebook`` directory and start Jupyter, and then create a notebook called ``EDA``.
+ 2. Now execute the `eda` step:
+ ```bash
+ > mlflow run src/eda
+ ```
+ This will install Jupyter and all the dependencies for `pandas-profiling`, and open a Jupyter notebook instance.
+ Click on New -> Python 3 and create a new notebook. Rename it `EDA` by clicking on `Untitled` at the top, beside the
+ Jupyter logo.
3. Within the notebook, fetch the artifact we just created (``sample.csv``) from W&B and read
it with pandas:

```python
import wandb
import pandas as pd

- wandb.init(project="nyc_airbnb", group="eda", save_code=True)
+ run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)
```
Note the use of ``save_code=True`` in the call to ``wandb.init``. This makes sure that the
code of the notebook is uploaded to W&B for versioning, so that each run of the notebook
will be tied to the specific version of the code that ran in that step.

- 4. Print a summary (df.info()) and note the null values. Also, note how there are clearly
- outliers in the ``price`` column:
-
+ 4. Using `pandas-profiling`, create a profile:
```python
- df['price'].describe(percentiles=[0.01, 0.05, 0.50, 0.95, 0.99])
- count    20000.000000
- mean       153.269050
- std        243.325609
- min          0.000000
- 1%          30.000000
- 5%          40.000000
- 50%        105.000000
- 95%        350.000000
- 99%        800.000000
- max      10000.000000
- Name: price, dtype: float64
+ import pandas_profiling
+
+ profile = pandas_profiling.ProfileReport(df)
+ profile.to_widgets()
```
- After talking to your stakeholders, you decide to consider from a minimum of $10 to a maximum
- of $350 per night.
+ What do you notice? Look around and see what you can find.

- 5. Fix the little problems we have found in the data with the following code:
-
- ```python
- # Drop outliers
- min_price = 10
- max_price = 350
- idx = df['price'].between(min_price, max_price)
- df = df[idx]
+ For example, there are missing values in a few columns and the column `last_review` is a
+ date but it is in string format. Look also at the `price` column, and note the outliers. There are some zeros and
+ some very high prices. After talking to your stakeholders, you decide to consider from a minimum of $10 to a
+ maximum of $350 per night.

- # Convert last_review to datetime
- df['last_review'] = pd.to_datetime(df['last_review'])
-
- # Fill the null dates with an old date
- df['last_review'].fillna(pd.to_datetime("2010-01-01"), inplace=True)
-
- # If the reviews_per_month is nan it means that there is no review
- df['reviews_per_month'].fillna(0, inplace=True)
-
- # We can fill the names with a short string.
- # DO NOT use empty strings here
- df['name'].fillna('-', inplace=True)
- df['host_name'].fillna('-', inplace=True)
- ```
- 6. Check with ``df.info()`` that all obvious problems have been solved
- 7. Save and close the notebook, shutdown Jupyter, then go back to the root directory.
- 8. Commit the code to github
+ 5. Fix some of the little problems we have found in the data with the following code:
+
+ ```python
+ # Drop outliers
+ min_price = 10
+ max_price = 350
+ idx = df['price'].between(min_price, max_price)
+ df = df[idx].copy()
+ # Convert last_review to datetime
+ df['last_review'] = pd.to_datetime(df['last_review'])
+ ```
+ Note how we did not impute missing values. We will do that in the inference pipeline, so we will be able to handle
+ missing values in production as well (see the sketch after this list).
+ 6. Create a new profile or check with ``df.info()`` that all obvious problems have been solved.
+ 7. Terminate the run by running `run.finish()`.
+ 8. Save the notebook, then close it (File -> Close and Halt). In the main Jupyter notebook page, click Quit in the
+ upper right to stop Jupyter. This will also terminate the mlflow run. DO NOT USE CTRL-C.
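+ To make the note in step 5 concrete, here is a minimal sketch (an illustration, not part of the starter kit) of how
+ imputation can become a step of the inference pipeline, assuming scikit-learn is used for it later in the project.
+ The fill value for `reviews_per_month` comes from the notebook code above; the feature list and the model are placeholders:
+
+ ```python
+ from sklearn.ensemble import RandomForestRegressor
+ from sklearn.impute import SimpleImputer
+ from sklearn.pipeline import Pipeline
+
+ # Hypothetical feature list; in practice you would use more columns
+ features = ["reviews_per_month"]
+
+ # If reviews_per_month is NaN it means there are no reviews, so fill with 0.
+ # Because the imputer is part of the pipeline, the same rule is applied at
+ # training time and on new data in production.
+ inference_pipeline = Pipeline(
+     steps=[
+         ("impute", SimpleImputer(strategy="constant", fill_value=0)),
+         ("model", RandomForestRegressor(random_state=42)),  # placeholder model
+     ]
+ )
+
+ # Example usage, with df being the cleaned dataframe from the notebook
+ inference_pipeline.fit(df[features], df["price"])
+ ```
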
## Data cleaning
@@ -296,7 +291,7 @@ with the cleaned data:

1. Make sure you are in the root directory of the starter kit, then create a stub
for the new step. The new step should accept the parameters ``input_artifact``
- (the input artifact), ``output_name`` (the name for the output artifact),
+ (the input artifact), ``output_artifact`` (the name for the output artifact),
``output_type`` (the type for the output artifact), ``output_description``
(a description for the output artifact), ``min_price`` (the minimum price to consider)
and ``max_price`` (the maximum price to consider):
@@ -308,7 +303,7 @@ with the cleaned data:
job_type [my_step]: basic_cleaning
short_description [My step]: A very basic data cleaning
long_description [An example of a step using MLflow and Weights & Biases]: Download from W&B the raw dataset and apply some basic data cleaning, exporting the result to a new artifact
- arguments [default]: input_artifact,output_name,output_type,output_description,min_price,max_price
+ parameters [parameter1,parameter2]: input_artifact,output_artifact,output_type,output_description,min_price,max_price
```
This will create a directory ``src/basic_cleaning`` containing the basic files required
for an MLflow step: ``conda.yml``, ``MLproject`` and the script (which we named ``run.py``).
@@ -318,48 +313,39 @@ with the cleaned data:
comments like ``INSERT TYPE HERE`` and ``INSERT DESCRIPTION HERE``). All parameters should be
of type ``str`` except ``min_price`` and ``max_price``, which should be ``float``.

- 3. Implement in the section marked ```# YOUR CODE HERE #``` the data cleaning steps we
- have implemented in the notebook. Remember to use the ``logger`` instance already provided
- to print meaningful messages to screen. For example, let's start by downloading the input
- artifact and read it with pandas:
+ 3. Implement in the section marked ```# YOUR CODE HERE #``` the steps we
+ have implemented in the notebook, including downloading the data from W&B.
+ Remember to use the ``logger`` instance already provided to print meaningful messages to screen
+ (a sketch of one possible implementation is shown at the end of this list).

- ```python
- logger.info(f"Fetching {args.input_artifact} from W&B...")
- artifact_local_path = run.use_artifact(args.input_artifact).file()
-
- logger.info("Reading with pandas")
- df = pd.read_csv(artifact_local_path)
- ```
-
- **_REMEMBER_**: Whenever you are using a library (like pandas here), you MUST add it as
- dependency in the ``conda.yml`` file. For example, here we are using pandas
- so we must add it to the ``conda.yml`` file, including a version:
- ```yaml
- dependencies:
-   - pip=20.3.3
-   - pandas=1.2.3
-   - pip:
-       - wandb==0.10.21
- ```
-
- Then implement the cleaning code we have used in the notebook, making sure to use
- ``args.min_price`` and ``args.max_price`` when dropping the outliers
+ Make sure to use ``args.min_price`` and ``args.max_price`` when dropping the outliers
(instead of hard-coding the values like we did in the notebook).
Save the results to a CSV file called ``clean_sample.csv``
- (``df.to_csv("clean_sample.csv", index=False)``) then upload it to W&B using:
+ (``df.to_csv("clean_sample.csv", index=False)``).
+ **_NOTE_**: Remember to use ``index=False`` when saving to CSV, otherwise the data checks in
+ the next step might fail because there will be an extra ``index`` column.
+
+ Then upload it to W&B using:

```python
artifact = wandb.Artifact(
-     args.output_name,
+     args.output_artifact,
    type=args.output_type,
    description=args.output_description,
)
artifact.add_file("clean_sample.csv")
run.log_artifact(artifact)
```

- **_NOTE_**: Remember to use ``index=False`` when saving to CSV, otherwise the data checks in
- the next step will fail!
+ **_REMEMBER_**: Whenever you are using a library (like pandas), you MUST add it as a
+ dependency in the ``conda.yml`` file. For example, here we are using pandas
+ so we must add it to the ``conda.yml`` file, including a version:
+ ```yaml
+ dependencies:
+   - pip=20.3.3
+   - pandas=1.2.3
+   - pip:
+       - wandb==0.10.31
+ ```

4. Add the ``basic_cleaning`` step to the pipeline (the ``main.py`` file):

@@ -382,15 +368,16 @@ with the cleaned data:
    "main",
    parameters={
        "input_artifact": "sample.csv:latest",
-         "output_name": "clean_sample.csv",
+         "output_artifact": "clean_sample.csv",
        "output_type": "clean_sample",
        "output_description": "Data with outliers and null values removed",
        "min_price": config['etl']['min_price'],
        "max_price": config['etl']['max_price']
    },
)
```
- 5. Run the pipeline
+ 5. Run the pipeline. If you go to W&B, you will see the new artifact type `clean_sample` and within it the
+ `clean_sample.csv` artifact.
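+ For reference, here is a minimal sketch of what the ``basic_cleaning`` step's ``run.py`` could look like. This is one
+ possible implementation of steps 2 and 3 above, not the only solution; adapt it to the stub generated by the cookiecutter:
+
+ ```python
+ # Sketch of src/basic_cleaning/run.py (illustrative)
+ import argparse
+ import logging
+
+ import pandas as pd
+ import wandb
+
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger()
+
+
+ def go(args):
+     run = wandb.init(job_type="basic_cleaning")
+     run.config.update(vars(args))
+
+     logger.info("Fetching %s from W&B", args.input_artifact)
+     artifact_local_path = run.use_artifact(args.input_artifact).file()
+     df = pd.read_csv(artifact_local_path)
+
+     # Drop outliers using the parameters instead of hard-coded values
+     idx = df["price"].between(args.min_price, args.max_price)
+     df = df[idx].copy()
+
+     # Convert last_review to datetime
+     df["last_review"] = pd.to_datetime(df["last_review"])
+
+     # Save without the index, then log the result as a new artifact
+     df.to_csv("clean_sample.csv", index=False)
+
+     artifact = wandb.Artifact(
+         args.output_artifact,
+         type=args.output_type,
+         description=args.output_description,
+     )
+     artifact.add_file("clean_sample.csv")
+     run.log_artifact(artifact)
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="A very basic data cleaning")
+     parser.add_argument("--input_artifact", type=str, required=True, help="Input artifact name")
+     parser.add_argument("--output_artifact", type=str, required=True, help="Name for the output artifact")
+     parser.add_argument("--output_type", type=str, required=True, help="Type of the output artifact")
+     parser.add_argument("--output_description", type=str, required=True, help="Description of the output artifact")
+     parser.add_argument("--min_price", type=float, required=True, help="Minimum price to consider")
+     parser.add_argument("--max_price", type=float, required=True, help="Maximum price to consider")
+     go(parser.parse_args())
+ ```
+
+ Once the step is wired into ``main.py`` as in step 4, you should be able to run it on its own with a command such as
+ `mlflow run . -P steps=basic_cleaning` (assuming the pipeline exposes a `steps` parameter, as in the download example earlier).
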
### Data testing
After the cleaning, it is a good practice to put some tests that verify that the data does not