
Commit

completed assignment
adityashrm21 committed Mar 22, 2019
1 parent c98cd10 commit f294fa4
Showing 30 changed files with 100,785 additions and 2 deletions.
19 changes: 19 additions & 0 deletions Pipfile
@@ -0,0 +1,19 @@
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[dev-packages]

[packages]
scipy = "*"
scikit-learn = "*"
pandas = "*"
gunicorn = "*"
peewee = "*"
psycopg2 = "*"
Flask = "*"
category_encoders = "==1.2.6"

[requires]
python_version = "3.7"
325 changes: 325 additions & 0 deletions Pipfile.lock

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions Procfile
@@ -0,0 +1,2 @@
web: gunicorn app:app

105 changes: 103 additions & 2 deletions README.md
@@ -1,2 +1,103 @@
# adult-income-prediction
An EDA with a deployed logistic regression model on the adult dataset to predict income.
# Adult Income Prediction using a Flask app on Heroku

Follow the steps provided below to reproduce the whole project.

### Setting up a virtual environment

We'll use a virtual environment for this project.
All of the necessary dependencies are listed in `requirements.txt`.

Install pipenv using the instructions given in [this repository](https://github.com/pypa/pipenv).

After installing `pipenv`, create a local virtual environment and install the project dependencies from `requirements.txt` by running the following command from the project's root directory (i.e., where this readme lives):

```bash
$ pipenv install -r requirements.txt
```
You need to activate the virtual environment before working inside it. To activate this project's virtualenv, run:
```bash
$ pipenv shell
```
You can deactivate the virtualenv by either typing `exit` or pressing `CTRL+d`.

You should now have everything you need installed.

### Data format before cleaning

This information is copied directly from the [UCI repository description of the adult dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names):

- income: >50K, <=50K.
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
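
For reference, here is a minimal sketch (not part of this commit) of loading the raw file with pandas. The file name `adult.data`, the absence of a header row, and `?` as the missing-value marker follow the UCI distribution:

```python
import pandas as pd

# Column order follows the UCI adult.data layout (income comes last).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

raw = pd.read_csv(
    "adult.data",           # assumed local copy of the UCI data file
    names=COLUMNS,          # the raw file has no header row
    skipinitialspace=True,  # values are separated by ", "
    na_values="?",          # UCI marks missing cells with "?"
)
print(raw.shape)
```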

### Run the EDA script to export cleaned data

```bash
R -e "rmarkdown::render('eda_adult.Rmd', clean=TRUE, output_format='pdf_document')"
```
The rendered report `eda_adult.pdf` contains the exploratory analysis of the adult income dataset; rendering it also exports the cleaned data.

### Data format after cleaning

After cleaning, each variable has been converted to numbers, and some categorical variables have been binned into broader categories (passed as strings), as defined below:

- income: 0 ('>50K'), 1 ('<=50K')
- age: continuous.
- workclass: 0 ('State-gov', 'Federal-gov', 'Local-gov'), 1 ('Self-emp-not-inc', 'Self-emp-inc', 'Without-pay', 'Never-worked'), 2 ('Private'), -1 ('unknown')
- education: 0 ("HS-grad", "11th", "9th", "7th-8th", "5th-6th", "10th", "Preschool", "12th", "1st-4th"), 1 ("Bachelors", "Some-college", "Assoc-acdm", "Assoc-voc"), 2 ("Masters", "Prof-school", "Doctorate", "Assoc-voc")
- marital_status: 0 ('Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse'), 1 ('Never-married','Divorced', 'Separated','Widowed')
- occupation: 0 ("Priv-house-serv", "Handlers-cleaners", "Other-service", "Armed-Forces", "Machine-op-inspct", "Farming-fishing", "Adm-clerical"), 1 ("Tech-support", "Craft-repair", "Protective-serv", "Transport-moving", "Sales"), 2 ("Exec-managerial", "Prof-specialty"), -1 ("unknown")
- race: 0 ("White"), 1 ("Black"), 2 ("Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other")
- sex: 0 ("Female"), 1 ("Male")
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native_country: 1 ("United-States"), 0 (all other countries)

Note: `unknown` refers to cells that were missing (`?`) in the respective columns.
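
To make the mapping concrete, here is a hypothetical sketch of one of these binnings in Python; the actual cleaning happens in `eda_adult.Rmd`, so the function below is illustrative only:

```python
import pandas as pd

# Workclass bins as described above; any missing or unseen value maps to -1.
WORKCLASS_BINS = {
    0: {"State-gov", "Federal-gov", "Local-gov"},
    1: {"Self-emp-not-inc", "Self-emp-inc", "Without-pay", "Never-worked"},
    2: {"Private"},
}

def bin_workclass(value):
    if pd.isna(value):
        return -1  # "unknown": the cell was a "?" in the raw data
    for code, members in WORKCLASS_BINS.items():
        if value in members:
            return code
    return -1

# Binned categoricals are passed downstream as strings.
demo = pd.Series(["Private", "State-gov", None, "Self-emp-inc"])
print(demo.map(bin_workclass).astype(str).tolist())  # ['2', '0', '-1', '1']
```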

### Model choice

Execute the `main.py` script, which trains a model on the cleaned data and exports it, along with the required data info, for deployment on Heroku.

```bash
python3 incomePrediction/main.py
```
I chose the `LogisticRegression` classifier from scikit-learn to get predictions (the test accuracy obtained is quite good, ~85%). Cross-validation is used to choose the key hyperparameter (`C`), which controls the degree of regularization. The script can be modified to use and tune any classifier available in `scikit-learn`. The training and test accuracies are comparable, so there appears to be no overfitting. I chose logistic regression because it is a simple linear classifier whose results are interpretable, which is what I would want from a model on a dataset where the predictor-response relationship is central to the analysis.
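
For illustration, here is a sketch of what this model-selection step could look like; the cleaned-data path below is hypothetical, and `main.py` may differ in detail:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumed: a cleaned CSV as produced by the EDA step (file name hypothetical).
data = pd.read_csv("cleaned_adult.csv")
X = data.drop(columns=["income"])
y = data["income"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validate over C, the inverse regularization strength:
# smaller C means stronger regularization.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
)
grid.fit(X_train, y_train)

print("best C:", grid.best_params_["C"])
print("train accuracy:", grid.score(X_train, y_train))
print("test accuracy:", grid.score(X_test, y_test))
```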

### Deploy the model on Heroku

The exported model is deployed as a microservice on Heroku using the steps given in [this repository](https://github.com/LDSSA/heroku-model-deploy#sign-up-and-set-up-at-heroku).

### Steps to get predictions

The model has been deployed on Heroku. To get predictions, you can use the following curl request:

```bash
curl -X POST https://adult-income-prediction.herokuapp.com/predict -d '{"id": 8, "observation": {"age": 39, "workclass": "2", "education": "2", "marital_status": "0", "occupation": "2", "race" : "0", "sex": "1", "capital_gain": 1230, "capital_loss": 0, "hours_per_week": 55, "native_country": "1"}}' -H "Content-Type:application/json"
```

Note: Make sure to use a new observation id for each request (if you want the results stored properly), since the results are stored in a SQLite database and the id serves as the primary identifier for the records.
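
The same request can also be made from Python with the `requests` library; `app.py` additionally exposes an `/update` endpoint for attaching the true label to a stored prediction:

```python
import requests

BASE = "https://adult-income-prediction.herokuapp.com"

observation = {
    "id": 9,  # pick a fresh id per request (see the note above)
    "observation": {
        "age": 39, "workclass": "2", "education": "2",
        "marital_status": "0", "occupation": "2", "race": "0",
        "sex": "1", "capital_gain": 1230, "capital_loss": 0,
        "hours_per_week": 55, "native_country": "1",
    },
}

resp = requests.post(BASE + "/predict", json=observation)
print(resp.json())  # e.g. {"proba": 0.12}

# Once the true income class is known, report it back.
resp = requests.post(BASE + "/update", json={"id": 9, "true_class": 1})
print(resp.json())
```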

### Testing framework

I have used the `pytest` library to test the `Util` class. To run the tests, use the following command from the root directory of the project:

```bash
pytest incomePrediction/tests/
```

Due to a shortage of time, I could not cover all kinds of tests, but I did set up a basic test infrastructure that can be extended to cover the remaining code (unit, integration, and end-to-end tests).
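
As a sketch of what such a test could look like (the `Util` API is not shown in this diff, so the import path and method name below are illustrative):

```python
# incomePrediction/tests/test_utils.py -- hypothetical example
from incomePrediction.utils import Util  # assumed import path

def test_unknown_workclass_maps_to_minus_one():
    # "?" cells should land in the "unknown" bin, encoded as -1.
    assert Util.bin_workclass("?") == -1

def test_private_workclass_maps_to_two():
    assert Util.bin_workclass("Private") == 2
```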
Binary file added __pycache__/utils.cpython-36.pyc
Binary file not shown.
126 changes: 126 additions & 0 deletions app.py
@@ -0,0 +1,126 @@
# adapted from https://github.com/LDSSA/heroku-model-deploy
import os
import json
import pickle
from sklearn.externals import joblib
import pandas as pd
from flask import Flask, jsonify, request
from peewee import (
SqliteDatabase, PostgresqlDatabase, Model, IntegerField,
FloatField, TextField, IntegrityError
)
from playhouse.shortcuts import model_to_dict


########################################
# Begin database stuff

if 'DATABASE_URL' in os.environ:
    # Parse a Heroku-style URL: postgres://user:password@host:port/dbname
    db_url = os.environ['DATABASE_URL']
    dbname = db_url.split('@')[1].split('/')[1]
    user = db_url.split('@')[0].split(':')[1].lstrip('//')
    password = db_url.split('@')[0].split(':')[2]
    host = db_url.split('@')[1].split('/')[0].split(':')[0]
    port = db_url.split('@')[1].split('/')[0].split(':')[1]
    DB = PostgresqlDatabase(
        dbname,
        user=user,
        password=password,
        host=host,
        port=port,
    )
else:
    # Fall back to a local SQLite file when no DATABASE_URL is set
    DB = SqliteDatabase('predictions.db')


class Prediction(Model):
    # One row per scored observation; true_class is filled in later
    # via the /update endpoint.
    observation_id = IntegerField(unique=True)
    observation = TextField()
    proba = FloatField()
    true_class = IntegerField(null=True)

    class Meta:
        database = DB


DB.create_tables([Prediction], safe=True)

# End database stuff
########################################

########################################
# Unpickle the previously-trained model


with open('columns.json') as fh:
columns = json.load(fh)

pipeline = joblib.load('model.pickle')

with open('dtypes.pickle', 'rb') as fh:
dtypes = pickle.load(fh)


# End model un-pickling
########################################


########################################
# Begin webserver stuff

app = Flask(__name__)


@app.route('/predict', methods=['POST'])
def predict():
    # flask provides a deserialization convenience function called
    # get_json that will work if the mimetype is application/json
    obs_dict = request.get_json()
    _id = obs_dict['id']
    observation = obs_dict['observation']
    # now do what we already learned in the notebooks about how to transform
    # a single observation into a dataframe that will work with a pipeline
    obs = pd.DataFrame([observation], columns=columns).astype(dtypes)
    # now get ourselves an actual prediction of the positive class
    proba = pipeline.predict_proba(obs)[0, 1]
    response = {'proba': proba}
    p = Prediction(
        observation_id=_id,
        proba=proba,
        observation=request.data
    )
    try:
        p.save()
    except IntegrityError:
        error_msg = 'Observation ID: "{}" already exists'.format(_id)
        response['error'] = error_msg
        print(error_msg)
        DB.rollback()
    return jsonify(response)


@app.route('/update', methods=['POST'])
def update():
    obs = request.get_json()
    try:
        p = Prediction.get(Prediction.observation_id == obs['id'])
        p.true_class = obs['true_class']
        p.save()
        return jsonify(model_to_dict(p))
    except Prediction.DoesNotExist:
        error_msg = 'Observation ID: "{}" does not exist'.format(obs['id'])
        return jsonify({'error': error_msg})


@app.route('/list-db-contents')
def list_db_contents():
    return jsonify([
        model_to_dict(obs) for obs in Prediction.select()
    ])


# End webserver stuff
########################################

if __name__ == "__main__":
    app.run(debug=True, port=5000)
Binary file added best_params.pickle
Binary file not shown.
1 change: 1 addition & 0 deletions columns.json
@@ -0,0 +1 @@
["age", "workclass", "education", "marital_status", "occupation", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country"]
