
An end-to-end data analysis pipeline, including model deployment


Latest commit: 08d3422 · Mar 31, 2019 · 13 Commits

Adult Income Prediction

Follow the steps provided below to reproduce the whole project.

Setting up a virtual environment

We'll use a virtual environment for this one. All of the necessary dependencies exist in requirements.txt.

Install pipenv using the instructions given in this repository.

After installing pipenv, open a terminal in the project's root directory (i.e., where this README lives on your computer). Create a local virtual environment and install the project dependencies listed in requirements.txt with the command:

$ pipenv install -r requirements.txt

We need to activate the virtual environment before working inside it. To activate this project's virtualenv, run the following:

$ pipenv shell

You can deactivate the virtualenv by either typing exit or pressing CTRL+D.

You should now have everything you need installed.

Data format before cleaning

This information is copied directly from the UCI datasets repository entry for the Adult dataset.

  • income: >50K, <=50K.
  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
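As a rough illustration of the raw layout, the sketch below parses comma-separated rows into dictionaries using only the Python standard library. It is not part of this project's code. The column order (income last) follows the published UCI file rather than the list above, and the sample rows are illustrative values written for this sketch; the `parse_rows` helper and the `COLUMNS` list are hypothetical names.

```python
import csv
import io

# Assumed column order of the raw UCI file (income comes last there).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]
CONTINUOUS = {
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
}


def parse_rows(text):
    """Parse raw comma-separated rows into dicts; '?' becomes None."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = {}
        for name, value in zip(COLUMNS, fields):
            value = value.strip()
            if value == "?":
                row[name] = None  # missing cell
            elif name in CONTINUOUS:
                row[name] = int(value)
            else:
                row[name] = value
        rows.append(row)
    return rows


# Illustrative rows in the raw layout (values chosen for this sketch).
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)
rows = parse_rows(sample)
print(rows[0]["age"], rows[0]["workclass"], rows[1]["workclass"])
```

Missing cells are kept as None here so that a later cleaning step can decide how to code them.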

Run the EDA script to export cleaned data

R -e "rmarkdown::render('eda_adult.Rmd', clean=TRUE, output_format='pdf_document')"

# use this command to generate a Markdown report instead
R -e "rmarkdown::render('eda_adult.Rmd', clean=TRUE, output_format='github_document')"

The rendered report (eda_adult.pdf, or the Markdown version if you chose that format) contains the exploratory analysis of the adult income dataset.

Data format after cleaning

During cleaning, every categorical variable is encoded as a number, and for some categorical variables several levels are binned together into broader categories. The codes are passed as strings and are defined below:

  • income: 0 ('>50K'), 1 ('<=50K')
  • age: continuous.
  • workclass: 0 ('State-gov', 'Federal-gov', 'Local-gov'), 1 ('Self-emp-not-inc', 'Self-emp-inc', 'Without-pay', 'Never-worked'), 2 ('Private'), -1 ('unknown')
  • education: 0 ("HS-grad", "11th", "9th", "7th-8th", "5th-6th", "10th", "Preschool", "12th", "1st-4th"), 1 ("Bachelors", "Some-college", "Assoc-acdm", "Assoc-voc"), 2 ("Masters", "Prof-school", "Doctorate", "Assoc-voc")
  • marital_status: 0 ('Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse'), 1 ('Never-married','Divorced', 'Separated','Widowed')
  • occupation: 0 ("Priv-house-serv", "Handlers-cleaners", "Other-service", "Armed-Forces", "Machine-op-inspct", "Farming-fishing", "Adm-clerical"), 1 ("Tech-support", "Craft-repair", "Protective-serv", "Transport-moving", "Sales"), 2 ("Exec-managerial", "Prof-specialty"), -1 ("unknown")
  • race: 0 ("White"), 1 ("Black"), 2 ("Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other")
  • sex: 0 ("Female"), 1 ("Male")
  • capital_gain: continuous.
  • capital_loss: continuous.
  • hours_per_week: continuous.
  • native_country: 1 ("United-States"), 0 (all the rest of countries)

Note: unknown refers to cells that were missing in the respective columns (marked as ? in the raw data).
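If you need to prepare a raw observation by hand, the binning above can be expressed as lookup tables. The sketch below is a Python illustration only: the actual cleaning happens in eda_adult.Rmd, and the names `WORKCLASS_CODES`, `OCCUPATION_CODES`, and `encode` are hypothetical. Only the two variables that define an unknown code are shown, and treating unlisted values as unknown is an assumption.

```python
# Codes copied from the list above; -1 stands for "unknown".
WORKCLASS_CODES = {
    "State-gov": 0, "Federal-gov": 0, "Local-gov": 0,
    "Self-emp-not-inc": 1, "Self-emp-inc": 1,
    "Without-pay": 1, "Never-worked": 1,
    "Private": 2,
}
OCCUPATION_CODES = {
    "Priv-house-serv": 0, "Handlers-cleaners": 0, "Other-service": 0,
    "Armed-Forces": 0, "Machine-op-inspct": 0, "Farming-fishing": 0,
    "Adm-clerical": 0,
    "Tech-support": 1, "Craft-repair": 1, "Protective-serv": 1,
    "Transport-moving": 1, "Sales": 1,
    "Exec-managerial": 2, "Prof-specialty": 2,
}


def encode(value, codes, unknown=-1):
    """Return the bin code as a string; missing or unlisted values map to `unknown`."""
    if value is None or value == "?":
        return str(unknown)
    return str(codes.get(value, unknown))


print(encode("Private", WORKCLASS_CODES))       # code for 'Private'
print(encode("?", OCCUPATION_CODES))            # missing cell
print(encode("Tech-support", OCCUPATION_CODES))
```

The codes are returned as strings because that is how the categorical values are passed in the cleaned format.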

Model choice

Execute the script, which trains a model on the cleaned data and exports it along with the data info required for deployment on Heroku.

python3 incomePrediction/

I chose the LogisticRegression classifier from scikit-learn to get predictions; the test accuracy is good, at about 85%. Cross-validation is used to choose the key hyperparameter (C), which controls the degree of regularization. The script can be modified to use and tune any classifier available in scikit-learn. The training and test accuracies are comparable, so there is no sign of overfitting. I went with logistic regression because it is a simple linear classifier whose results are interpretable, which is what I would expect from a model on a dataset like this one, where the predictor-response relationship is an important part of the analysis. I also tried building and tuning a RandomForest classifier, but it improved accuracy by only about 1%, which is not enough to justify the added complexity, so the simpler model is the better choice.
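Choosing C by cross-validation can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration, not the project's training script: the data is synthetic (generated with make_classification as a stand-in for the cleaned dataset), and the candidate grid, the 5 folds, and the 75/25 split are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned data (assumption, not the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid of candidate values for the regularization parameter C.
candidate_c = [0.01, 0.1, 1.0, 10.0]

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": candidate_c},
    cv=5,                # 5-fold cross-validation on the training split
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_c = search.best_params_["C"]
train_acc = search.score(X_train, y_train)  # accuracy of the refit best model
test_acc = search.score(X_test, y_test)
print(f"best C={best_c}, train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

Comparing the two printed accuracies is the same overfitting check described above: a large gap between them would suggest the model is fitting noise in the training split.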

Deploy the model on Heroku

The exported model is deployed as a microservice on Heroku using the steps given in this repository. The file is adapted from the same repository.

Steps to get predictions

The model has been deployed on Heroku. To get predictions, you can use the following curl request:

curl -X POST -d '{"id": 10, "observation": {"age": 39, "workclass": "2", "education": "2", "marital_status": "0", "occupation": "2", "race" : "0", "sex": "1", "capital_gain": 1230, "capital_loss": 0, "hours_per_week": 55, "native_country": "1"}}' -H "Content-Type:application/json"
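The same request can be built from Python with the standard library. This is a sketch under assumptions: `PLACEHOLDER_URL` is a made-up address that must be replaced with the deployed app's endpoint, and `build_request` is a hypothetical helper. The sketch only constructs the request; the line that would send it is left as a comment.

```python
import json
import urllib.request

# Placeholder only: replace with the address of the deployed service.
PLACEHOLDER_URL = "https://example.invalid/"


def build_request(url, observation_id, observation):
    """Build a JSON POST request carrying one observation."""
    body = json.dumps({"id": observation_id, "observation": observation})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same observation as in the curl example; categorical codes are strings.
observation = {
    "age": 39, "workclass": "2", "education": "2", "marital_status": "0",
    "occupation": "2", "race": "0", "sex": "1", "capital_gain": 1230,
    "capital_loss": 0, "hours_per_week": 55, "native_country": "1",
}
request = build_request(PLACEHOLDER_URL, 10, observation)

# To actually send it once the URL is real:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```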

Note: Use a distinct observation id for each request if you want the results to be stored properly. The results are saved in an SQLite database, and the id is the primary identifier for each record.
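To see why a reused id is a problem, here is a small sketch with Python's built-in sqlite3 module. The table name, columns, and stored values are assumptions made for illustration, not the schema used by the deployed service; the only point is that an id declared as a primary key rejects a second record with the same value.

```python
import sqlite3

# In-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predictions ("
    "id INTEGER PRIMARY KEY, observation TEXT, prediction INTEGER)"
)


def store(conn, observation_id, observation_json, prediction):
    """Insert one record; return False if the id was already used."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO predictions (id, observation, prediction) "
                "VALUES (?, ?, ?)",
                (observation_id, observation_json, prediction),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Placeholder values, not real model output.
first = store(conn, 10, '{"age": 39}', 1)
second = store(conn, 10, '{"age": 52}', 0)  # same id, so it is rejected
count = conn.execute("SELECT COUNT(*) FROM predictions").fetchone()[0]
print(first, second, count)
```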

Testing framework

I used the pytest library to test the Util class. To run the tests, use the following command from the root directory of the project:

pytest incomePrediction/tests/

The test suite is not yet exhaustive, but the basic test infrastructure is in place.


