checking readme instructions
adityashrm21 committed Mar 31, 2019
1 parent 7b2adca commit 8294518
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -1,4 +1,4 @@
-# Adult Income Prediction using Flask app on Heroku
+# Adult Income Prediction

Follow the steps provided below to reproduce the whole project.

@@ -25,7 +25,7 @@ Now you should have everything installed that we need.

### Data format before cleaning

-This information is directly copied from the [UCI datasets repository for adult dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names)
+This information is directly copied from the [UCI datasets repository for adult dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names).

- income: >50K, <=50K.
- age: continuous.
@@ -79,7 +79,7 @@ Execute the `main.py` script which will train a model on the cleaned data and ex
```bash
python3 incomePrediction/main.py
```
-I chose the `LogisticRegression` classifier from scikit-learn to get predictions (The test accuracy obtained is quite well ~ 85%). Cross-validation is done to choose the important hyperparameter (`C`) to control the degree of regularization. The script can be modified to use and tune any classifier available in `scikit-learn`. Both the training and test accuracies are comparable and hence, there seems to be no overfitting. I chose to go with Logistic Regression because it is a simple linear classifier whose results are interpretable and this is what I would expect from a model on such a dataset where the predictor-response relationship seems to be important in the analysis. I also tried building and tuning a RandomForest classifier and there was a 1 percent increase in the accuracies which is not much higher and therefore, a simpler model is a better choice.
+I chose the `LogisticRegression` classifier from scikit-learn to get predictions (The test accuracy obtained is quite well ~ 85%). Cross-validation is done to choose the important hyperparameter (`C`) to control the degree of regularization. The script can be modified to use and tune any classifier available in `scikit-learn`. Both the training and test accuracies are comparable and hence, there seems to be no overfitting. I chose to go with Logistic Regression because it is a simple linear classifier whose results are interpretable and this is what I would expect from a model on such a dataset where the predictor-response relationship seems to be important in the analysis. I also tried building and tuning a RandomForest classifier and there was a 1% increase in the accuracies which is not much higher and therefore, a simpler model is a better choice.
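
For readers who want to see what cross-validating `C` might look like, here is a minimal sketch. It is not the code in `main.py`: the synthetic data from `make_classification`, the candidate values in `param_grid`, the 5-fold setting, and all variable names are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned adult data (assumption, not the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate regularization strengths (hypothetical grid); smaller C means
# stronger regularization in scikit-learn's LogisticRegression.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training split picks the best C by accuracy.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_C = search.best_params_["C"]
# With the default refit=True, score() uses the best estimator refit on all
# of the training data.
train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)

print(f"best C: {best_C}")
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Comparing `train_acc` with `test_acc` is the same overfitting check described above: a large gap between the two would suggest the model is fitting noise in the training split.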

### Deploy the model on Heroku

@@ -103,4 +103,4 @@ I have used the `pytest` library to test the [Util class](https://github.com/adi
pytest incomePrediction/tests/
```

-Due to shortage on time, I could not cover all kinds of tests but I did set up a basic test infrastructure which could be extended to test the remaining code (unit, integration and e2e tests).
+As of now, the tests section is not exhaustive but I did set up a basic test infrastructure.
