The goal of this project is to help a fictitious charity organization (CharityML) develop a supervised learning model that can accurately predict whether an individual makes more than $50,000 annually. Knowing an individual's income can help the charity decide which potential donors to reach out to and how large a donation to request.
The detailed code is in the finding_donors.ipynb notebook.
The modified census dataset consists of approximately 32,000 data points, each with 13 features. This dataset is a modified version of the dataset published in the paper "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", by Ron Kohavi. You may find this paper online, with the original dataset hosted on UCI. More details about the data schema are at the end.
The code was developed using the Anaconda distribution of Python, version 3.8.1. The Python libraries used are numpy, pandas, sklearn, and matplotlib.
The data is relatively clean, but needs some preprocessing. I performed:
- log transformations on features that are highly skewed
- scaling on numerical features to ensure that each feature is treated equally, which makes feature importance comparison easier
- one-hot encoding for categorical features
- shuffling and splitting the data into training and testing sets
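The four steps above could be sketched as follows. This is a minimal illustration, not the notebook's actual code: the sample rows are made up, and treating capital-gain and capital-loss as the skewed columns is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up sample standing in for the census data (illustrative values only).
data = pd.DataFrame({
    "age": [39.0, 50.0, 38.0, 53.0, 28.0, 37.0],
    "capital-gain": [2174.0, 0.0, 0.0, 0.0, 0.0, 14084.0],
    "capital-loss": [0.0, 0.0, 0.0, 1902.0, 0.0, 0.0],
    "hours-per-week": [40.0, 13.0, 40.0, 40.0, 40.0, 50.0],
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# Separate the target and encode it as 0/1.
income = (data["income"] == ">50K").astype(int)
features = data.drop(columns="income")

# 1. Log-transform the skewed features; log(x + 1) keeps zeros defined.
skewed = ["capital-gain", "capital-loss"]  # assumed choice of skewed columns
features[skewed] = np.log1p(features[skewed])

# 2. Scale the numerical features to the [0, 1] range.
numerical = ["age", "capital-gain", "capital-loss", "hours-per-week"]
features[numerical] = MinMaxScaler().fit_transform(features[numerical])

# 3. One-hot encode the categorical features.
features = pd.get_dummies(features)

# 4. Shuffle and split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, income, test_size=0.2, random_state=0
)
print(features.columns.tolist())
```

For brevity the scaler here is fit on the whole sample before splitting; fitting it on the training split alone would keep test-set ranges out of the transformation.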
In addition to a naive predictor, I applied three supervised learning classifiers:
- Decision Tree
- Support Vector Machines
- Ensemble Methods (AdaBoost)
I selected the F-score (beta = 0.5) as the metric for evaluating the model's performance. Beta is set to 0.5 to put more emphasis on precision: the model's ability to precisely predict those who make more than $50,000 is more important than its ability to recall those individuals. In other words, a false positive is worse than a false negative.
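To make the effect of beta concrete, here is a small sketch of the F-beta formula applied to a naive predictor. The counts are hypothetical, and the sketch assumes the naive predictor labels every individual as making more than $50,000.

```python
def fbeta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, for illustration only: 1,000 people, 250 of them earn >50K.
n_total, n_positive = 1000, 250

# Predicting ">50K" for everyone finds every true positive (recall = 1),
# but precision is only the share of positives in the data.
naive_precision = n_positive / n_total
naive_recall = 1.0

naive_f_half = fbeta(naive_precision, naive_recall, beta=0.5)
naive_f_one = fbeta(naive_precision, naive_recall, beta=1.0)
print(f"F0.5 = {naive_f_half:.4f}, F1 = {naive_f_one:.4f}")
# prints "F0.5 = 0.2941, F1 = 0.4000"
```

With beta = 0.5 the low precision drags the score down further than it does with beta = 1, which is the behavior wanted when false positives are the costlier error.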
I then created a training and prediction pipeline to quickly and effectively train models and make predictions on the testing data. In the preliminary results, the AdaBoost model proved to be the best classifier in terms of overall accuracy, F-score, and training time. I further optimized the model using GridSearchCV. The final model has an accuracy score of 0.86 and an F-score of 0.73.
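A pipeline of this kind might look like the sketch below. It is not the notebook's code: the helper name train_predict, the synthetic data from make_classification, and the parameter grid are all assumptions chosen for illustration.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def train_predict(learner, X_train, y_train, X_test, y_test):
    """Fit a learner, then report training time and test-set scores."""
    start = perf_counter()
    learner.fit(X_train, y_train)
    train_time = perf_counter() - start

    predictions = learner.predict(X_test)
    return {
        "train_time": train_time,
        "accuracy": accuracy_score(y_test, predictions),
        "f_score": fbeta_score(y_test, predictions, beta=0.5),
    }


# Synthetic stand-in for the preprocessed census data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Compare the three classifiers with the same helper.
learners = [
    DecisionTreeClassifier(random_state=0),
    SVC(random_state=0),
    AdaBoostClassifier(random_state=0),
]
results = {
    type(m).__name__: train_predict(m, X_train, y_train, X_test, y_test)
    for m in learners
}

# Tune AdaBoost over a small, illustrative grid, scored by F0.5.
param_grid = {"n_estimators": [50, 100], "learning_rate": [0.5, 1.0]}
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    AdaBoostClassifier(random_state=0), param_grid, scoring=scorer, cv=3
)
grid.fit(X_train, y_train)

best_scores = train_predict(grid.best_estimator_, X_train, y_train, X_test, y_test)
print(grid.best_params_, best_scores)
```

Passing the same F0.5 scorer to GridSearchCV keeps the tuning objective aligned with the metric used to compare the models.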
I also identified the five most important features for predicting whether an individual makes at most or more than $50,000. They are listed below in order of importance:
- capital-loss: the more there is to lose, possibly the lower the chance of donating
- age: the older the individual, possibly the greater the capital gain
- capital-gain: the more an individual makes, possibly the better the chance of donating
- hours-per-week: the more hours worked per week, possibly the greater the capital gain
- education-num: the more years of education, possibly the greater the capital gain
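One common way to obtain such a ranking is the feature_importances_ attribute that scikit-learn's tree-based ensembles expose after fitting. The sketch below assumes that approach and uses synthetic data with placeholder feature names, so its output says nothing about the census features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in data with placeholder feature names (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = AdaBoostClassifier(random_state=0).fit(X, y)

# feature_importances_ holds one non-negative weight per feature.
importances = model.feature_importances_

# Indices of the five largest weights, most important first.
top_five_idx = np.argsort(importances)[::-1][:5]

for idx in top_five_idx:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```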
I also trained the model on the same training set using only the five most important features. The reduced model has an accuracy score of 0.83 and an F-score of 0.68, which is 3.2% lower accuracy and 7.7% lower F-score than the full model. If shorter training time is preferred, I'd consider using the reduced model.
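The comparison between the full and reduced models can be reproduced along these lines. This is a sketch on synthetic data; the helper names evaluate and relative_change are hypothetical, and the printed figures will not match the scores reported above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_eval, y_eval):
    """Return (accuracy, F0.5) for a fitted model on the given data."""
    pred = model.predict(X_eval)
    return accuracy_score(y_eval, pred), fbeta_score(y_eval, pred, beta=0.5)


def relative_change(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100


# Synthetic stand-in for the preprocessed census data (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

full_model = AdaBoostClassifier(random_state=0).fit(X_train, y_train)
top_five = np.argsort(full_model.feature_importances_)[::-1][:5]

# clone() copies the hyperparameters but not the fitted state,
# so the reduced model is trained from scratch on five columns.
reduced_model = clone(full_model).fit(X_train[:, top_five], y_train)

full_acc, full_f = evaluate(full_model, X_test, y_test)
red_acc, red_f = evaluate(reduced_model, X_test[:, top_five], y_test)

print(f"accuracy change: {relative_change(red_acc, full_acc):+.1f}%")
print(f"F-score change:  {relative_change(red_f, full_f):+.1f}%")
```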
Features
- age: Age
- workclass: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
- education_level: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
- education-num: Number of educational years completed
- marital-status: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
- occupation: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
- relationship: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
- race: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
- sex: Sex (Female, Male)
- capital-gain: Monetary Capital Gains
- capital-loss: Monetary Capital Losses
- hours-per-week: Average Hours Per Week Worked
- native-country: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)
Target Variable
- income: Income Class (<=50K, >50K)
Special thanks to Udacity for creating this awesome project.