Binary_Classification_On_Census_Data

A comparative analysis of several classification techniques applied to US adult census data.

Project Description

This project applies several machine learning models: logistic regression, Linear Discriminant Analysis (LDA), decision trees, k-nearest neighbors (KNN), and Naive Bayes. Each model is implemented and evaluated on accuracy, precision, recall, F1-score, and computational efficiency so that their performance can be compared directly.


About Data Set

Overview

The Census Income dataset is a collection of records describing individuals' demographic, educational, and employment-related attributes. It is often used for classification tasks, specifically for predicting whether an individual's income exceeds $50,000 (">50K") or not ("<=50K").


Dataset Information

Data Columns

ID: An identifier for each individual.
age: The age of the individual.
workclass: The employment class of the individual (e.g., State-gov, Self-emp-not-inc, Private, etc.).
fnlwgt: The final weight, which represents the number of people the census believes the entry represents.
education: The highest level of education completed by the individual.
education_num: A numerical representation of education, often equivalent to the years of education.
marital_status: The marital status of the individual.
occupation: The type of occupation the individual is engaged in.
relationship: The individual's role within the household (e.g., Husband, Not-in-family, Own-child, etc.).
race: The race of the individual.
sex: The gender of the individual.
capital_gain: The amount of capital gains the individual has.
capital_loss: The amount of capital losses the individual has.
hr_per_wk: The number of hours worked per week.
native_country: The native country of the individual.
class: The target variable, indicating whether the individual's income exceeds $50,000 (">50K") or not ("<=50K").


Data Sample

Here's a sample of the first few records in the dataset (the capital_loss column is not shown in this sample):

ID	age	workclass	fnlwgt	education	education_num	marital_status	occupation	relationship	race	sex	capital_gain	hr_per_wk	native_country	class
1	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
2	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
3	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
4	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
5	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K
6	37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	40	United-States	<=50K
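
As a rough illustration of how rows like these could be read and the target encoded, here is a minimal Python sketch. The project itself appears to use R (ggplot2 is mentioned later), so this is not the author's code. The `label` field and the 1/0 encoding are assumptions made for the example; the two data rows are copied from the sample above.

```python
import csv
import io

# Header plus two rows copied from the sample above (tab-separated).
SAMPLE = (
    "ID\tage\tworkclass\tfnlwgt\teducation\teducation_num\tmarital_status\t"
    "occupation\trelationship\trace\tsex\tcapital_gain\thr_per_wk\t"
    "native_country\tclass\n"
    "1\t39\tState-gov\t77516\tBachelors\t13\tNever-married\tAdm-clerical\t"
    "Not-in-family\tWhite\tMale\t2174\t40\tUnited-States\t<=50K\n"
    "5\t28\tPrivate\t338409\tBachelors\t13\tMarried-civ-spouse\t"
    "Prof-specialty\tWife\tBlack\tFemale\t0\t40\tCuba\t<=50K\n"
)

# Columns in the sample that hold integers.
NUMERIC_COLUMNS = ("age", "fnlwgt", "education_num", "capital_gain", "hr_per_wk")


def parse_records(text):
    """Parse tab-separated census rows into dicts with typed numeric fields."""
    records = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        for name in NUMERIC_COLUMNS:
            row[name] = int(row[name])
        # Assumed encoding for illustration: 1 for ">50K", 0 for "<=50K".
        row["label"] = 1 if row["class"].strip() == ">50K" else 0
        records.append(row)
    return records


records = parse_records(SAMPLE)
print(records[0]["age"], records[0]["workclass"], records[0]["label"])
```

Keeping the original `class` string alongside the numeric `label` makes it easy to check the encoding against the raw data.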


Purpose

This dataset is often used for tasks related to income prediction, demographic analysis, and employment trends. It can be used to build predictive models to classify individuals into income categories and gain insights into the factors affecting an individual's income.

Dataset Source

The data corresponds to the widely used Adult (Census Income) dataset, which is available from the UCI Machine Learning Repository and other data platforms.

Please note that this is a simplified dataset description for use on GitHub or other platforms for data sharing and collaboration. If you have more detailed information or specific instructions for using this dataset, please provide them separately.


Data Exploration

In this phase of the project, I addressed common data challenges such as poor data quality, multicollinearity, and correlation between pairs of variables. The key insights are as follows:

Insight	Visualization
Plot 1: Age vs. Income Bracket. Respondents in the higher income bracket tend to be older.
Plot 2: Level of Education vs. Income Bracket. Income tends to be lower for respondents with fewer than 12 years of education.
Plot 3: Correlation Plot. No multicollinearity is evident.
Plot 4: Gender vs. Income Bracket. The proportion of female individuals with income >50K is less than half that of male individuals.
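
One simple way to screen for multicollinearity of the kind the correlation plot looks for is to compute pairwise Pearson correlations and flag any pair above a cutoff. The Python sketch below is illustrative only (the project appears to use R). The function names and the 0.8 threshold are assumptions, and the input columns are the six values from the data sample, not the full dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def high_correlation_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold.

    The 0.8 default is an assumed cutoff, not a value from the project.
    """
    names = sorted(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Values taken from the six sample rows shown earlier.
sample_columns = {
    "age": [39, 50, 38, 53, 28, 37],
    "education_num": [13, 13, 9, 7, 13, 14],
    "hr_per_wk": [40, 13, 40, 40, 40, 40],
}
print(high_correlation_pairs(sample_columns))
```

Six rows are far too few to draw conclusions from; the call only shows the shape of the check.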


Modeling Approach

Steps -

Data Import and Initial Exploration
Data Exploration: Numerical and Graphical Summaries, Correlation Plot, Summary Statistics for Numeric Variables
Visual Exploration with ggplot2
Data Preparation and Train-Test Split (70-30)
Model Training and Evaluation
Model Evaluation: Predictions on Test Set, Confusion Matrix, Classification Metrics
Report Performance Metrics
Identify best performing model
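
The 70-30 train-test split in the steps above can be sketched as follows. This is a minimal Python illustration, not the project's code (which appears to be in R); the function name, the seed, and the use of a plain random shuffle rather than stratified sampling are all assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 dataset records
train, test = train_test_split(rows)
print(len(train), len(test))  # 70 30
```

Fixing the seed keeps the split identical across runs, so every model is evaluated on the same held-out rows.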


Model Performance

Overall, the logistic regression model outperformed the other models.

Model Comparison

Logistic Regression: High accuracy with low variance; high kappa indicating good agreement beyond chance.
Linear Discriminant Analysis (LDA): High accuracy similar to Logistic Regression; comparable kappa, suggesting similar performance in terms of agreement.
Decision Tree: Lower accuracy and kappa compared to Logistic Regression and LDA, indicating less reliability.
K-Nearest Neighbors (KNN): High accuracy with some variance; kappa is also high, close to Logistic Regression and LDA, indicating good reliability.
Naive Bayes: Highest accuracy among the models with very low variance; however, its kappa is lower than that of Logistic Regression, LDA, or KNN, indicating less agreement beyond chance despite the high accuracy.
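
To make the accuracy-versus-kappa distinction in this comparison concrete, the Python sketch below derives accuracy, precision, recall, F1-score, and Cohen's kappa from a 2x2 confusion matrix. It is illustrative only (the project appears to use R), the function name is an assumption, and the counts are made up for the example; they are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Agreement expected by chance, from the marginal rates of
    # predicted and actual labels for each class.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_pos + p_neg
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Made-up counts for illustration only.
balanced = classification_metrics(tp=40, fp=10, fn=20, tn=130)
print(balanced)

# A classifier that always predicts "<=50K" on a 25%-positive test set:
# accuracy is 0.75, yet kappa is 0 because nothing beats chance.
always_negative = classification_metrics(tp=0, fp=0, fn=50, tn=150)
print(always_negative["accuracy"], always_negative["kappa"])
```

The second call shows why a model can rank high on accuracy while trailing on kappa when the income classes are imbalanced.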


Conclusion

Logistic Regression and LDA: Both perform well, with high accuracy and kappa, making them reliable choices.
KNN and Decision Tree: Also perform well, with accuracy and kappa close to those of Logistic Regression and LDA.
Naive Bayes: While very accurate, its agreement beyond chance (kappa) is lower than that of the best performers.


Author

@Abbas S.

License

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Acknowledgments

Inspiration, code snippets, etc.

