This repository contains the solution for the Kaggle Playground Series - Season 5, Episode 7 competition. The goal is to predict a person's personality type (one of 16 types) from their answers to a custom survey.
This project uses an XGBoost Classifier to build four independent binary models—one for each of the four personality dichotomies—to achieve a robust and accurate prediction.
The 16 personality types (e.g., INTJ, ESFP) are a combination of four binary traits:
- Introversion (I) vs. Extroversion (E)
- Intuition (N) vs. Sensing (S)
- Feeling (F) vs. Thinking (T)
- Perceiving (P) vs. Judging (J)
Instead of treating this as a complex 16-class classification problem, this solution builds four separate binary classification models. The final personality type is determined by concatenating the results of these four models (e.g., I + N + T + J = INTJ).
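A minimal sketch of this decomposition (the helper names below are illustrative, not taken from the notebooks):

```python
# Split a 16-type label into four 0/1 flags, and rebuild the label from them.
def split_type(personality: str) -> dict:
    return {
        "is_I": int(personality[0] == "I"),
        "is_N": int(personality[1] == "N"),
        "is_F": int(personality[2] == "F"),
        "is_P": int(personality[3] == "P"),
    }

def join_type(is_I: int, is_N: int, is_F: int, is_P: int) -> str:
    return ("I" if is_I else "E") + ("N" if is_N else "S") + \
           ("F" if is_F else "T") + ("P" if is_P else "J")

assert split_type("INTJ") == {"is_I": 1, "is_N": 1, "is_F": 0, "is_P": 0}
assert join_type(1, 1, 0, 0) == "INTJ"
```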
- Loading: The `train.csv` and `test.csv` files are loaded using `pandas`.
- Target Engineering: The single `Personality` column (e.g., 'INFP') in the training data is split into four separate binary target variables: `is_I`, `is_N`, `is_F`, and `is_P`.
  - 'INFP' -> `is_I=1`, `is_N=1`, `is_F=1`, `is_P=1`
  - 'ESTJ' -> `is_I=0`, `is_N=0`, `is_F=0`, `is_P=0`
- Features: The survey questions serve as the features (`X`). The `id` column is dropped.
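A minimal sketch of this preprocessing, assuming the data files sit in a `data/` folder and the label column is named `Personality` as described above (variable names are illustrative):

```python
import pandas as pd

# Load the competition data.
train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

# Split the single Personality label into four binary targets.
train["is_I"] = (train["Personality"].str[0] == "I").astype(int)
train["is_N"] = (train["Personality"].str[1] == "N").astype(int)
train["is_F"] = (train["Personality"].str[2] == "F").astype(int)
train["is_P"] = (train["Personality"].str[3] == "P").astype(int)

# Features are the survey answers; drop the id and target columns.
target_cols = ["Personality", "is_I", "is_N", "is_F", "is_P"]
X = train.drop(columns=["id"] + target_cols)
X_test = test.drop(columns=["id"])
```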
- An `XGBClassifier` is used as the base model for its high performance and speed.
- Four models are trained independently:
  - `model_IE`: Predicts `is_I` (Introversion/Extroversion) using `X`
  - `model_NS`: Predicts `is_N` (Intuition/Sensing) using `X`
  - `model_FT`: Predicts `is_F` (Feeling/Thinking) using `X`
  - `model_PJ`: Predicts `is_P` (Perceiving/Judging) using `X`
- The `Modeling.ipynb` notebook establishes this baseline approach, while `Modeling2.ipynb` likely focuses on hyperparameter tuning (e.g., `GridSearchCV` or `RandomizedSearchCV`) to optimize each of the four models.
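Continuing from the preprocessing sketch above, the four binary models might be trained along these lines (the hyperparameters here are placeholders, not the tuned values from `Modeling2.ipynb`):

```python
from xgboost import XGBClassifier

def fit_binary_model(X, y):
    # Baseline settings for illustration only; the notebooks may use others.
    model = XGBClassifier(
        n_estimators=300,
        max_depth=5,
        learning_rate=0.1,
        eval_metric="logloss",
    )
    model.fit(X, y)
    return model

# One independent binary model per personality dichotomy.
model_IE = fit_binary_model(X, train["is_I"])
model_NS = fit_binary_model(X, train["is_N"])
model_FT = fit_binary_model(X, train["is_F"])
model_PJ = fit_binary_model(X, train["is_P"])
```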
- The test data (`test.csv`) is loaded.
- Each of the four trained models (`model_IE`, `model_NS`, etc.) predicts its respective binary class on the test data.
- The four binary predictions are mapped back to their letter codes (e.g., `1` -> `'I'`, `0` -> `'E'`).
- The final `Personality` string is created by concatenating the four predicted letters.
- The results are formatted into `submission.csv` with `id` and `Personality` columns.
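Continuing the same sketch, inference and submission generation could look like this (column names follow the description above):

```python
# Predict each binary trait on the test features.
pred_I = model_IE.predict(X_test)
pred_N = model_NS.predict(X_test)
pred_F = model_FT.predict(X_test)
pred_P = model_PJ.predict(X_test)

# Map the binary predictions back to letters and concatenate them.
letters = [
    ("I" if i else "E") + ("N" if n else "S") +
    ("F" if f else "T") + ("P" if p else "J")
    for i, n, f, p in zip(pred_I, pred_N, pred_F, pred_P)
]

# Build the submission file.
submission = pd.DataFrame({"id": test["id"], "Personality": letters})
submission.to_csv("submission.csv", index=False)
```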
The models were tuned (likely using Bayesian optimization or a similar method, as seen in Modeling2.ipynb) to find the optimal set of parameters:
```json
{
  "n_estimators": 299,
  "max_depth": 5,
  "learning_rate": 0.15841262137302178,
  "subsample": 0.8519152889164038,
  "colsample_bytree": 0.6808885474211932,
  "gamma": 2.0070959113867732,
  "reg_alpha": 1.2522715414146957,
  "reg_lambda": 2.5895571241033593
}
```

- Best CV Score: 0.9639316464361883
- Tuned Model R2 (Train): 0.9640534903692799
- Tuned Model RMSE: 0.30210824789781193
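For reference, the tuned parameters above can be passed directly to `XGBClassifier`. The sketch below reuses `X` and `train` from the earlier example and assumes the same parameter set is applied to each of the four binary models, which the notebooks may or may not do:

```python
from xgboost import XGBClassifier

# Tuned parameters copied from the JSON block above.
tuned_params = {
    "n_estimators": 299,
    "max_depth": 5,
    "learning_rate": 0.15841262137302178,
    "subsample": 0.8519152889164038,
    "colsample_bytree": 0.6808885474211932,
    "gamma": 2.0070959113867732,
    "reg_alpha": 1.2522715414146957,
    "reg_lambda": 2.5895571241033593,
}

# Shown for the I/E model; repeat for the other three targets.
tuned_model = XGBClassifier(**tuned_params, eval_metric="logloss")
tuned_model.fit(X, train["is_I"])
```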
1. Clone the repository:

   ```bash
   git clone https://github.com/RaymussenArthur/Personality-Prediction.git
   cd Personality-Prediction
   ```

2. Install dependencies (it's recommended to use a virtual environment):

   ```bash
   pip install pandas numpy scikit-learn xgboost jupyter
   ```

3. Get the data:

   - Download the `train.csv`, `test.csv`, and `sample_submission.csv` files from the Kaggle competition page.
   - Place them inside the `/data` folder.
4. Run the notebooks:

   - Start Jupyter:

     ```bash
     jupyter notebook
     ```

   - Open and run the notebooks in the `Notebooks/` directory, starting with `Modeling.ipynb` and `Modeling2.ipynb`, to train the models and generate a submission.