How to find key drivers that influence output of an Artificial Neural Network

clkride/Feature_Importance_ANN

Feature_Importance_ANN

Aim -

  1. To find key drivers that influence the output of an Artificial Neural Network
  2. To determine the relative importance of these influencing factors

Project Description

The bank in this case wants to predict whether a customer will subscribe to a term deposit. To run a successful telemarketing campaign, the bank would like to know which customers are highly likely to subscribe to its offer.

The dataset provided by the bank contains the number of days since the last contact, which captures the recency aspect, and the number of contacts performed during the current and the previous campaign, which captures the frequency aspect of the marketing campaign.

For modelling purposes, I used the recency and frequency metrics to train the model because these metrics have high predictive power. The model is a binary classifier whose output is either 1 (the customer will subscribe) or 0 (the customer will not subscribe).

(back to top)

About Data Set

  • Title: Bank Telemarketing (with social/economic context)

  • Past Usage: The full dataset (bank-additional-full.csv) was described and analyzed in:

    S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.

  • All records are ordered by date (from May 2008 to November 2010). A detailed description can be found in Data_Dictionary.md.

(back to top)

Data Exploration

In this phase of the project, I tried to resolve common data challenges such as poor data quality, multicollinearity, and correlation between pairs of variables. The key insights are as follows -

Plot 1: Duration of call vs. Subscription
The longer the last call, the higher the probability that the client will subscribe.

Plot 2: Histogram of Duration with Subscription Overlay
Subscriptions decline when the call duration is close to 50 minutes. However, the outcome is almost certainly 'yes' when the duration is close to 65 minutes and 'no' when the duration exceeds 65 minutes.

Plot 3: Months vs. Subscription
Campaigns are most successful in December, March, October, and September.

Plot 4: Job Type vs. Subscription
Surprisingly, students and retired people are more likely to subscribe to a term deposit.

Plot 5: Previous Outcome vs. Current Subscription
If the outcome of the previous campaign was a success, the propensity of that client to subscribe to the term deposit is fairly high.

Plot 6: Education Level vs. Subscription
Illiterate clients are the most likely to subscribe. Among the remaining groups, the propensity to subscribe increases with the level of education.

From the preliminary data analysis, I concluded that the duration of the last call, the outcome of the previous campaign, and the month in which the customer was contacted have a significant impact on the final outcome. However, exploratory analysis alone cannot establish which of these factors is more or less important relative to the others.

A detailed description can be found in Bank_Marketing_Exploratory_Analysis.ipynb.
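The per-category comparisons behind plots such as job type or month versus subscription boil down to a grouped subscription rate. The following is a minimal sketch of that calculation, not the notebook's actual code. The column names `job` and `y` and the rows in `toy_rows` are assumptions made up for illustration; they are not taken from the bank dataset.

```python
from collections import defaultdict


def subscription_rate_by(rows, key):
    """Return {category: share of rows with y == 'yes'} for the given column."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        if row["y"] == "yes":
            positives[row[key]] += 1
    return {category: positives[category] / totals[category] for category in totals}


# Made-up rows for illustration only (hypothetical, not real records).
toy_rows = [
    {"job": "student", "y": "yes"},
    {"job": "student", "y": "no"},
    {"job": "retired", "y": "yes"},
    {"job": "admin.", "y": "no"},
    {"job": "admin.", "y": "no"},
    {"job": "admin.", "y": "yes"},
    {"job": "admin.", "y": "no"},
]

rates = subscription_rate_by(toy_rows, "job")

# Print categories from highest to lowest subscription rate.
for job, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{job:10s} {rate:.2f}")
```

A grouped rate like this shows which categories subscribe more often, but it says nothing about how much each attribute matters once the others are taken into account.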

(back to top)

Modeling Approach

Steps -

  1. Feature Engineering and Data Transformation of Categorical and Numerical Attributes
  2. Split the dataset into predictors and the response variable. In this case, the response is whether the customer subscribes or not.
  3. Split the dataset into training set (75%) and test set (25%)
  4. Create Model
  5. Fine-tune the hyperparameters
  6. Test Performance of the Final Model
  7. Report Performance Metrics
  8. Identify most important factors based on socioeconomic characteristics of the customers
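Steps 1 to 3 above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the project's pipeline: the helpers `one_hot`, `min_max_scale`, and `train_test_split` are hypothetical names written for this example, and the `months`, `durations`, and `labels` values are made-up toy data.

```python
import random


def one_hot(values):
    """One-hot encode categorical values; returns (sorted categories, encoded rows)."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        encoded.append(row)
    return categories, encoded


def min_max_scale(values):
    """Rescale numeric values to the [0, 1] range."""
    low, high = min(values), max(values)
    span = high - low
    return [0.0 if span == 0 else (value - low) / span for value in values]


def train_test_split(X, y, test_fraction=0.25, seed=42):
    """Shuffle row indices and hold out test_fraction of the rows for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Toy data (hypothetical): contact month, call duration, and the 0/1 response.
months = ["may", "jun", "may", "oct", "mar", "may", "jun", "oct"]
durations = [120, 300, 45, 900, 610, 80, 240, 530]
labels = [0, 0, 0, 1, 1, 0, 0, 1]

# Step 1: transform categorical and numerical attributes.
categories, month_rows = one_hot(months)
scaled = min_max_scale(durations)

# Step 2: predictors X and response y.
X = [month_row + [duration] for month_row, duration in zip(month_rows, scaled)]
y = labels

# Step 3: 75% training set, 25% test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_fraction=0.25)
print(len(X_train), len(X_test))
```

For brevity the sketch scales the numeric column before splitting. In practice the scaling statistics are usually computed on the training set only, so that no information from the test set leaks into training.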

(back to top)

Model Performance

Overall, our model achieved an accuracy of 91.66% on the test set.

The confusion matrix for this classification model is shown below.

Confusion Matrix
Number of False Positives, FP = 438
Number of False Negatives, FN = 392
Number of True Positives, TP = 664
Number of True Negatives, TN = 8457

Performance Metrics
False Positive Rate (FPR, Type I Error) = FP / (FP + TN) = 0.0492
False Negative Rate (FNR, Type II Error) = FN / (FN + TP) = 0.3712
True Positive Rate (TPR, Recall) = TP / (TP + FN) = 0.6288
True Negative Rate (TNR, Specificity) = TN / (TN + FP) = 0.9508
Accuracy = (TP + TN) / (TP + TN + FP + FN) = 0.9166
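The rates above follow directly from the four confusion-matrix counts. As a quick check, the sketch below recomputes them. The function name `classification_metrics` is a hypothetical helper for this example; the counts are the ones reported in this section.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive standard rates from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),  # false positive rate
        "fnr": fn / (fn + tp),  # false negative rate
        "tpr": tp / (tp + fn),  # true positive rate (recall)
        "tnr": tn / (tn + fp),  # true negative rate (specificity)
    }


metrics = classification_metrics(tp=664, tn=8457, fp=438, fn=392)
for name, value in metrics.items():
    print(f"{name:9s} {value:.4f}")
```

Note that FPR and TNR sum to 1, as do FNR and TPR, so only one rate of each pair needs to be reported.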

(back to top)

Feature Importance

SHAP Values
The horizontal bar plot shows the average impact of each feature on the model output.

Here, the duration of the last contact, the number of employees, and whether the last contact month of the year was May contribute the most to estimating whether a customer subscribes to a term deposit.

SHAP values measure feature importance at the row level. Each value represents how a feature influences the prediction for a single row, relative to the other features in that row and to the average outcome in the dataset. Features are ranked in descending order of influence. The length of each bar shows the magnitude of the influence that feature has on the final outcome.
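To make the row-level idea concrete, the sketch below computes exact Shapley values by enumerating every feature coalition for a tiny model, then averages their absolute values per feature, which is the quantity a SHAP bar plot ranks. This is an illustration of the underlying definition, not the `shap` library and not the project's code. The function `toy_model`, the generic feature names, and the three rows are all hypothetical, and "absent" features are replaced by the column means as an assumed baseline.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one row x; absent features take baseline values."""
    n = len(x)

    def value(present):
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                phi[i] += weight * (value(present | {i}) - value(present))
    return phi


def toy_model(features):
    """Hypothetical stand-in for a trained network's output."""
    return 0.5 * features[0] + 0.1 * features[1] - 0.2 * features[2]


names = ["feature_a", "feature_b", "feature_c"]
rows = [[1.0, 0.0, 2.0], [3.0, 1.0, 0.0], [2.0, 2.0, 1.0]]
baseline = [sum(column) / len(rows) for column in zip(*rows)]

# Row-level attributions, then mean absolute value per feature (global importance).
per_row = [shapley_values(toy_model, row, baseline) for row in rows]
mean_abs = [sum(abs(phi[i]) for phi in per_row) / len(per_row) for i in range(len(names))]
ranking = sorted(zip(names, mean_abs), key=lambda pair: pair[1], reverse=True)

for name, importance in ranking:
    print(f"{name:10s} {importance:.4f}")
```

Exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations for models with many inputs.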

(back to top)

Summary

  • The optimal set of hyperparameters for our neural network is -
    • neurons = 31
    • learning_rate = 0.1
    • batch_size = 2048
    • optimizer = adam
    • epochs = 100
  • Accuracy = 91.66%
  • Recall for (y=0) = 95%
  • Recall for (y=1) = 63%
  • Duration of the last contact is the most influential factor in determining whether a customer will subscribe to a term deposit.

(back to top)

Author

@Abbas S.

License

The MIT License (MIT)

Copyright (c) 2023 Abbas Singapurwala

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

(back to top)

Acknowledgments

Inspiration, code snippets, etc.

(back to top)
