This project builds a logistic regression model that assigns each lead a score from 0–100 reflecting conversion likelihood. Sales teams can use those scores to prioritize outreach on high-probability leads, which raises effective conversion compared to treating all leads equally.
- Data cleaning — Handle missing values, duplicates, and placeholder levels (e.g. categorical “Select”).
- EDA — Explore distributions, relationships to conversion, and data quality.
- Feature engineering — Encode categoricals, treat outliers, and prepare inputs for modeling.
- Logistic regression — Train and evaluate with scikit-learn and statsmodels (including RFE-based feature selection where used).
- Lead scoring (0–100) — Map model output to an interpretable score so “hot” leads rank higher than “cold” ones.
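The cleaning, encoding, and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: the toy rows are made up, and the helper names `clean_leads` and `to_lead_score` are hypothetical. It assumes the score is simply the predicted conversion probability scaled to 0–100.

```python
import numpy as np
import pandas as pd

# Toy records: column names mirror the dataset, values are made up for illustration.
raw = pd.DataFrame({
    "Lead Source": ["Google", "Select", "Direct Traffic", "Google", "Google"],
    "Total Visits": [3, 5, 2, 3, 8],
    "Converted": [1, 0, 0, 1, 1],
})

def clean_leads(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and treat the 'Select' placeholder as missing."""
    out = df.drop_duplicates().copy()
    return out.replace("Select", np.nan)

def to_lead_score(prob):
    """Map conversion probabilities in [0, 1] to integer scores in [0, 100]."""
    return np.rint(np.clip(prob, 0.0, 1.0) * 100).astype(int)

cleaned = clean_leads(raw)

# One-hot encode the categorical column; rows with a missing value get all-zero dummies.
encoded = pd.get_dummies(cleaned, columns=["Lead Source"])

# Hypothetical model probabilities, only to show the score mapping.
scores = to_lead_score(np.array([0.07, 0.52, 0.914]))
print(scores)
```

How missing values are handled after the placeholder is replaced (imputation, a separate "Unknown" level, or dropping the column) is a modeling decision this sketch leaves open.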
When leads are prioritized by model score, the target conversion rate for the focused segment is about 80%, compared with a baseline of roughly 30% across all leads. That gap is the case for score-based targeting over uniform outreach.
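To make the comparison concrete, the arithmetic behind those two rates can be written out as below. The 0.80 and 0.30 figures are the approximate rates quoted above; the batch of 100 contacted leads is a hypothetical number chosen only to make the difference easy to read.

```python
def expected_conversions(rate: float, contacted: int) -> float:
    """Expected number of conversions when `contacted` leads convert at `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return rate * contacted

BASELINE_RATE = 0.30  # approximate conversion rate across all leads
FOCUSED_RATE = 0.80   # approximate target rate on the score-prioritized segment
CONTACTED = 100       # hypothetical batch size, for illustration only

# Lift: how many times more likely a prioritized lead is to convert.
lift = FOCUSED_RATE / BASELINE_RATE

baseline_wins = expected_conversions(BASELINE_RATE, CONTACTED)
focused_wins = expected_conversions(FOCUSED_RATE, CONTACTED)

print(f"lift: {lift:.2f}x")
print(f"per {CONTACTED} contacts: {baseline_wins:.0f} vs {focused_wins:.0f} conversions")
```

Under these figures, the same outreach effort yields roughly 2.7 times as many conversions when it is spent on the prioritized segment.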
The dataset contains roughly 9,000 historical leads with attributes such as Lead Source, Total Time Spent on Website, Total Visits, Last Activity, and other marketing and profile fields. The binary target is Converted (1 = converted, 0 = not). Column definitions are documented in Leads Data Dictionary.xlsx (included in the repository).
Python · Pandas · NumPy · scikit-learn · Matplotlib · Seaborn
(Notebooks may also use statsmodels for inference-style logistic regression and feature refinement.)
Clone the repository, install dependencies for the notebooks you plan to run (e.g. pandas, numpy, scikit-learn, matplotlib, seaborn, and optionally statsmodels), then start Jupyter:
`jupyter notebook`
Open a notebook from the repo root and run cells in order. Ensure Leads.csv is in the working directory expected by the notebook paths.
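If a notebook fails on its first read of the data, a small guard like the one below can make the cause obvious. It is only a sketch: `load_leads` is a hypothetical helper, not something the notebooks define, and the default path assumes Leads.csv sits in the current working directory.

```python
from pathlib import Path

import pandas as pd

def load_leads(path="Leads.csv") -> pd.DataFrame:
    """Load the leads file, failing early with a readable message if it is missing."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found (current directory: {Path.cwd()})"
        )
    df = pd.read_csv(csv_path)
    # The binary target described in this README must be present for modeling.
    if "Converted" not in df.columns:
        raise KeyError("expected a 'Converted' target column")
    return df
```

Calling `load_leads()` from the repo root should then return the full lead table, or say exactly where it looked.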
- Leads.csv: Lead-level records used for modeling.
- Leads Data Dictionary.xlsx: Field definitions for the dataset.
- Lead scoring case study .ipynb / LeadScoringCaseStudy-Ver2.ipynb: Analysis and model notebooks (see below).
| File | Description |
|---|---|
| Lead scoring case study .ipynb | Earlier walkthrough: loading and cleaning data, profiling-style exploration (e.g. pandas-profiling), and foundational EDA. |
| LeadScoringCaseStudy-Ver2.ipynb | Structured end-to-end pipeline: data cleaning, EDA (uni- and bivariate), dummy variables, scaling, logistic regression with RFE/statsmodels refinement, evaluation (including ROC), and lead scores / hot-lead identification on holdout data. |
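
For readers who want the shape of the Ver2 pipeline without opening the notebook, here is a compact sketch of the modeling stages (scaling, RFE, logistic regression, ROC evaluation, holdout scoring). It is not the notebook's code: it runs on synthetic data from `make_classification` rather than Leads.csv, uses scikit-learn only (no statsmodels refinement), and the choices of 10 selected features and an 80-point hot-lead cutoff are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded lead features (roughly 30% positives).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=6,
    weights=[0.7, 0.3], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scale, keep 10 features via recursive feature elimination, then fit the final model.
model = make_pipeline(
    StandardScaler(),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Evaluate on the holdout set and convert probabilities to 0-100 lead scores.
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
lead_scores = np.rint(proba * 100).astype(int)

HOT_CUTOFF = 80  # hypothetical threshold for flagging "hot" leads
hot_mask = lead_scores >= HOT_CUTOFF

print(f"holdout ROC AUC: {auc:.3f}")
print(f"hot leads: {hot_mask.sum()} of {len(lead_scores)}")
```

In practice the cutoff would be chosen from the ROC or precision/recall trade-off on real data rather than fixed in advance.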
This project is licensed under the MIT License (Copyright © 2021 Anmol Jaiswal).