2nd Assignment: Logistic Regression

This assignment was completed by a pair of students:

Elizaveta Nosova
Artem Bisliouk

This assignment involves implementing and analyzing logistic regression with both Maximum Likelihood Estimation (MLE) and Maximum a Posteriori (MAP) estimation, using the Spambase dataset for spam email detection. The primary objective is to explore the impact of different regularization strategies and preprocessing steps on model performance and interpretability.

Research Area

Logistic Regression, Maximum Likelihood Estimation (MLE), Maximum a Posteriori (MAP) Estimation, Gaussian Prior, Regularization (L2), Gradient Descent, Stochastic Gradient Descent (SGD), Feature Importance, Normalization.

Tasks

Dataset Statistics:
- Implemented kernel density plots for each feature in the dataset to examine distributions and identify potential outliers.
- Normalized the dataset to have mean 0 and variance 1, then re-plotted kernel density distributions to compare changes post-normalization.
Logistic Regression with MLE:
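- Derived analytical expressions to demonstrate that rescaling and shifting features do not change the fitted MLE model when a bias term is used: the maximized likelihood and the predicted probabilities stay the same, while the weights and bias transform accordingly.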
- Completed the log_likelihood and gradient functions to compute the log-likelihood and gradient for logistic regression.
- Implemented gradient descent (GD) for logistic regression, with the option to vectorize epochs for enhanced computational efficiency.
- Extended the implementation to support stochastic gradient descent (SGD) and compared the behavior of both methods.
Prediction and Feature Analysis:
- Developed prediction functions to compute probabilities and classify labels using logistic regression.
- Analyzed the learned weight vector to assess feature importance, identifying features most and least relevant to spam detection.
Logistic Regression with MAP:
- Implemented MAP estimation using a Gaussian prior (L2 regularization), modifying the gradient descent function to incorporate the regularization term.
- Investigated the effect of varying the regularization parameter $\lambda$ on training and test log-likelihoods, as well as prediction accuracy.
- Examined the impact of $\lambda$ on the weight vector, discussing the effects of strong priors on feature importance and model interpretability.
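
The MLE and MAP tasks above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: it uses only the Python standard library so that it runs anywhere (a vectorized NumPy version would be the natural choice in practice). The names `log_likelihood` and `gradient` follow the task list; `fit_gd`, `predict_proba`, `predict`, the hyperparameters, and the toy data are hypothetical. The sketch assumes the bias is handled as a leading column of ones and that the L2 penalty `(lam / 2) * ||w||^2` is applied to every weight, bias included.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_sigmoid(z):
    """log(sigmoid(z)) computed without overflow for large |z|."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def log_likelihood(X, y, w):
    """Sum over examples of y*log(p) + (1 - y)*log(1 - p), p = sigmoid(w . x)."""
    total = 0.0
    for x, yi in zip(X, y):
        z = dot(w, x)
        total += log_sigmoid(z) if yi == 1 else log_sigmoid(-z)
    return total

def gradient(X, y, w):
    """Gradient of the log-likelihood: sum_i (y_i - p_i) * x_i."""
    g = [0.0] * len(w)
    for x, yi in zip(X, y):
        err = yi - sigmoid(dot(w, x))
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g

def fit_gd(X, y, lam=0.0, lr=0.1, epochs=200):
    """Gradient ascent on log-likelihood - (lam / 2) * ||w||^2.

    lam = 0 gives the MLE; lam > 0 gives the MAP estimate under a
    zero-mean Gaussian prior (assumed here to cover all weights).
    """
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        g = gradient(X, y, w)
        w = [wj + lr * (gj - lam * wj) for wj, gj in zip(w, g)]
    return w

def predict_proba(X, w):
    return [sigmoid(dot(w, x)) for x in X]

def predict(X, w, threshold=0.5):
    return [1 if p >= threshold else 0 for p in predict_proba(X, w)]

# Made-up, non-separable toy data (not Spambase); first column is the bias.
X = [[1, -2.0], [1, -1.0], [1, 0.5], [1, -0.5], [1, 1.0], [1, 2.0]]
y = [0, 0, 0, 1, 1, 1]

w_mle = fit_gd(X, y)            # no prior
w_map = fit_gd(X, y, lam=1.0)   # Gaussian prior shrinks the weights
```

Comparing `w_mle` with `w_map` for several values of `lam` reproduces, on a small scale, the kind of weight-shrinkage analysis described in the MAP task.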

Dataset Overview

The Spambase dataset consists of 4601 emails, each labeled as either spam (1) or non-spam (0). Each example has 57 features:

48 word frequency features (percentage occurrences of specific words in the email)
6 character frequency features (percentage occurrences of certain characters)
3 uppercase letter sequence features (related to length statistics of uppercase letter sequences)

The dataset is split into:

Training Set: 3065 examples
Test Set: 1536 examples
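
With a train/test split like this one, the normalization to mean 0 and variance 1 listed under Tasks might look like the sketch below. It assumes the mean and standard deviation are computed on the training set only and then reused for the test set; that is common practice but is not stated in this README. The function names `fit_standardizer` and `standardize` and the toy values are hypothetical, and plain Python is used only to keep the example self-contained.

```python
import math

def fit_standardizer(X_train):
    """Return per-feature means and (population) standard deviations."""
    n, d = len(X_train), len(X_train[0])
    means = [sum(row[j] for row in X_train) / n for j in range(d)]
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in X_train) / n)
        for j in range(d)
    ]
    # Guard against constant features to avoid division by zero.
    stds = [s if s > 0 else 1.0 for s in stds]
    return means, stds

def standardize(X, means, stds):
    """Shift and rescale each feature with the given statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

# Made-up values standing in for the real feature matrices.
X_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
X_test = [[2.0, 30.0]]

means, stds = fit_standardizer(X_train)
X_train_std = standardize(X_train, means, stds)
X_test_std = standardize(X_test, means, stds)  # reuses training statistics
```

After this step each training feature has mean 0 and variance 1, while the test features are only approximately standardized because they are scaled with the training statistics.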

Structure

Assignment task sheet
Assignment solution notebook .ipynb
Assignment solution notebook .pdf with outputs
Assignment report and corresponding LaTeX project
Resulting plots
