ai_cross_validation_hyperparameter_optimization
The ai_data_balancer.py module addresses the problem of imbalanced datasets in machine learning workflows by implementing techniques like oversampling (via SMOTE) or other configurable strategies. This is a critical preprocessing step to improve model performance and reduce bias toward the majority class.
The associated ai_data_balancer.html provides further explanations, interactive examples, and visual aids for implementing proper data balancing strategies in practical machine learning use cases.
Class imbalance can harm prediction accuracy for minority classes, especially in fields like:
- Fraud Detection (e.g., predicting fraudulent transactions),
- Anomaly Detection,
- Medical Diagnostics (e.g., detecting rare diseases),
- Any task where minority class predictions are critical.
This module provides a standardized and extensible framework to automate data balancing in pipeline-based workflows.
- [Introduction](#introduction)
- [Purpose](#purpose)
- [Key Features](#key-features)
- [How It Works](#how-it-works)
  - [Rebalancing with SMOTE](#rebalancing-with-smote)
  - [Error Handling for Strategies](#error-handling-for-strategies)
- [Dependencies](#dependencies)
- [Usage](#usage)
  - [Basic Example](#basic-example)
  - [Advanced Examples](#advanced-examples)
    - [Using Custom Resampling Ratios](#using-custom-resampling-ratios)
    - [Handling Sparse Datasets](#handling-sparse-datasets)
    - [Pipeline with Preprocessing](#pipeline-with-preprocessing)
- [Best Practices](#best-practices)
- [Extending the Data Balancer](#extending-the-data-balancer)
- [Integration Opportunities](#integration-opportunities)
- [Future Enhancements](#future-enhancements)
The DataBalancer class provides a simple interface to balance imbalanced datasets using the popular SMOTE (Synthetic Minority Oversampling Technique) method, along with extension points for adding other techniques like undersampling or hybrid methods.
The rebalance_data method:
- Takes as input the feature matrix (X) and label vector (y).
- Balances the dataset by generating synthetic samples for minority classes or performing other balancing actions based on the specified strategy.
- Currently supports SMOTE, with robust error handling for extensibility.
The core goals of the DataBalancer module include:
- Equalizing the representation of classes in the dataset.
- Reducing overfitting caused by imbalanced data by providing the model with diverse and representative inputs.
- Preventing bias toward majority classes in classification problems.
- Providing flexibility to include various balancing strategies beyond SMOTE.
This is particularly useful in scenarios where predictions for minority classes hold great importance.
The DataBalancer module offers the following features:
- State-of-the-Art Oversampling:
  - Includes SMOTE, which generates synthetic samples for minority classes based on their nearest neighbors.
- Extensibility:
  - Designed to support additional balancing techniques (e.g., random oversampling, undersampling, hybrid methods) in the future.
- Minimal API Design:
  - A single method, rebalance_data(), manages the balancing process.
- Logging and Error Handling:
  - Ensures robust logging and graceful handling of configuration or runtime errors.
- Scikit-learn Data Integration:
  - Compatible with Scikit-learn-style datasets and pipelines, making integration into model workflows straightforward.
The DataBalancer module contains a single method, rebalance_data():
- Arguments:
  - X: Input feature matrix.
  - y: Label vector.
  - strategy: The strategy used for balancing:
    - "SMOTE" (default): Applies synthetic oversampling to balance the dataset.
    - Custom strategies can be added (e.g., undersampling).
- Workflow:
  - Logs the selected balancing strategy and initializes the rebalancing process.
  - For SMOTE:
    - Calculates synthetic samples for minority classes based on nearest neighbors.
    - Merges real and synthetic samples to balance the dataset.
  - Returns the balanced dataset (X_balanced, y_balanced).
- Error Handling: Gracefully handles missing data, unsupported strategies, and runtime errors.
The SMOTE implementation works as follows:
- Identifies the minority class using label frequencies.
- Generates synthetic samples for the minority class by interpolating between nearest neighbors.
- Balances the dataset by combining original with synthetic samples.
Example:

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_balanced, y_balanced = smote.fit_resample(X, y)
```

If an unsupported or invalid strategy is passed, the method logs an appropriate message and returns the original dataset without modification.
Example Logging Output:

```
ERROR:root:Unknown balancing strategy: undersample
```
The module requires the following dependencies:
- pandas: For flexible data manipulation.
- imblearn: For SMOTE and other resampling methods.
- logging: For detailed reporting of the balancing process.

To install the necessary dependencies, use pip:

```
pip install pandas imbalanced-learn
```

The following examples demonstrate how to use the DataBalancer module.
Rebalancing a dataset with SMOTE.
```python
import pandas as pd
from sklearn.datasets import make_classification

from ai_data_balancer import DataBalancer

# Generate an imbalanced dataset
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9], n_samples=1000, random_state=42)

# Balance the dataset
X_balanced, y_balanced = DataBalancer.rebalance_data(X, y, strategy="SMOTE")

# Display results
print("Original class distribution:", pd.Series(y).value_counts())
print("Balanced class distribution:", pd.Series(y_balanced).value_counts())
```

Output:

```
INFO:root:Rebalancing data using SMOTE...
Original class distribution:
1    900
0    100
Balanced class distribution:
1    900
0    900
```
SMOTE allows you to define specific class balance ratios.
```python
from imblearn.over_sampling import SMOTE

# Custom SMOTE with a specific sampling ratio per class
smote = SMOTE(sampling_strategy={0: 500, 1: 900})
X_balanced, y_balanced = smote.fit_resample(X, y)
print("Balanced class distribution:", pd.Series(y_balanced).value_counts())
```

For large sparse datasets, SMOTE can work alongside scipy.sparse data.
```python
from scipy.sparse import csr_matrix

# Convert the dataset to sparse format
X_sparse = csr_matrix(X)

# SMOTE also accepts sparse matrices
X_balanced, y_balanced = DataBalancer.rebalance_data(X_sparse, y)
```

Combine SMOTE with other transformations in a Scikit-learn pipeline.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

# imblearn's make_pipeline is required so that SMOTE runs only during fit,
# never when the pipeline is used for prediction
pipeline = make_pipeline(StandardScaler(), SMOTE(), LogisticRegression())
pipeline.fit(X, y)
```

- Understand the Data:
  - Avoid applying SMOTE or any resampling technique blindly; analyze class distributions first.
- Combine with Preprocessing:
  - Incorporate data normalization (e.g., StandardScaler) alongside sampling techniques.
- Validate on the Original Distribution:
  - Train on the rebalanced data, but validate on the original, imbalanced distribution to ensure real-world applicability.
- Choose the Right Metric:
  - When balancing data, use metrics like F1-score or AUC-ROC instead of accuracy alone.
To add other strategies (e.g., undersampling), you can extend the rebalance_data method:
Adding Undersampling Example:
```python
from imblearn.under_sampling import RandomUnderSampler

if strategy == "undersample":
    rus = RandomUnderSampler()
    X_balanced, y_balanced = rus.fit_resample(X, y)
    logging.info("Data rebalanced using undersampling.")
    return X_balanced, y_balanced
```

The DataBalancer module can be integrated into:
- AI/ML Pipelines:
  - Use it as a preprocessing step in Scikit-learn pipelines.
- AutoML Frameworks:
  - Automate balancing for datasets as part of the data cleaning process.
- Data Science Applications:
  - Fraud detection, anomaly detection, and medical research rely heavily on balanced datasets.
The following features can further improve the module:
- Hybrid Approaches:
  - Combine SMOTE with undersampling to balance datasets efficiently in large imbalanced cases.
- Support for Multiclass SMOTE:
  - Extend balancing functionality to handle multiclass datasets.
- Custom Logging and Monitoring:
  - Integrate metrics logging for automated dashboards.
- Performance Optimization:
  - Support GPU-accelerated resampling with tools like RAPIDS for larger datasets.
The ai_data_balancer.py module is part of the G.O.D. Framework. Modification and redistribution are subject to the licensing terms outlined by the framework. For support, contact the development team.
The DataBalancer module provides a highly effective solution for handling imbalanced datasets and improving predictive performance. With its easy-to-use interface, compatibility with Scikit-learn workflows, and extensibility options, it is an indispensable tool for modern machine learning pipelines.