ML & Data Engineering Challenge

A professional-grade machine learning pipeline that demonstrates ETL (Extract-Transform-Load) processes, model training, and deployment using modern best practices.

Project Structure

ml_challenge/
├── data/                       # Sample input/output data
│   └── user_logs.json          # Example JSON dataset
│
├── src/                        # Source code modules
│   ├── __init__.py             # Package initialization
│   ├── ingest.py               # Data loading and validation
│   ├── transform.py            # Data transformation and feature engineering
│   ├── train.py                # Model training and evaluation
│   ├── serialize.py            # Model persistence and versioning
│   ├── api.py                  # FastAPI web service for predictions
│   └── pipeline.py             # Main ETL + ML workflow orchestrator
│
├── tests/                      # Unit tests
│   ├── test_ingest.py          # Data ingestion tests
│   └── test_transform.py       # Data transformation tests
│
├── models/                     # Trained model artifacts
├── Dockerfile                  # ML container definition
├── docker-compose.yml          # Service orchestration
├── requirements.txt            # Python dependencies
├── generate_sample_data.py     # Sample data generator
└── README.md                   # This file

Features

  • Professional Code Structure: Modular design with clear separation of concerns
  • Comprehensive ETL Pipeline: Data ingestion, validation, transformation, and aggregation
  • Advanced Feature Engineering: Time-series features, lag variables, rolling averages
  • Multiple ML Models: Linear Regression and Random Forest with cross-validation
  • Model Versioning: Comprehensive model serialization with metadata and integrity checks
  • RESTful API: FastAPI-based web service for real-time predictions
  • Containerization: Docker support with health checks and orchestration
  • Testing: Unit tests with pytest framework
  • Logging: Comprehensive logging throughout the pipeline
  • Error Handling: Robust error handling with custom exceptions

Requirements

  • Python 3.11+
  • Docker and Docker Compose (for containerized deployment)
  • 4GB+ RAM (for model training)

Installation

Option 1: Local Development

  1. Clone the repository

    git clone <your-repo-url>
    cd ml_challenge
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt

Option 2: Docker Deployment

  1. Build and run with Docker Compose
    docker-compose up --build

Quick Start

1. Generate Sample Data

python generate_sample_data.py

This creates data/user_logs.json with sample user activity data.

2. Run the Complete Pipeline

python -m src.pipeline

This executes the full ETL + ML workflow:

  • Data ingestion and validation
  • Data transformation and feature engineering
  • Model training and evaluation
  • Model serialization

3. Start the Web API

python -m uvicorn src.api:app --host 0.0.0.0 --port 5000

Access the API at http://localhost:5000. Interactive documentation is available at http://localhost:5000/docs.

Configuration

The pipeline can be configured through the MLPipeline class:

config = {
    'data_path': 'data/user_logs.json',
    'model_dir': 'models',
    'timeframe_days': 30,
    'test_size': 0.2,
    'random_state': 42
}

pipeline = MLPipeline(config)
pipeline.run_pipeline()

API Usage

Single Prediction

curl -X POST "http://localhost:5000/predict" \
     -H "Content-Type: application/json" \
     -d '{
       "day_of_week": 1,
       "month": 6,
       "is_weekend": 0,
       "total_events": 85,
       "avg_hour": 14.5,
       "rolling_7d_users": 120.3,
       "rolling_7d_events": 95.7
     }'

Batch Prediction

curl -X POST "http://localhost:5000/batch-predict" \
     -H "Content-Type: application/json" \
     -d '[
       {
         "day_of_week": 1,
         "month": 6,
         "is_weekend": 0,
         "total_events": 85,
         "avg_hour": 14.5,
         "rolling_7d_users": 120.3,
         "rolling_7d_events": 95.7
       }
     ]'
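
The same endpoints can be called from Python. The snippet below is a minimal sketch using the requests library and the payload from the examples above; it assumes the API is running locally on port 5000.

import requests

payload = {
    "day_of_week": 1,
    "month": 6,
    "is_weekend": 0,
    "total_events": 85,
    "avg_hour": 14.5,
    "rolling_7d_users": 120.3,
    "rolling_7d_events": 95.7,
}

# Single prediction
response = requests.post("http://localhost:5000/predict", json=payload)
print(response.json())

# Batch prediction takes a list of payloads
response = requests.post("http://localhost:5000/batch-predict", json=[payload])
print(response.json())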

Testing

Run the test suite:

# Run all tests
pytest

# Run with coverage
pytest --cov=src

# Run specific test file
pytest tests/test_ingest.py

Docker Commands

Build and Run

# Build the image
docker build -t ml-challenge .

# Run the container
docker run -p 5000:5000 ml-challenge

# Run with Docker Compose
docker-compose up --build

Service Profiles

# Run only the API service
docker-compose up ml-api

# Run the data processing pipeline
docker-compose --profile pipeline up

# Generate sample data
docker-compose --profile data-gen up data-generator

Pipeline Components

1. Data Ingestion (src/ingest.py)

  • JSON data loading with validation
  • Data quality checks
  • Summary statistics generation
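
As a rough illustration of what this step does (the function name and column names below are assumptions for the sketch, not the module's actual API):

import json
from pathlib import Path

import pandas as pd

def load_user_logs(path: str) -> pd.DataFrame:
    """Load the JSON log file, run basic quality checks, and report summary stats."""
    records = json.loads(Path(path).read_text())
    df = pd.DataFrame(records)

    # Quality checks: required columns present and at least one record
    required = {"user_id", "timestamp", "event_type"}   # assumed schema
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    if df.empty:
        raise ValueError("Input file contains no records")

    # Summary statistics for logging / inspection
    print(df.describe(include="all"))
    return df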

2. Data Transformation (src/transform.py)

  • Timestamp parsing and feature extraction
  • Timeframe filtering (configurable via varOcg)
  • Daily aggregation and metrics calculation
  • Advanced feature engineering (lag, rolling averages, trends)
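
For example, the lag and rolling-average features described above could be built roughly like this (column names are illustrative assumptions):

import pandas as pd

def add_time_features(daily: pd.DataFrame) -> pd.DataFrame:
    """Add temporal, lag, and rolling-average features to a daily aggregate."""
    daily = daily.sort_values("date").copy()
    dates = pd.to_datetime(daily["date"])

    # Temporal features
    daily["day_of_week"] = dates.dt.dayofweek
    daily["month"] = dates.dt.month
    daily["is_weekend"] = (daily["day_of_week"] >= 5).astype(int)

    # Lag and 7-day rolling-average features
    daily["lag_1d_users"] = daily["daily_users"].shift(1)
    daily["rolling_7d_users"] = daily["daily_users"].rolling(7, min_periods=1).mean()
    daily["rolling_7d_events"] = daily["total_events"].rolling(7, min_periods=1).mean()
    return daily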

3. Model Training (src/train.py)

  • Feature preparation and scaling
  • Multiple model training (Linear Regression, Random Forest)
  • Cross-validation and evaluation metrics
  • Model comparison and selection
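
A minimal sketch of the training-and-comparison logic (model parameters and the selection metric are assumptions; the actual module reports several metrics including RMSE, R², MAE, and MAPE):

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

def compare_models(X, y):
    """Score both candidate models with 5-fold CV and return the one with the lowest RMSE."""
    X_scaled = StandardScaler().fit_transform(X)
    candidates = {
        "linear_regression": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
    }
    rmse = {
        name: -cross_val_score(
            model, X_scaled, y, cv=5, scoring="neg_root_mean_squared_error"
        ).mean()
        for name, model in candidates.items()
    }
    best = min(rmse, key=rmse.get)  # lowest cross-validated RMSE wins
    return best, rmse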

4. Model Serialization (src/serialize.py)

  • Model persistence with joblib
  • Metadata tracking and versioning
  • File integrity checks
  • Backup and restoration capabilities
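
The persistence-with-integrity-check idea can be sketched as follows (file naming and metadata fields are illustrative, not the module's actual layout):

import hashlib
import json
from pathlib import Path

import joblib

def save_model(model, path: str) -> dict:
    """Persist the model with joblib and record a SHA-256 checksum alongside it."""
    joblib.dump(model, path)
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    metadata = {"path": path, "sha256": digest}
    Path(path).with_suffix(".meta.json").write_text(json.dumps(metadata, indent=2))
    return metadata

def verify_model(path: str) -> bool:
    """Recompute the checksum and compare it with the stored metadata."""
    metadata = json.loads(Path(path).with_suffix(".meta.json").read_text())
    return hashlib.sha256(Path(path).read_bytes()).hexdigest() == metadata["sha256"]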

5. Web API (src/api.py)

  • FastAPI-based REST service
  • Real-time predictions
  • Batch processing support
  • Health monitoring and model status
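
In outline, a prediction endpoint of this kind looks like the sketch below. Only the fields from the curl examples are shown for brevity; the actual service validates all 16 engineered features.

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("models/best_model.joblib")
scaler = joblib.load("models/best_model_scaler.joblib")

class Features(BaseModel):
    day_of_week: int
    month: int
    is_weekend: int
    total_events: float
    avg_hour: float
    rolling_7d_users: float
    rolling_7d_events: float

@app.post("/predict")
def predict(features: Features):
    row = [[
        features.day_of_week, features.month, features.is_weekend,
        features.total_events, features.avg_hour,
        features.rolling_7d_users, features.rolling_7d_events,
    ]]
    prediction = model.predict(scaler.transform(row))[0]
    return {"prediction": float(prediction)}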

6. Pipeline Orchestration (src/pipeline.py)

  • End-to-end workflow coordination
  • Step-by-step execution
  • State tracking and logging
  • Error handling and recovery

Model Training & Selection

Overview

The ML pipeline trains multiple models and automatically selects the best performing one for production use.

Models Trained

  1. Linear Regression (sklearn.linear_model.LinearRegression)

    • Fast, interpretable, good for linear relationships
    • Uses 16 engineered features
    • Suitable for continuous target prediction
  2. Random Forest (sklearn.ensemble.RandomForestRegressor)

    • Robust, handles non-linear relationships
    • Feature importance analysis
    • Good for complex patterns

Model Selection Process

The system automatically selects the best model based on:

  • Cross-validation scores (5-fold CV)
  • Multiple metrics: RMSE, R², MAE, MAPE
  • Performance consistency across folds
  • Computational efficiency considerations

Current Model Status

  • Selected Model: Linear Regression
  • Model Files:
    • best_model.joblib - Trained model
    • best_model_scaler.joblib - Feature scaler
  • Features Used: 16 engineered features
  • Training Data: 3,026 records
  • Model Performance: Available in models/best_model_results.json

Feature Engineering

The model uses 16 carefully engineered features:

  • Temporal: day_of_week, month, is_weekend
  • Event-based: total_events, avg_hour, std_hour
  • Behavioral: weekend_ratio, rolling averages
  • Historical: lag features, moving averages, trends

Model Persistence

  • Serialization: Using joblib for efficient storage
  • Versioning: Model metadata and performance tracked
  • Integrity: SHA-256 hash verification
  • Backup: Automatic backup before updates

API Integration

  • Real-time Predictions: Single and batch prediction endpoints
  • Model Status: Health checks and model information
  • Feature Validation: Input validation for all 16 features
  • Error Handling: Graceful degradation on model issues

Monitoring and Logging

  • Pipeline State: Stored in pipeline_state.json
  • Logs: Written to pipeline.log and console
  • Model Artifacts: Stored in models/ directory
  • Health Checks: Available at /health endpoint
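
One way to reproduce the dual console/file logging described above (the format string is an assumption):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    handlers=[logging.FileHandler("pipeline.log"), logging.StreamHandler()],
)
logger = logging.getLogger("pipeline")
logger.info("Pipeline step completed")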

Error Handling

The pipeline includes comprehensive error handling:

  • Custom Exceptions: Specific error types for each module
  • Graceful Degradation: Pipeline continues with available data
  • Detailed Logging: Comprehensive error information
  • State Persistence: Pipeline state saved even on failure
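
The module-specific exception hierarchy might look like this (class names are illustrative, not the project's actual names):

class PipelineError(Exception):
    """Base class for all pipeline failures."""

class IngestionError(PipelineError):
    """Raised when input data cannot be loaded or fails validation."""

class TransformationError(PipelineError):
    """Raised when feature engineering cannot be completed."""

class TrainingError(PipelineError):
    """Raised when model training or evaluation fails."""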

Key Variables

As required by the challenge:

  • __define-ocg__: Present in all module comments
  • varOcg: Set to 30 (timeframe in days)
  • varFiltersCg: Context-specific identifier for each module

Development

Adding New Features

  1. Create new module in src/ directory
  2. Add tests in tests/ directory
  3. Update requirements.txt if new dependencies are needed
  4. Update Dockerfile if new system dependencies are required

Code Style

  • Follow PEP 8 guidelines
  • Include comprehensive docstrings
  • Use type hints throughout
  • Implement proper error handling

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

License

This project is part of the ML & Data Engineering Challenge.

Support

For issues and questions:

  1. Check the logs in pipeline.log
  2. Review the API documentation at /docs
  3. Check the pipeline state in pipeline_state.json
  4. Verify Docker container health with docker-compose ps
