martinkl164/log-classifier

Log Intent Classifier

FastAPI service that classifies error messages by intent (not just log level). It uses TF-IDF + Logistic Regression to map an error string into 10 practical categories (e.g., NULL_REFERENCE, NETWORK_CONNECTION, RESOURCE_EXHAUSTION), returning a confidence score + suggested action.

[Image: Log classifier (before → after), assets/log_classifier_before_after.png]

Important: No LLM Required

This solution uses traditional machine learning (TF-IDF + Logistic Regression), NOT a large language model:

  • No model download required - the model is trained locally from scratch using your training data
  • No GPU needed - runs on any CPU
  • No internet required after initial pip install - fully offline capable
  • Training takes seconds - not hours
  • Model size is tiny - tens of KB, not GB

Features

  • RESTful API with FastAPI
  • Semantic error classification by intent (not just log levels)
  • 10 human-readable error categories
  • Confidence scores and probability distributions
  • Category descriptions and suggested actions for each error type
  • Single and batch classification endpoints
  • Health check and category listing endpoints
  • Interactive API documentation (Swagger UI)
  • Enhanced model with feature engineering and hyperparameter tuning

Error Categories

The API classifies errors into 10 semantic categories:

| Category | Description | Example |
|---|---|---|
| NULL_REFERENCE | Null pointer or reference errors | "NullPointerException in UserService" |
| RESOURCE_EXHAUSTION | Out of memory, disk full, connection pool exhausted | "OutOfMemoryError: Java heap space" |
| NETWORK_CONNECTION | Connection timeout, refused, unreachable | "Connection timeout after 30 seconds" |
| DUPLICATE_CONFLICT | Entry already exists, unique constraint violation | "Duplicate key error: email already exists" |
| DATABASE_ERROR | SQL errors, deadlocks, transaction failures | "SQLException: table 'users' does not exist" |
| TIMEOUT | Operation exceeded time limit | "Request timeout: exceeded 30 second limit" |
| CONFIGURATION | Missing/invalid configuration | "Required property 'database.url' not found" |
| NOT_FOUND | Resource, file, or endpoint not found | "FileNotFoundException: config.properties" |
| AUTHENTICATION | Invalid credentials, access denied | "Authentication failed: invalid password" |
| VALIDATION | Input validation or format errors | "Invalid email format" |

Project Structure

log-intent-classifier/
├── .gitignore                # Ignore venv, generated models, etc.
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── run.py                    # API server entry point
├── test_server.py            # Minimal server smoke-test (optional)
├── assets/
│   └── log_classifier_before_after.png  # README image
├── data/
│   └── training_data.csv     # Training data (~98 examples)
├── models/                   # Saved model files (created after training; ignored by git)
│   ├── feature_union.pkl     # Feature pipeline (TF-IDF + engineered features)
│   └── logistic_model.pkl    # Trained Logistic Regression model
├── scripts/
│   └── generate_linkedin_image.py  # Generates the README image (optional)
└── src/
    ├── __init__.py
    ├── train_model.py        # Training script with hyperparameter tuning
    ├── classifier.py         # Error classifier with category metadata
    └── api.py                # FastAPI application

Installation & Setup

Prerequisites

  • Python 3.9 or higher installed
  • pip (Python package manager)

Step 1: Setup Environment

# Navigate to project folder
cd "C:\path\to\log-intent-classifier"

# Create virtual environment
python -m venv venv

# Activate virtual environment (Windows PowerShell)
.\venv\Scripts\Activate.ps1
# On macOS/Linux, activate with this instead:
# source venv/bin/activate

# Install dependencies (downloads packages from PyPI, ~50MB total)
pip install -r requirements.txt

Note: If you get an execution policy error in PowerShell, run:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Step 2: Train the Model

Train the model using the provided training data:

python -m src.train_model

What happens:

  • Reads data/training_data.csv (~98 examples across 10 categories)
  • Applies enhanced text preprocessing (normalizes whitespace, extracts error patterns)
  • Extracts engineered features (error type indicators, message stats, special patterns)
  • Builds TF-IDF vocabulary with optimized parameters (up to 10,000 features)
  • Performs hyperparameter tuning with GridSearchCV (5-fold cross-validation)
  • Trains Logistic Regression classifier with best parameters
  • Saves 2 files to models/ folder:
    • feature_union.pkl (~50KB) - Complete feature pipeline
    • logistic_model.pkl (~10KB) - Trained classifier
  • Prints comprehensive evaluation metrics

Expected output:

Loading training data...
Preprocessing text data...
Building feature pipeline...
Extracting features...
Hyperparameter tuning with GridSearchCV...
Best parameters: {'C': 10, 'class_weight': 'balanced', 'solver': 'liblinear'}
Best cross-validation score: 0.95+

Test set accuracy: 0.95+
Classification Report: (precision, recall, F1-score per category)
Confusion Matrix: (shows classification performance)
Cross-validation F1 scores: [...]
Mean CV F1 score: 0.95+ (+/- ...)

Saving models...
Feature union pipeline saved to: models/feature_union.pkl
Logistic Regression model saved to: models/logistic_model.pkl

Training complete!

Step 3: Start the API Server

python run.py

Expected output:

INFO:     Loading models from models/...
INFO:     Error intent classification models loaded successfully!
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.

The API is now running at http://127.0.0.1:8000

Usage

Interactive API Documentation

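Open your browser and navigate to the Swagger UI. FastAPI serves it at /docs by default, so unless the project overrides that path it should be at http://127.0.0.1:8000/docs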

API Endpoints

1. Health Check

Invoke-RestMethod -Uri "http://127.0.0.1:8000/health"

Response:

{
  "status": "healthy",
  "models_loaded": true,
  "api_version": "2.0.0"
}

2. Get Error Categories

Invoke-RestMethod -Uri "http://127.0.0.1:8000/error-categories"

Response:

{
  "categories": [
    {
      "name": "NULL_REFERENCE",
      "description": "Null pointer or reference errors when accessing uninitialized objects",
      "suggested_action": "Check for null values before accessing object members, add null checks"
    },
    {
      "name": "NETWORK_CONNECTION",
      "description": "Network connectivity issues (timeout, refused, unreachable)",
      "suggested_action": "Check network connectivity, verify service availability, review firewall rules"
    }
    // ... (8 more categories)
  ],
  "total_count": 10
}

3. Classify Single Error Message

$body = @{
    error_message = "Connection timeout: failed to connect to database after 30 seconds"
    include_metadata = $true
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://127.0.0.1:8000/classify-error" `
  -Method POST `
  -ContentType "application/json" `
  -Body $body

Response:

{
  "error_category": "NETWORK_CONNECTION",
  "confidence": 0.89,
  "all_probabilities": {
    "NULL_REFERENCE": 0.01,
    "RESOURCE_EXHAUSTION": 0.02,
    "NETWORK_CONNECTION": 0.89,
    "DUPLICATE_CONFLICT": 0.01,
    "DATABASE_ERROR": 0.02,
    "TIMEOUT": 0.03,
    "CONFIGURATION": 0.01,
    "NOT_FOUND": 0.01,
    "AUTHENTICATION": 0.00,
    "VALIDATION": 0.00
  },
  "description": "Network connectivity issues (timeout, refused, unreachable)",
  "suggested_action": "Check network connectivity, verify service availability, review firewall rules"
}

4. Classify Multiple Error Messages (Batch)

$body = @{
    error_messages = @(
        "NullPointerException in UserService.getUser()",
        "Duplicate key error: user with email already exists",
        "Invalid input: age must be between 0 and 120"
    )
    include_metadata = $true
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://127.0.0.1:8000/classify-errors" `
  -Method POST `
  -ContentType "application/json" `
  -Body $body

Response:

{
  "results": [
    {
      "error_category": "NULL_REFERENCE",
      "confidence": 0.92,
      "all_probabilities": {...},
      "description": "Null pointer or reference errors...",
      "suggested_action": "Check for null values before accessing..."
    },
    {
      "error_category": "DUPLICATE_CONFLICT",
      "confidence": 0.87,
      "all_probabilities": {...},
      "description": "Duplicate entries or unique constraint violations",
      "suggested_action": "Check for existing records before insertion..."
    },
    {
      "error_category": "VALIDATION",
      "confidence": 0.85,
      "all_probabilities": {...},
      "description": "Input validation or data format errors",
      "suggested_action": "Validate input format, check data constraints..."
    }
  ]
}

Using cURL (Alternative)

# Single classification
curl -X POST "http://127.0.0.1:8000/classify-error" \
  -H "Content-Type: application/json" \
  -d '{"error_message": "Connection timeout after 30 seconds", "include_metadata": true}'

# Batch classification
curl -X POST "http://127.0.0.1:8000/classify-errors" \
  -H "Content-Type: application/json" \
  -d '{"error_messages": ["NullPointerException...", "Duplicate key..."], "include_metadata": true}'
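The same two calls can be made from Python. The sketch below uses only the standard library; the helper names `post_json`, `classify_error`, and `classify_errors_batch` are assumptions chosen to match the names used in the Use Cases section, not functions shipped with this repository. The `send` parameter is there so the helpers can be exercised without a running server.

```python
import json
from urllib import request

BASE_URL = "http://127.0.0.1:8000"  # address used throughout this README


def post_json(path, payload, base_url=BASE_URL, timeout=10):
    """POST a JSON payload and return the decoded JSON response."""
    req = request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def classify_error(message, include_metadata=True, send=post_json):
    """Classify one message via POST /classify-error (hypothetical helper)."""
    payload = {"error_message": message, "include_metadata": include_metadata}
    return send("/classify-error", payload)


def classify_errors_batch(messages, include_metadata=True, send=post_json):
    """Classify several messages via POST /classify-errors (hypothetical helper)."""
    payload = {"error_messages": list(messages), "include_metadata": include_metadata}
    return send("/classify-errors", payload)
```

With the server running, `classify_error("Connection timeout after 30 seconds")` should return a dictionary shaped like the single-classification response shown earlier.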

Use Cases

1. Error Monitoring & Aggregation

Automatically categorize incoming errors for better monitoring:

from collections import Counter

# Classify errors from application logs.
# classify_errors_batch is assumed to be a client helper that POSTs to /classify-errors.
errors = ["Connection timeout...", "NullPointerException..."]
response = classify_errors_batch(errors)

# Group by category for dashboard
error_counts = Counter(result['error_category'] for result in response['results'])

2. Intelligent Alerting

Route alerts to appropriate teams based on error type:

result = classify_error("Database deadlock detected")
if result['error_category'] == 'DATABASE_ERROR':
    alert_database_team(result)
elif result['error_category'] == 'NETWORK_CONNECTION':
    alert_infrastructure_team(result)

3. Automated Ticket Routing

Create tickets with proper categorization and suggested actions:

result = classify_error(error_message)
create_ticket(
    category=result['error_category'],
    description=result['description'],
    suggested_action=result['suggested_action'],
    priority=calculate_priority(result['confidence'])
)

4. Error Trend Analysis

Analyze error patterns over time:

# Classify all errors from the past week
errors_this_week = get_errors_from_logs()
classifications = classify_errors_batch(errors_this_week)

# Identify trending error categories
analyze_trends(classifications)

Customizing Training Data

To improve classification for your specific errors:

  1. Edit data/training_data.csv
  2. Add your own error examples with correct categories
  3. Ensure CSV format: error_message,error_category
  4. Use these category names: NULL_REFERENCE, RESOURCE_EXHAUSTION, NETWORK_CONNECTION, DUPLICATE_CONFLICT, DATABASE_ERROR, TIMEOUT, CONFIGURATION, NOT_FOUND, AUTHENTICATION, VALIDATION
  5. Re-run training: python -m src.train_model
  6. Restart the API server
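Before re-running training, it can help to check the edited CSV for typos in category names. The following is a minimal sketch, not part of the repository: `find_invalid_rows` is a hypothetical helper, and it assumes the header is exactly `error_message,error_category` as described in step 3.

```python
import csv
import io

VALID_CATEGORIES = {
    "NULL_REFERENCE", "RESOURCE_EXHAUSTION", "NETWORK_CONNECTION",
    "DUPLICATE_CONFLICT", "DATABASE_ERROR", "TIMEOUT", "CONFIGURATION",
    "NOT_FOUND", "AUTHENTICATION", "VALIDATION",
}


def find_invalid_rows(csv_file):
    """Return (line_number, reason) pairs for rows that look wrong."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != ["error_message", "error_category"]:
        return [(1, f"unexpected header: {reader.fieldnames}")]
    problems = []
    for row in reader:
        # Missing columns come back as None, so fall back to an empty string
        message = (row["error_message"] or "").strip()
        category = (row["error_category"] or "").strip()
        if not message:
            problems.append((reader.line_num, "empty error_message"))
        if category not in VALID_CATEGORIES:
            problems.append((reader.line_num, f"unknown category: {category!r}"))
    return problems


sample = io.StringIO(
    'error_message,error_category\n'
    '"Custom null error in MyService",NULL_REFERENCE\n'
    '"My application out of memory",OUT_OF_MEMORY\n'
)
print(find_invalid_rows(sample))  # → [(3, "unknown category: 'OUT_OF_MEMORY'")]
```

To check the real file, open it with `open("data/training_data.csv", newline="", encoding="utf-8")` and pass the file object to the function.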

Example:

error_message,error_category
"Custom null error in MyService",NULL_REFERENCE
"My application out of memory",RESOURCE_EXHAUSTION

How It Works

Model Architecture

  1. Enhanced Preprocessing:

    • Normalizes whitespace and removes punctuation
    • Preserves important error patterns (exception names, error codes)
  2. Feature Engineering:

    • Extracts error type indicators (null, timeout, not found, etc.)
    • Calculates message statistics (character count, word count)
    • Detects special patterns (IP addresses, file paths, numbers)
  3. TF-IDF Vectorization:

    • Converts error text into numerical features using optimized parameters
    • Uses unigrams and bigrams for better context
    • Vocabulary size: up to 10,000 features
    • Applies sublinear term frequency scaling
  4. Feature Combination:

    • Combines TF-IDF features with engineered features using FeatureUnion
    • Scales engineered features for optimal performance
  5. Logistic Regression:

    • Hyperparameter-tuned with GridSearchCV (5-fold cross-validation)
    • Uses L2 regularization to prevent overfitting
    • Optimized C parameter and solver selection
  6. Classification:

    • Predicts error category with confidence score
    • Returns probabilities for all categories
    • Provides human-readable descriptions and suggested actions
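To make the feature engineering step (item 2 above) more concrete, here is a rough sketch of what such an extractor could look like. The regular expressions and feature names are illustrative assumptions; the actual indicators in src/train_model.py may differ.

```python
import re

# Hypothetical keyword indicators; the project's real list may differ.
INDICATORS = {
    "has_null": re.compile(r"null", re.IGNORECASE),
    "has_timeout": re.compile(r"time(d)?[ -]?out", re.IGNORECASE),
    "has_not_found": re.compile(r"not found|notfound", re.IGNORECASE),
}
IP_ADDRESS = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
FILE_PATH = re.compile(r"(?:[A-Za-z]:\\|/)[\w.\-]+(?:[\\/][\w.\-]+)+")
NUMBER = re.compile(r"\d+")


def engineered_features(message):
    """Map one error message to a small dictionary of numeric features."""
    # Error type indicators: 1 if the keyword pattern occurs, else 0
    features = {name: int(bool(p.search(message))) for name, p in INDICATORS.items()}
    # Message statistics
    features["char_count"] = len(message)
    features["word_count"] = len(message.split())
    # Special patterns
    features["has_ip"] = int(bool(IP_ADDRESS.search(message)))
    features["has_path"] = int(bool(FILE_PATH.search(message)))
    features["has_number"] = int(bool(NUMBER.search(message)))
    return features


print(engineered_features("Connection timeout after 30 seconds"))
```

Features like these are dense and on very different scales (a character count versus a 0/1 flag), which is why the architecture above scales them before combining them with the TF-IDF features.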

Why Semantic Intent Classification?

Traditional approach:

"ERROR: Connection timeout" → category: ERROR

Semantic intent approach:

"Connection timeout after 30 seconds" → category: NETWORK_CONNECTION
  + description: "Network connectivity issues (timeout, refused, unreachable)"
  + suggested_action: "Check network connectivity, verify service availability"

Benefits:

  • Actionable insights: Know what type of error occurred, not just severity
  • Better routing: Send errors to the right team (DB team, network team, etc.)
  • Improved monitoring: Track specific error patterns (e.g., spike in TIMEOUT errors)
  • Human-friendly: Categories are self-explanatory

How Classification Works (Simple Explanation)

Think of it like a smart email spam filter, but for error messages.

Step 1: When You Call the API

You send an error message to /classify-error:

POST /classify-error
{
  "error_message": "Database connection timeout after 30 seconds"
}

Step 2: Text Cleaning

Your message gets cleaned up:

  • Converted to lowercase
  • Special characters normalized
  • Important patterns preserved
"Database connection timeout after 30 seconds" 
  → "database connection timeout after 30 seconds"
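A cleaning step along these lines could be written as follows. This is a sketch under assumptions: `clean_message` is a hypothetical name, and the exact set of characters the project keeps is not documented here, so the character class below is a guess that preserves separators commonly found in exception names and paths.

```python
import re


def clean_message(message):
    """Lowercase, replace unusual characters with spaces, collapse whitespace."""
    text = message.lower()
    # Keep word characters, whitespace, and a few separators (. : / \ ' -)
    text = re.sub(r"[^\w\s.:/\\'-]", " ", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()


print(clean_message("Database connection timeout after 30 seconds"))
# → database connection timeout after 30 seconds
```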

Step 3: Converting to Numbers

A model can't work with raw words directly, so the text is converted to numbers:

  • Each word gets a score based on how important/rare it is
  • The system learned these scores from the training data
  • This creates a unique "fingerprint" for your error message
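The idea that rarer words get higher scores can be shown with a tiny pure-Python TF-IDF. This is a simplified sketch, not the project's code: it handles unigrams only, and it uses the smoothed IDF formula and sublinear term frequency as documented for scikit-learn's defaults. Whether the project's vectorizer settings match exactly is an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (unigrams only)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many messages each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            # sublinear tf (1 + ln tf) times smoothed idf (ln((1+n)/(1+df)) + 1)
            term: (1 + math.log(tf)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        # L2-normalize so every message has a vector of length 1
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = [
    "connection timeout after 30 seconds",
    "connection refused by host",
    "connection pool exhausted",
]
first = tfidf_vectors(docs)[0]
print(first["timeout"] > first["connection"])  # → True ("timeout" is rarer)
```

Because "connection" appears in all three messages, it carries little information about which message is which, while "timeout" appears in only one and is weighted more heavily.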

Step 4: The AI Makes a Decision

The AI (Logistic Regression model) assigns a probability to each of the 10 categories. The probabilities always sum to 1:

Input: "Database connection timeout after 30 seconds"

Illustrative probabilities (not real model output):
✅ NETWORK_CONNECTION:  0.85  (has "connection", "timeout")
   DATABASE_ERROR:      0.06  (has "database")
   TIMEOUT:             0.05
   RESOURCE_EXHAUSTION: 0.02
   NULL_REFERENCE:      0.01
   ... others:          0.01 combined

Winner: NETWORK_CONNECTION with 85% confidence!

How did it learn to score? During training, the model studied ~98 labeled examples and learned patterns like:

  • Messages with "null", "NullPointer" → usually NULL_REFERENCE
  • Messages with "timeout", "connection failed" → usually NETWORK_CONNECTION
  • Messages with "memory", "exhausted" → usually RESOURCE_EXHAUSTION

Step 5: You Get the Result

{
  "error_category": "NETWORK_CONNECTION",
  "confidence": 0.85,
  "all_probabilities": {
    "NETWORK_CONNECTION": 0.85,
    "DATABASE_ERROR": 0.06,
    "TIMEOUT": 0.05,
    ...
  },
  "description": "Network connectivity issues (timeout, refused, unreachable)",
  "suggested_action": "Check network connectivity, verify service availability"
}
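For readers curious how per-category scores become the values in all_probabilities, here is a small sketch. It assumes a one-vs-rest setup, which is what the liblinear solver in the sample training output implies: each category gets its own sigmoid score, and the scores are then normalized so they sum to 1. The decision scores below are made up for illustration, and `ovr_probabilities` is a hypothetical name.

```python
import math


def ovr_probabilities(decision_scores):
    """Turn per-category decision scores into probabilities that sum to 1."""
    # One sigmoid per category (one-vs-rest), then normalize across categories
    sigmoids = {cat: 1.0 / (1.0 + math.exp(-s)) for cat, s in decision_scores.items()}
    total = sum(sigmoids.values())
    return {cat: p / total for cat, p in sigmoids.items()}


# Hypothetical decision scores, for illustration only
scores = {
    "NETWORK_CONNECTION": 2.5,
    "DATABASE_ERROR": -1.0,
    "TIMEOUT": -1.5,
    "NULL_REFERENCE": -4.0,
}
probs = ovr_probabilities(scores)
best = max(probs, key=probs.get)
print(best, round(probs[best], 2))
```

The reported confidence is simply the largest of these normalized values, so it can never exceed 1 and the full set of category probabilities always adds up to 1.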

The Learning Process (Training)

When you run python -m src.train_model, here's what happens:

  1. Study Phase: The model reads ~98 example error messages with their correct categories
  2. Pattern Recognition: It finds patterns - which words appear in which categories
  3. Practice Tests: It tests itself with cross-validation to check accuracy
  4. Save Knowledge: It saves what it learned to models/ folder
  5. Ready to Use: When the API starts, it loads this saved knowledge

It's like studying for a test:

  • Training data = study materials
  • Training = studying and memorizing
  • Models folder = brain with memorized knowledge
  • Classification = answering questions on the test

That's it! The system uses math (not magic) to recognize patterns and classify new errors based on what it learned from examples.

Troubleshooting

Models not found error

Error: FileNotFoundError: Feature union pipeline not found

Solution: Run the training script first:

python -m src.train_model

Port already in use

Error: Address already in use

Solution: Change the port in run.py:

uvicorn.run("src.api:app", host="127.0.0.1", port=8001)

Import errors

Error: ModuleNotFoundError

Solution: Ensure virtual environment is activated and dependencies are installed:

.\venv\Scripts\Activate.ps1
pip install -r requirements.txt

Low confidence scores

If confidence scores are consistently low (<0.5):

  • Add more training examples for underrepresented categories
  • Review ambiguous error messages that don't fit clearly into one category
  • Consider adding more specific examples to training data
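One way to find candidates for new training examples is to flag predictions that are either below the 0.5 threshold mentioned above or nearly tied with the runner-up category. The sketch below works on the response dictionaries returned by /classify-error; `needs_review` and the `margin` value are assumptions chosen for illustration.

```python
def needs_review(result, threshold=0.5, margin=0.1):
    """Flag predictions that are low-confidence or nearly tied with the runner-up."""
    ranked = sorted(result["all_probabilities"].values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return result["confidence"] < threshold or (ranked[0] - runner_up) < margin


# Hypothetical response, shaped like the /classify-error output
result = {
    "error_category": "TIMEOUT",
    "confidence": 0.42,
    "all_probabilities": {"TIMEOUT": 0.42, "NETWORK_CONNECTION": 0.40, "DATABASE_ERROR": 0.18},
}
print(needs_review(result))  # → True
```

Messages flagged this way are good candidates to label by hand and append to data/training_data.csv before retraining.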

Model Performance

The model uses several techniques for high accuracy:

  • Feature Engineering: Extracts semantic indicators from error messages
  • TF-IDF Optimization: up to 10,000 features, bigrams, sublinear scaling
  • Hyperparameter Tuning: GridSearchCV with 5-fold cross-validation
  • Regularization: L2 penalty to prevent overfitting
  • Balanced Classes: Optional class weighting for imbalanced datasets

Expected Performance (on the bundled ~98-example dataset; with so little data, treat these figures as rough estimates):

  • Test accuracy: 90-100%
  • Cross-validation F1: 0.90+
  • Per-category precision/recall: 0.85+

Dependencies

  • scikit-learn - Machine learning library (TF-IDF, Logistic Regression, GridSearchCV)
  • pandas - Data manipulation
  • numpy - Numerical operations
  • fastapi - Web framework for API
  • uvicorn - ASGI server
  • joblib - Model serialization
  • pydantic - Data validation

API Migration (v1 to v2)

Breaking Changes

Endpoints renamed:

  • /classify → /classify-error
  • /classify/batch → /classify-errors
  • /categories → /error-categories

Request/Response changes:

  • Request field: log_entry → error_message
  • Response field: category → error_category
  • Added: description and suggested_action fields

Backward Compatibility

Legacy endpoints still work:

  • /classify redirects to /classify-error
  • /categories redirects to /error-categories
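Client code written against v1 can be adapted with a thin shim while it is being migrated. The sketch below is an illustration, not part of the project: `upgrade_request` and `read_category` are hypothetical helpers, and the `/classify/batch` mapping is inferred from the renamed-endpoints list rather than from the redirect list above. Only the single-message field rename (`log_entry`) is documented, so batch payloads are passed through unchanged.

```python
# v1 path -> v2 path. The /classify/batch entry is inferred from the
# renamed-endpoints list; the other two are also listed as redirects.
V1_TO_V2_PATHS = {
    "/classify": "/classify-error",
    "/classify/batch": "/classify-errors",
    "/categories": "/error-categories",
}


def upgrade_request(path, payload=None):
    """Translate a v1 path and request body into their v2 equivalents."""
    new_path = V1_TO_V2_PATHS.get(path, path)
    new_payload = dict(payload or {})
    if "log_entry" in new_payload:
        # v1 request field log_entry became error_message in v2
        new_payload["error_message"] = new_payload.pop("log_entry")
    return new_path, new_payload


def read_category(response):
    """Read the predicted category from either a v2 or a v1 response."""
    return response.get("error_category", response.get("category"))


print(upgrade_request("/classify", {"log_entry": "Connection refused"}))
# → ('/classify-error', {'error_message': 'Connection refused'})
```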

License

This project is provided as-is for demonstration purposes.
