Log Intent Classifier

A FastAPI service that classifies error messages by intent, not just log level. It uses TF-IDF + Logistic Regression to map an error string to one of 10 practical categories (e.g., NULL_REFERENCE, NETWORK_CONNECTION, RESOURCE_EXHAUSTION) and returns a confidence score and a suggested action.

Important: No LLM Required

This solution uses traditional machine learning (TF-IDF + Logistic Regression), NOT a large language model:

✅ No model download required - the model is trained locally from scratch using your training data
✅ No GPU needed - runs on any CPU
✅ No internet required after initial pip install - fully offline capable
✅ Training takes seconds - not hours
✅ Model size is tiny - tens of KB, not GB

Features

RESTful API with FastAPI
Semantic error classification by intent (not just log levels)
10 human-readable error categories
Confidence scores and probability distributions
Category descriptions and suggested actions for each error type
Single and batch classification endpoints
Health check and category listing endpoints
Interactive API documentation (Swagger UI)
Enhanced model with feature engineering and hyperparameter tuning

Error Categories

The API classifies errors into 10 semantic categories:

Category	Description	Example
NULL_REFERENCE	Null pointer or reference errors	"NullPointerException in UserService"
RESOURCE_EXHAUSTION	Out of memory, disk full, connection pool exhausted	"OutOfMemoryError: Java heap space"
NETWORK_CONNECTION	Connection timeout, refused, unreachable	"Connection timeout after 30 seconds"
DUPLICATE_CONFLICT	Entry already exists, unique constraint violation	"Duplicate key error: email already exists"
DATABASE_ERROR	SQL errors, deadlocks, transaction failures	"SQLException: table 'users' does not exist"
TIMEOUT	Operation exceeded time limit	"Request timeout: exceeded 30 second limit"
CONFIGURATION	Missing/invalid configuration	"Required property 'database.url' not found"
NOT_FOUND	Resource, file, or endpoint not found	"FileNotFoundException: config.properties"
AUTHENTICATION	Invalid credentials, access denied	"Authentication failed: invalid password"
VALIDATION	Input validation or format errors	"Invalid email format"

Project Structure

log-intent-classifier/
├── .gitignore                # Ignore venv, generated models, etc.
├── requirements.txt          # Python dependencies
├── README.md                 # This file
├── run.py                    # API server entry point
├── test_server.py            # Minimal server smoke-test (optional)
├── assets/
│   └── log_classifier_before_after.png  # README image
├── data/
│   └── training_data.csv     # Training data (~98 examples)
├── models/                   # Saved model files (created after training; ignored by git)
│   ├── feature_union.pkl     # Feature pipeline (TF-IDF + engineered features)
│   └── logistic_model.pkl    # Trained Logistic Regression model
├── scripts/
│   └── generate_linkedin_image.py  # Generates the README image (optional)
└── src/
    ├── __init__.py
    ├── train_model.py        # Training script with hyperparameter tuning
    ├── classifier.py         # Error classifier with category metadata
    └── api.py                # FastAPI application

Installation & Setup

Prerequisites

Python 3.9 or higher installed
pip (Python package manager)

Step 1: Setup Environment

# Navigate to project folder
cd "C:\path\to\log-intent-classifier"

# Create virtual environment
python -m venv venv

# Activate virtual environment (Windows PowerShell)
.\venv\Scripts\Activate.ps1

# Install dependencies (downloads packages from PyPI, ~50MB total)
pip install -r requirements.txt

Note: If you get an execution policy error in PowerShell, run:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Step 2: Train the Model

Train the model using the provided training data:

python -m src.train_model

What happens:

Reads data/training_data.csv (~98 examples across 10 categories)
Preprocesses text (normalizes whitespace, extracts error patterns)
Extracts engineered features (error type indicators, message stats, special patterns)
Builds TF-IDF vocabulary with optimized parameters (up to 10,000 features)
Performs hyperparameter tuning with GridSearchCV (5-fold cross-validation)
Trains Logistic Regression classifier with best parameters
Saves 2 files to models/ folder:
- feature_union.pkl (~50KB) - Complete feature pipeline
- logistic_model.pkl (~10KB) - Trained classifier
Prints comprehensive evaluation metrics

Expected output:

Loading training data...
Preprocessing text data...
Building feature pipeline...
Extracting features...
Hyperparameter tuning with GridSearchCV...
Best parameters: {'C': 10, 'class_weight': 'balanced', 'solver': 'liblinear'}
Best cross-validation score: 0.95+

Test set accuracy: 0.95+
Classification Report: (precision, recall, F1-score per category)
Confusion Matrix: (shows classification performance)
Cross-validation F1 scores: [...]
Mean CV F1 score: 0.95+ (+/- ...)

Saving models...
Feature union pipeline saved to: models/feature_union.pkl
Logistic Regression model saved to: models/logistic_model.pkl

Training complete!

Step 3: Start the API Server

python run.py

Expected output:

INFO:     Loading models from models/...
INFO:     Error intent classification models loaded successfully!
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.

The API is now running at http://127.0.0.1:8000

Usage

Interactive API Documentation

Open your browser and navigate to:

Swagger UI: http://127.0.0.1:8000/docs
ReDoc: http://127.0.0.1:8000/redoc

API Endpoints

1. Health Check

Invoke-RestMethod -Uri "http://127.0.0.1:8000/health"

Response:

{
  "status": "healthy",
  "models_loaded": true,
  "api_version": "2.0.0"
}

2. Get Error Categories

Invoke-RestMethod -Uri "http://127.0.0.1:8000/error-categories"

Response:

{
  "categories": [
    {
      "name": "NULL_REFERENCE",
      "description": "Null pointer or reference errors when accessing uninitialized objects",
      "suggested_action": "Check for null values before accessing object members, add null checks"
    },
    {
      "name": "NETWORK_CONNECTION",
      "description": "Network connectivity issues (timeout, refused, unreachable)",
      "suggested_action": "Check network connectivity, verify service availability, review firewall rules"
    }
    // ... (8 more categories)
  ],
  "total_count": 10
}

3. Classify Single Error Message

$body = @{
    error_message = "Connection timeout: failed to connect to database after 30 seconds"
    include_metadata = $true
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://127.0.0.1:8000/classify-error" `
  -Method POST `
  -ContentType "application/json" `
  -Body $body

Response:

{
  "error_category": "NETWORK_CONNECTION",
  "confidence": 0.89,
  "all_probabilities": {
    "NULL_REFERENCE": 0.01,
    "RESOURCE_EXHAUSTION": 0.02,
    "NETWORK_CONNECTION": 0.89,
    "DUPLICATE_CONFLICT": 0.01,
    "DATABASE_ERROR": 0.02,
    "TIMEOUT": 0.03,
    "CONFIGURATION": 0.01,
    "NOT_FOUND": 0.01,
    "AUTHENTICATION": 0.00,
    "VALIDATION": 0.00
  },
  "description": "Network connectivity issues (timeout, refused, unreachable)",
  "suggested_action": "Check network connectivity, verify service availability, review firewall rules"
}

4. Classify Multiple Error Messages (Batch)

$body = @{
    error_messages = @(
        "NullPointerException in UserService.getUser()",
        "Duplicate key error: user with email already exists",
        "Invalid input: age must be between 0 and 120"
    )
    include_metadata = $true
} | ConvertTo-Json

Invoke-RestMethod -Uri "http://127.0.0.1:8000/classify-errors" `
  -Method POST `
  -ContentType "application/json" `
  -Body $body

Response:

{
  "results": [
    {
      "error_category": "NULL_REFERENCE",
      "confidence": 0.92,
      "all_probabilities": {...},
      "description": "Null pointer or reference errors...",
      "suggested_action": "Check for null values before accessing..."
    },
    {
      "error_category": "DUPLICATE_CONFLICT",
      "confidence": 0.87,
      "all_probabilities": {...},
      "description": "Duplicate entries or unique constraint violations",
      "suggested_action": "Check for existing records before insertion..."
    },
    {
      "error_category": "VALIDATION",
      "confidence": 0.85,
      "all_probabilities": {...},
      "description": "Input validation or data format errors",
      "suggested_action": "Validate input format, check data constraints..."
    }
  ]
}

Using cURL (Alternative)

# Single classification
curl -X POST "http://127.0.0.1:8000/classify-error" \
  -H "Content-Type: application/json" \
  -d '{"error_message": "Connection timeout after 30 seconds", "include_metadata": true}'

# Batch classification
curl -X POST "http://127.0.0.1:8000/classify-errors" \
  -H "Content-Type: application/json" \
  -d '{"error_messages": ["NullPointerException...", "Duplicate key..."], "include_metadata": true}'
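The same two calls can be made from Python. The sketch below uses only the standard library and defines the `classify_error` and `classify_errors_batch` helpers that the Use Cases snippets rely on. The helper names, the `BASE_URL` constant, and the injectable `post` parameter are assumptions made for illustration; the project itself does not ship a client.

```python
import json
from urllib import request

BASE_URL = "http://127.0.0.1:8000"  # default address from "Start the API Server"


def _post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def classify_error(message, include_metadata=True, post=_post_json):
    """Classify one error message via POST /classify-error."""
    payload = {"error_message": message, "include_metadata": include_metadata}
    return post(f"{BASE_URL}/classify-error", payload)


def classify_errors_batch(messages, include_metadata=True, post=_post_json):
    """Classify several error messages via POST /classify-errors."""
    payload = {
        "error_messages": list(messages),
        "include_metadata": include_metadata,
    }
    return post(f"{BASE_URL}/classify-errors", payload)
```

Passing a different `post` callable makes the helpers easy to exercise without a running server.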

Use Cases

1. Error Monitoring & Aggregation

Automatically categorize incoming errors for better monitoring:

# Classify errors from application logs
errors = ["Connection timeout...", "NullPointerException...", ...]
response = classify_errors_batch(errors)

# Group by category for dashboard
error_counts = {}
for result in response['results']:
    category = result['error_category']
    error_counts[category] = error_counts.get(category, 0) + 1

2. Intelligent Alerting

Route alerts to appropriate teams based on error type:

result = classify_error("Database deadlock detected")
if result['error_category'] == 'DATABASE_ERROR':
    alert_database_team(result)
elif result['error_category'] == 'NETWORK_CONNECTION':
    alert_infrastructure_team(result)
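As the number of categories grows, an if/elif chain like the one above can be replaced by a lookup table. This is a minimal sketch; the team names, the `ROUTES` mapping, and the `pick_team` helper are hypothetical and would need to match your own organization.

```python
# Hypothetical routing table: category name -> team identifier.
ROUTES = {
    "DATABASE_ERROR": "database-team",
    "DUPLICATE_CONFLICT": "database-team",
    "NETWORK_CONNECTION": "infrastructure-team",
    "TIMEOUT": "infrastructure-team",
    "RESOURCE_EXHAUSTION": "infrastructure-team",
    "AUTHENTICATION": "security-team",
}
DEFAULT_TEAM = "on-call"  # fallback for categories without an explicit route


def pick_team(result, routes=ROUTES, default=DEFAULT_TEAM):
    """Return the team for a classification result shaped like the API response."""
    return routes.get(result["error_category"], default)
```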

3. Automated Ticket Routing

Create tickets with proper categorization and suggested actions:

result = classify_error(error_message)
create_ticket(
    category=result['error_category'],
    description=result['description'],
    suggested_action=result['suggested_action'],
    priority=calculate_priority(result['confidence'])
)
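`calculate_priority` is not defined by the project. One possible shape, assuming you want the model's confidence to decide how much human review a ticket gets, is sketched below. The thresholds and labels are illustrative assumptions.

```python
def calculate_priority(confidence, high=0.8, low=0.5):
    """Map a confidence in [0, 1] to a ticket handling label.

    The thresholds and labels are illustrative, not project defaults.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return "auto-route"      # confident enough to file directly
    if confidence >= low:
        return "needs-review"    # plausible, but worth a quick human check
    return "manual-triage"       # too uncertain to trust the category
```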

4. Error Trend Analysis

Analyze error patterns over time:

# Classify all errors from the past week
errors_this_week = get_errors_from_logs()
classifications = classify_errors_batch(errors_this_week)

# Identify trending error categories
analyze_trends(classifications)
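`analyze_trends` is left to the reader. A simple version, assuming you keep category counts for two time windows, could flag categories whose volume has grown sharply. The function names, the 2x growth factor, and the minimum count are assumptions chosen for illustration.

```python
from collections import Counter


def category_counts(results):
    """Count predicted categories in a list of classification results."""
    return Counter(r["error_category"] for r in results)


def trending_categories(previous, current, min_increase=2.0, min_count=5):
    """Return categories whose count grew by at least `min_increase` times.

    `previous` and `current` are Counters for two time windows. Categories
    absent from `previous` count as trending once they reach `min_count`.
    """
    trending = []
    for category, now in current.items():
        before = previous.get(category, 0)
        if now < min_count:
            continue  # too few occurrences to call it a trend
        if before == 0 or now / before >= min_increase:
            trending.append(category)
    return sorted(trending)
```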

Customizing Training Data

To improve classification for your specific errors:

Edit data/training_data.csv
Add your own error examples with correct categories
Ensure CSV format: error_message,error_category
Use these category names: NULL_REFERENCE, RESOURCE_EXHAUSTION, NETWORK_CONNECTION, DUPLICATE_CONFLICT, DATABASE_ERROR, TIMEOUT, CONFIGURATION, NOT_FOUND, AUTHENTICATION, VALIDATION
Re-run training: python -m src.train_model
Restart the API server
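Before re-running training, it can help to check the edited CSV for a wrong header, empty messages, or misspelled category names. The validator below is a sketch using the standard `csv` module; `validate_training_csv` is a hypothetical helper, not part of the project.

```python
import csv

VALID_CATEGORIES = {
    "NULL_REFERENCE", "RESOURCE_EXHAUSTION", "NETWORK_CONNECTION",
    "DUPLICATE_CONFLICT", "DATABASE_ERROR", "TIMEOUT", "CONFIGURATION",
    "NOT_FOUND", "AUTHENTICATION", "VALIDATION",
}


def validate_training_csv(stream):
    """Return a list of problems found in a training CSV file-like object."""
    reader = csv.DictReader(stream)
    if reader.fieldnames != ["error_message", "error_category"]:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    for row in reader:
        line = reader.line_num  # line number of the row just read
        if not (row["error_message"] or "").strip():
            problems.append(f"line {line}: empty error_message")
        if row["error_category"] not in VALID_CATEGORIES:
            problems.append(
                f"line {line}: unknown category {row['error_category']!r}"
            )
    return problems
```

To check the real file, open it with `open("data/training_data.csv", newline="", encoding="utf-8")` and pass the file object to `validate_training_csv`; an empty list means no problems were found.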

Example:

error_message,error_category
"Custom null error in MyService",NULL_REFERENCE
"My application out of memory",RESOURCE_EXHAUSTION

How It Works

Model Architecture

Enhanced Preprocessing:
- Normalizes whitespace and removes punctuation
- Preserves important error patterns (exception names, error codes)
Feature Engineering:
- Extracts error type indicators (null, timeout, not found, etc.)
- Calculates message statistics (character count, word count)
- Detects special patterns (IP addresses, file paths, numbers)
TF-IDF Vectorization:
- Converts error text into numerical features using optimized parameters
- Uses unigrams and bigrams for better context
- Vocabulary size: up to 10,000 features
- Applies sublinear term frequency scaling
Feature Combination:
- Combines TF-IDF features with engineered features using FeatureUnion
- Scales engineered features for optimal performance
Logistic Regression:
- Hyperparameter-tuned with GridSearchCV (5-fold cross-validation)
- Uses L2 regularization to prevent overfitting
- Optimized C parameter and solver selection
Classification:
- Predicts error category with confidence score
- Returns probabilities for all categories
- Provides human-readable descriptions and suggested actions
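The architecture above can be sketched with scikit-learn as follows. This is not the contents of `src/train_model.py`, which are not shown here: the keyword list, the `engineered_features` and `build_search` names, and the parameter grid are assumptions, and the sketch uses scikit-learn's default solver rather than tuning the solver.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

KEYWORDS = ("null", "timeout", "not found")  # illustrative indicator terms


def engineered_features(messages):
    """One row per message: keyword flags, character count, word count."""
    rows = []
    for msg in messages:
        lower = msg.lower()
        flags = [float(k in lower) for k in KEYWORDS]
        rows.append(flags + [float(len(msg)), float(len(msg.split()))])
    return np.array(rows)


def build_search(cv=5):
    """Grid search over (TF-IDF + scaled engineered features) -> LogisticRegression."""
    features = FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=10_000,
                                  sublinear_tf=True)),
        ("engineered", Pipeline([
            ("extract", FunctionTransformer(engineered_features)),
            ("scale", StandardScaler()),
        ])),
    ])
    model = Pipeline([
        ("features", features),
        ("clf", LogisticRegression(max_iter=1000)),  # L2 penalty is the default
    ])
    grid = {"clf__C": [0.1, 1, 10], "clf__class_weight": [None, "balanced"]}
    return GridSearchCV(model, grid, cv=cv, scoring="f1_macro")
```

Calling `build_search().fit(messages, labels)` returns a fitted search object whose `predict` and `predict_proba` methods accept raw message strings, because feature extraction lives inside the pipeline.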

Why Semantic Intent Classification?

Traditional approach:

"ERROR: Connection timeout" → category: ERROR

Semantic intent approach:

"Connection timeout after 30 seconds" → category: NETWORK_CONNECTION
  + description: "Network connectivity issues (timeout, refused, unreachable)"
  + suggested_action: "Check network connectivity, verify service availability"

Benefits:

Actionable insights: Know what type of error occurred, not just severity
Better routing: Send errors to the right team (DB team, network team, etc.)
Improved monitoring: Track specific error patterns (e.g., spike in TIMEOUT errors)
Human-friendly: Categories are self-explanatory

How Classification Works (Simple Explanation)

Think of it like a smart email spam filter, but for error messages

Step 1: When You Call the API

You send an error message to /classify-error:

POST /classify-error
{
  "error_message": "Database connection timeout after 30 seconds"
}

Step 2: Text Cleaning

Your message gets cleaned up:

Converted to lowercase
Special characters normalized
Important patterns preserved

"Database connection timeout after 30 seconds" 
  → "database connection timeout after 30 seconds"
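A cleaning step along these lines might look like the sketch below. The exact rules in the project are not shown, so the set of characters kept (dots, colons, slashes, apostrophes, hyphens) is an assumption meant to preserve exception names, paths, and error codes.

```python
import re

_WHITESPACE = re.compile(r"\s+")
# Keep word characters, whitespace, and a few symbols that commonly appear
# inside exception names, file paths, and error codes; replace everything else.
_NOISE = re.compile(r"[^\w\s.:/'-]")


def clean_message(message):
    """Lowercase, replace unusual symbols with spaces, and collapse whitespace."""
    text = message.lower()
    text = _NOISE.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()
```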

Step 3: Converting to Numbers

Computers can't think about words directly, so the text is converted to numbers:

Each word gets a score based on how important/rare it is
The system learned these scores from the training data
This creates a unique "fingerprint" for your error message
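To make the "score based on how important/rare it is" idea concrete, here is a small pure-Python sketch of TF-IDF weighting. It uses the smoothed IDF formula and sublinear term frequency that scikit-learn documents for `TfidfVectorizer`, but it is a teaching aid with hypothetical function names, not the project's code.

```python
import math
from collections import Counter


def fit_idf(documents):
    """Learn smoothed inverse document frequencies from tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Weight known terms by sublinear TF times IDF, then L2-normalize."""
    counts = Counter(t for t in tokens if t in idf)  # unseen words are ignored
    weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}
```

Words that appear in many training messages receive a lower weight than words that appear in few, which is why a rare token such as an exception name tends to dominate the fingerprint.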

Step 4: The AI Makes a Decision

The AI (Logistic Regression model) scores all 10 categories:

Input: "Database connection timeout after 30 seconds"

Scoring (illustrative probabilities; all 10 categories sum to 100%):
✅ NETWORK_CONNECTION:  85%  (has "connection", "timeout")
   DATABASE_ERROR:       8%  (has "database")
   TIMEOUT:              3%
   RESOURCE_EXHAUSTION:  1%
   ... others:          under 1% each

Winner: NETWORK_CONNECTION with 85% confidence!

How did it learn to score? During training, the model studied ~98 labeled examples and learned patterns like:

Messages with "null", "NullPointer" → usually NULL_REFERENCE
Messages with "timeout", "connection failed" → usually NETWORK_CONNECTION
Messages with "memory", "exhausted" → usually RESOURCE_EXHAUSTION
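The step from raw per-category scores to a confidence value can be illustrated with a softmax, which is how a multinomial logistic regression turns scores into probabilities that sum to 1. Depending on the solver, scikit-learn may instead normalize one-vs-rest probabilities, so treat this as a sketch of the idea with made-up scores rather than the exact computation the model performs.

```python
import math


def softmax(scores):
    """Turn raw per-category scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(s - peak) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def top_category(scores):
    """Return (category, confidence) for the highest-probability category."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    return best, probs[best]
```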

Step 5: You Get the Result

{
  "error_category": "NETWORK_CONNECTION",
  "confidence": 0.85,
  "all_probabilities": {
    "NETWORK_CONNECTION": 0.85,
    "DATABASE_ERROR": 0.08,
    "TIMEOUT": 0.03,
    ...
  },
  "description": "Network connectivity issues (timeout, refused, unreachable)",
  "suggested_action": "Check network connectivity, verify service availability"
}

The Learning Process (Training)

When you run python -m src.train_model, here's what happens:

Study Phase: The model reads ~98 example error messages with their correct categories
Pattern Recognition: It finds patterns - which words appear in which categories
Practice Tests: It tests itself with cross-validation to check accuracy
Save Knowledge: It saves what it learned to models/ folder
Ready to Use: When the API starts, it loads this saved knowledge

It's like studying for a test:

Training data = study materials
Training = studying and memorizing
Models folder = brain with memorized knowledge
Classification = answering questions on the test

That's it! The system uses math (not magic) to recognize patterns and classify new errors based on what it learned from examples.

Troubleshooting

Models not found error

Error: FileNotFoundError: Feature union pipeline not found

Solution: Run the training script first:

python -m src.train_model

Port already in use

Error: Address already in use

Solution: Change the port in run.py:

uvicorn.run("src.api:app", host="127.0.0.1", port=8001)

Import errors

Error: ModuleNotFoundError

Solution: Ensure virtual environment is activated and dependencies are installed:

.\venv\Scripts\Activate.ps1
pip install -r requirements.txt

Low confidence scores

If confidence scores are consistently low (<0.5):

Add more training examples for underrepresented categories
Review ambiguous error messages that don't fit clearly into one category
Consider adding more specific examples to training data
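To find the messages worth reviewing, one option is to flag results where the top category is weak or only narrowly beats the runner-up. The sketch below assumes results shaped like the /classify-error response; the `find_ambiguous` name and both thresholds are illustrative.

```python
def find_ambiguous(results, min_confidence=0.5, min_margin=0.2):
    """Return results whose top category is weak or barely beats the runner-up.

    Each result needs "confidence" and "all_probabilities" keys, as in the
    /classify-error response. Thresholds are illustrative.
    """
    flagged = []
    for result in results:
        ranked = sorted(result["all_probabilities"].values(), reverse=True)
        # Pad with zeros so results with fewer than two categories still work.
        top, runner_up = (ranked + [0.0, 0.0])[:2]
        if result["confidence"] < min_confidence or top - runner_up < min_margin:
            flagged.append(result)
    return flagged
```

Messages flagged this way are good candidates for new, correctly labeled rows in data/training_data.csv.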

Model Performance

The model uses several techniques for high accuracy:

Feature Engineering: Extracts semantic indicators from error messages
TF-IDF Optimization: up to 10,000 features, bigrams, sublinear scaling
Hyperparameter Tuning: GridSearchCV with 5-fold cross-validation
Regularization: L2 penalty to prevent overfitting
Balanced Classes: Optional class weighting for imbalanced datasets

Expected Performance (on the bundled ~98-example dataset; with so few examples, treat these figures as rough estimates):

Test accuracy: 90-100%
Cross-validation F1: 0.90+
Per-category precision/recall: 0.85+

Dependencies

scikit-learn - Machine learning library (TF-IDF, Logistic Regression, GridSearchCV)
pandas - Data manipulation
numpy - Numerical operations
fastapi - Web framework for API
uvicorn - ASGI server
joblib - Model serialization
pydantic - Data validation

API Migration (v1 to v2)

Breaking Changes

Endpoints renamed:

/classify → /classify-error
/classify/batch → /classify-errors
/categories → /error-categories

Request/Response changes:

Request field: log_entry → error_message
Response field: category → error_category
Added: description and suggested_action fields
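Clients that cannot be updated right away could translate payloads at the boundary. The sketch below applies only the field renames listed above; the function names are hypothetical, and it assumes no other fields changed between versions.

```python
def upgrade_request(v1_request):
    """Translate a v1 request body to v2 (log_entry -> error_message)."""
    v2 = dict(v1_request)  # copy so the caller's dict is left untouched
    if "log_entry" in v2:
        v2["error_message"] = v2.pop("log_entry")
    return v2


def downgrade_response(v2_response):
    """Translate a v2 response for v1 consumers.

    Renames error_category -> category and drops the fields v2 added.
    """
    v1 = dict(v2_response)
    if "error_category" in v1:
        v1["category"] = v1.pop("error_category")
    for added in ("description", "suggested_action"):
        v1.pop(added, None)
    return v1
```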

Backward Compatibility

Legacy endpoints still work:

/classify redirects to /classify-error
/categories redirects to /error-categories

License

This project is provided as-is for demonstration purposes.
