FastAPI service that classifies error messages by intent (not just log level). It uses TF-IDF + Logistic Regression to map an error string into 10 practical categories (e.g., NULL_REFERENCE, NETWORK_CONNECTION, RESOURCE_EXHAUSTION), returning a confidence score + suggested action.
This solution uses traditional machine learning (TF-IDF + Logistic Regression), NOT a large language model:
- ✅ No model download required - the model is trained locally from scratch using your training data
- ✅ No GPU needed - runs on any CPU
- ✅ No internet required after initial pip install - fully offline capable
- ✅ Training takes seconds - not hours
- ✅ Model size is tiny - tens of KB, not GB
- RESTful API with FastAPI
- Semantic error classification by intent (not just log levels)
- 10 human-readable error categories
- Confidence scores and probability distributions
- Category descriptions and suggested actions for each error type
- Single and batch classification endpoints
- Health check and category listing endpoints
- Interactive API documentation (Swagger UI)
- Enhanced model with feature engineering and hyperparameter tuning
The API classifies errors into 10 semantic categories:
| Category | Description | Example |
|---|---|---|
| NULL_REFERENCE | Null pointer or reference errors | "NullPointerException in UserService" |
| RESOURCE_EXHAUSTION | Out of memory, disk full, connection pool exhausted | "OutOfMemoryError: Java heap space" |
| NETWORK_CONNECTION | Connection timeout, refused, unreachable | "Connection timeout after 30 seconds" |
| DUPLICATE_CONFLICT | Entry already exists, unique constraint violation | "Duplicate key error: email already exists" |
| DATABASE_ERROR | SQL errors, deadlocks, transaction failures | "SQLException: table 'users' does not exist" |
| TIMEOUT | Operation exceeded time limit | "Request timeout: exceeded 30 second limit" |
| CONFIGURATION | Missing/invalid configuration | "Required property 'database.url' not found" |
| NOT_FOUND | Resource, file, or endpoint not found | "FileNotFoundException: config.properties" |
| AUTHENTICATION | Invalid credentials, access denied | "Authentication failed: invalid password" |
| VALIDATION | Input validation or format errors | "Invalid email format" |
log-intent-classifier/
├── .gitignore # Ignore venv, generated models, etc.
├── requirements.txt # Python dependencies
├── README.md # This file
├── run.py # API server entry point
├── test_server.py # Minimal server smoke-test (optional)
├── assets/
│ └── log_classifier_before_after.png # README image
├── data/
│ └── training_data.csv # Training data (~98 examples)
├── models/ # Saved model files (created after training; ignored by git)
│ ├── feature_union.pkl # Feature pipeline (TF-IDF + engineered features)
│ └── logistic_model.pkl # Trained Logistic Regression model
├── scripts/
│ └── generate_linkedin_image.py # Generates the README image (optional)
└── src/
├── __init__.py
├── train_model.py # Training script with hyperparameter tuning
├── classifier.py # Error classifier with category metadata
└── api.py # FastAPI application
- Python 3.9 or higher installed
- pip (Python package manager)
# Navigate to project folder
cd "C:\path\to\log-intent-classifier"
# Create virtual environment
python -m venv venv
# Activate virtual environment (Windows PowerShell)
.\venv\Scripts\Activate.ps1
# Install dependencies (downloads packages from PyPI, ~50MB total)
pip install -r requirements.txt

Note: If you get an execution policy error in PowerShell, run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Train the model using the provided training data:
python -m src.train_model

What happens:
- Reads `data/training_data.csv` (~98 examples across 10 categories)
- Enhanced text preprocessing (normalizes whitespace, extracts error patterns)
- Extracts engineered features (error type indicators, message stats, special patterns)
- Builds TF-IDF vocabulary with optimized parameters (10,000 features)
- Performs hyperparameter tuning with GridSearchCV (5-fold cross-validation)
- Trains Logistic Regression classifier with best parameters
- Saves 2 files to the `models/` folder:
  - `feature_union.pkl` (~50KB) - Complete feature pipeline
  - `logistic_model.pkl` (~10KB) - Trained classifier
- Prints comprehensive evaluation metrics
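To make the 5-fold cross-validation step more concrete, here is a minimal pure-Python sketch of how ~98 examples can be split into five train/validation folds. The helper name `k_fold_indices` is hypothetical and not part of this project. It shows plain (unstratified) k-fold only; scikit-learn's GridSearchCV does the splitting internally and, for classifiers, stratifies folds by label by default.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for plain k-fold cross-validation.

    Hypothetical helper for illustration; not part of this project.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(98, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: {len(train_idx)} train / {len(val_idx)} validation")
```

Each example lands in exactly one validation fold, so every labeled message is used for both training and evaluation across the five rounds.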
Expected output:
Loading training data...
Preprocessing text data...
Building feature pipeline...
Extracting features...
Hyperparameter tuning with GridSearchCV...
Best parameters: {'C': 10, 'class_weight': 'balanced', 'solver': 'liblinear'}
Best cross-validation score: 0.95+
Test set accuracy: 0.95+
Classification Report: (precision, recall, F1-score per category)
Confusion Matrix: (shows classification performance)
Cross-validation F1 scores: [...]
Mean CV F1 score: 0.95+ (+/- ...)
Saving models...
Feature union pipeline saved to: models/feature_union.pkl
Logistic Regression model saved to: models/logistic_model.pkl
Training complete!
python run.py

Expected output:
INFO: Loading models from models/...
INFO: Error intent classification models loaded successfully!
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
INFO: Started server process
INFO: Waiting for application startup.
INFO: Application startup complete.
The API is now running at http://127.0.0.1:8000
Open your browser and navigate to:
- Swagger UI: http://127.0.0.1:8000/docs
- ReDoc: http://127.0.0.1:8000/redoc
Invoke-RestMethod -Uri "http://127.0.0.1:8000/health"

Response:
{
"status": "healthy",
"models_loaded": true,
"api_version": "2.0.0"
}

Invoke-RestMethod -Uri "http://127.0.0.1:8000/error-categories"

Response:
{
"categories": [
{
"name": "NULL_REFERENCE",
"description": "Null pointer or reference errors when accessing uninitialized objects",
"suggested_action": "Check for null values before accessing object members, add null checks"
},
{
"name": "NETWORK_CONNECTION",
"description": "Network connectivity issues (timeout, refused, unreachable)",
"suggested_action": "Check network connectivity, verify service availability, review firewall rules"
}
// ... (8 more categories)
],
"total_count": 10
}

$body = @{
error_message = "Connection timeout: failed to connect to database after 30 seconds"
include_metadata = $true
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://127.0.0.1:8000/classify-error" `
-Method POST `
-ContentType "application/json" `
-Body $body

Response:
{
"error_category": "NETWORK_CONNECTION",
"confidence": 0.89,
"all_probabilities": {
"NULL_REFERENCE": 0.01,
"RESOURCE_EXHAUSTION": 0.02,
"NETWORK_CONNECTION": 0.89,
"DUPLICATE_CONFLICT": 0.01,
"DATABASE_ERROR": 0.02,
"TIMEOUT": 0.03,
"CONFIGURATION": 0.01,
"NOT_FOUND": 0.01,
"AUTHENTICATION": 0.00,
"VALIDATION": 0.00
},
"description": "Network connectivity issues (timeout, refused, unreachable)",
"suggested_action": "Check network connectivity, verify service availability, review firewall rules"
}

$body = @{
error_messages = @(
"NullPointerException in UserService.getUser()",
"Duplicate key error: user with email already exists",
"Invalid input: age must be between 0 and 120"
)
include_metadata = $true
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://127.0.0.1:8000/classify-errors" `
-Method POST `
-ContentType "application/json" `
-Body $body

Response:
{
"results": [
{
"error_category": "NULL_REFERENCE",
"confidence": 0.92,
"all_probabilities": {...},
"description": "Null pointer or reference errors...",
"suggested_action": "Check for null values before accessing..."
},
{
"error_category": "DUPLICATE_CONFLICT",
"confidence": 0.87,
"all_probabilities": {...},
"description": "Duplicate entries or unique constraint violations",
"suggested_action": "Check for existing records before insertion..."
},
{
"error_category": "VALIDATION",
"confidence": 0.85,
"all_probabilities": {...},
"description": "Input validation or data format errors",
"suggested_action": "Validate input format, check data constraints..."
}
]
}

# Single classification
curl -X POST "http://127.0.0.1:8000/classify-error" \
-H "Content-Type: application/json" \
-d '{"error_message": "Connection timeout after 30 seconds", "include_metadata": true}'
# Batch classification
curl -X POST "http://127.0.0.1:8000/classify-errors" \
-H "Content-Type: application/json" \
-d '{"error_messages": ["NullPointerException...", "Duplicate key..."], "include_metadata": true}'

Automatically categorize incoming errors for better monitoring:
# Classify errors from application logs
errors = ["Connection timeout...", "NullPointerException...", ...]
response = classify_errors_batch(errors)
# Group by category for dashboard
error_counts = {}
for result in response['results']:
    category = result['error_category']
    error_counts[category] = error_counts.get(category, 0) + 1

Route alerts to appropriate teams based on error type:
result = classify_error("Database deadlock detected")
if result['error_category'] == 'DATABASE_ERROR':
    alert_database_team(result)
elif result['error_category'] == 'NETWORK_CONNECTION':
    alert_infrastructure_team(result)

Create tickets with proper categorization and suggested actions:
result = classify_error(error_message)
create_ticket(
    category=result['error_category'],
    description=result['description'],
    suggested_action=result['suggested_action'],
    priority=calculate_priority(result['confidence'])
)

Analyze error patterns over time:
# Classify all errors from the past week
errors_this_week = get_errors_from_logs()
classifications = classify_errors_batch(errors_this_week)
# Identify trending error categories
analyze_trends(classifications)

To improve classification for your specific errors:
- Edit `data/training_data.csv`
- Add your own error examples with correct categories
- Ensure CSV format: `error_message,error_category`
- Use these category names: `NULL_REFERENCE`, `RESOURCE_EXHAUSTION`, `NETWORK_CONNECTION`, `DUPLICATE_CONFLICT`, `DATABASE_ERROR`, `TIMEOUT`, `CONFIGURATION`, `NOT_FOUND`, `AUTHENTICATION`, `VALIDATION`
- Re-run training: `python -m src.train_model`
- Restart the API server
Example:
error_message,error_category
"Custom null error in MyService",NULL_REFERENCE
"My application out of memory",RESOURCE_EXHAUSTION

- Enhanced Preprocessing:
  - Normalizes whitespace and removes punctuation
  - Preserves important error patterns (exception names, error codes)
- Feature Engineering:
  - Extracts error type indicators (null, timeout, not found, etc.)
  - Calculates message statistics (character count, word count)
  - Detects special patterns (IP addresses, file paths, numbers)
- TF-IDF Vectorization:
  - Converts error text into numerical features using optimized parameters
  - Uses unigrams and bigrams for better context
  - Vocabulary size: 10,000 features
  - Applies sublinear term frequency scaling
- Feature Combination:
  - Combines TF-IDF features with engineered features using FeatureUnion
  - Scales engineered features for optimal performance
- Logistic Regression:
  - Hyperparameter-tuned with GridSearchCV (5-fold cross-validation)
  - Uses L2 regularization to prevent overfitting
  - Optimized C parameter and solver selection
- Classification:
  - Predicts error category with confidence score
  - Returns probabilities for all categories
  - Provides human-readable descriptions and suggested actions
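The feature engineering step can be pictured with a small sketch like the one below. The indicator names, regular expressions, and the `engineered_features` function are assumptions made for illustration; the actual feature list in `src/train_model.py` may differ.

```python
import re

# Assumed keyword indicators; the project's real feature list may differ.
INDICATORS = {
    "has_null": re.compile(r"null", re.IGNORECASE),
    "has_timeout": re.compile(r"timed?\s?out", re.IGNORECASE),
    "has_not_found": re.compile(r"not\s?found", re.IGNORECASE),
}
IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
PATH_PATTERN = re.compile(r"(?:[A-Za-z]:\\|/)[\w.\-\\/]+")
NUMBER_PATTERN = re.compile(r"\d+")


def engineered_features(message):
    """Return indicator flags, message statistics, and special-pattern flags."""
    features = {name: int(bool(p.search(message))) for name, p in INDICATORS.items()}
    features["char_count"] = len(message)
    features["word_count"] = len(message.split())
    features["has_ip"] = int(bool(IP_PATTERN.search(message)))
    features["has_path"] = int(bool(PATH_PATTERN.search(message)))
    features["has_number"] = int(bool(NUMBER_PATTERN.search(message)))
    return features


example = engineered_features("Connection timeout to 10.0.0.5 after 30 seconds")
print(example)
```

Counts such as `char_count` live on a much larger scale than the 0/1 flags, which is why the pipeline scales engineered features before combining them with TF-IDF.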
Traditional approach:
"ERROR: Connection timeout" → category: ERROR
Semantic intent approach:
"Connection timeout after 30 seconds" → category: NETWORK_CONNECTION
+ description: "Network connectivity issues (timeout, refused, unreachable)"
+ suggested_action: "Check network connectivity, verify service availability"
Benefits:
- Actionable insights: Know what type of error occurred, not just severity
- Better routing: Send errors to the right team (DB team, network team, etc.)
- Improved monitoring: Track specific error patterns (e.g., spike in TIMEOUT errors)
- Human-friendly: Categories are self-explanatory
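As a rough illustration of the routing benefit, the sketch below maps categories to teams and falls back to manual triage when confidence is low. The team names, the 0.5 threshold, and the `route` function are hypothetical; only the `error_category` and `confidence` fields come from the API response.

```python
# Hypothetical team names and threshold, chosen for illustration only.
TEAM_BY_CATEGORY = {
    "DATABASE_ERROR": "database-team",
    "NETWORK_CONNECTION": "infrastructure-team",
    "AUTHENTICATION": "security-team",
}
FALLBACK_TEAM = "on-call-triage"


def route(result, min_confidence=0.5):
    """Pick a team for a classification result shaped like the API response."""
    if result["confidence"] < min_confidence:
        return FALLBACK_TEAM  # too uncertain to route automatically
    return TEAM_BY_CATEGORY.get(result["error_category"], FALLBACK_TEAM)


print(route({"error_category": "NETWORK_CONNECTION", "confidence": 0.89}))
print(route({"error_category": "NETWORK_CONNECTION", "confidence": 0.31}))
```

Keeping the mapping in a dictionary means a new category or team is a one-line change rather than another `elif` branch.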
You send an error message to /classify-error:
POST /classify-error
{
"error_message": "Database connection timeout after 30 seconds"
}

Your message gets cleaned up:
- Converted to lowercase
- Special characters normalized
- Important patterns preserved
"Database connection timeout after 30 seconds"
→ "database connection timeout after 30 seconds"
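A minimal sketch of this cleanup step, assuming that letters, digits, underscores, dots, and hyphens are the "important patterns" worth keeping. The `preprocess` function is illustrative and may not match the project's actual implementation.

```python
import re


def preprocess(message):
    """Lowercase, replace most punctuation with spaces, and collapse whitespace.

    Assumption: letters, digits, underscores, dots and hyphens are kept so that
    identifiers such as exception names, error codes, and property keys survive.
    """
    text = message.lower()
    text = re.sub(r"[^\w.\-\s]", " ", text)  # drop other special characters
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    return text.strip()


print(preprocess("Database connection timeout after 30 seconds"))
print(preprocess("  SQLException:   table 'users'\tdoes not exist "))
```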
Computers can't think about words directly, so the text is converted to numbers:
- Each word gets a score based on how important/rare it is
- The system learned these scores from the training data
- This creates a unique "fingerprint" for your error message
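The idea of scoring words by rarity can be sketched in a few lines of plain Python. This uses the smoothed IDF formula that scikit-learn's `TfidfVectorizer` applies by default, but leaves out bigrams, sublinear scaling, and vector normalization, so the numbers will not match the real pipeline. The three-message corpus is made up for illustration.

```python
import math
from collections import Counter

# Tiny made-up corpus, for illustration only.
corpus = [
    "connection timeout after 30 seconds",
    "connection refused by host",
    "null pointer in user service",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many messages does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))


def idf(word):
    # Smoothed inverse document frequency: rarer words get larger weights.
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf(doc):
    """Weight each word by its count in the message times its IDF."""
    counts = Counter(doc)
    return {word: count * idf(word) for word, count in counts.items()}


weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {weight:.3f}")
```

Here "connection" appears in two of the three messages, so it receives a lower weight than "timeout", which appears in only one.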
The AI (Logistic Regression model) scores all 10 categories:
Input: "Database connection timeout after 30 seconds"
Probabilities (illustrative values; across all 10 categories they sum to 1):
✅ NETWORK_CONNECTION: 0.85 (has "connection", "timeout")
DATABASE_ERROR: 0.08 (has "database")
TIMEOUT: 0.03
RESOURCE_EXHAUSTION: 0.02
... others: below 0.01 each
Winner: NETWORK_CONNECTION with 85% confidence!
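Whatever raw scores the model assigns internally, the values reported in `all_probabilities` are normalized so that they sum to 1. The sketch below shows one common normalization, softmax, on made-up scores. This is an assumption about the mechanism: with a one-vs-rest solver such as liblinear, scikit-learn instead rescales per-class sigmoid outputs, but the result is likewise a distribution that sums to 1.

```python
import math


def softmax(scores):
    """Convert raw per-category scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - peak) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Made-up raw scores for three of the ten categories.
raw_scores = {"NETWORK_CONNECTION": 3.0, "DATABASE_ERROR": 0.6, "TIMEOUT": -0.4}
probabilities = softmax(raw_scores)
winner = max(probabilities, key=probabilities.get)
print(winner, round(probabilities[winner], 2))
```

The reported confidence is simply the probability of the winning category.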
How did it learn to score? During training, the model studied ~98 labeled examples and learned patterns like:
- Messages with "null", "NullPointer" → usually `NULL_REFERENCE`
- Messages with "timeout", "connection failed" → usually `NETWORK_CONNECTION`
- Messages with "memory", "exhausted" → usually `RESOURCE_EXHAUSTION`
{
"error_category": "NETWORK_CONNECTION",
"confidence": 0.85,
"all_probabilities": {
"NETWORK_CONNECTION": 0.85,
"DATABASE_ERROR": 0.08,
"TIMEOUT": 0.03,
...
},
"description": "Network connectivity issues (timeout, refused, unreachable)",
"suggested_action": "Check network connectivity, verify service availability"
}

When you run `python -m src.train_model`, here's what happens:
- Study Phase: The model reads ~98 example error messages with their correct categories
- Pattern Recognition: It finds patterns - which words appear in which categories
- Practice Tests: It tests itself with cross-validation to check accuracy
- Save Knowledge: It saves what it learned to the `models/` folder
- Ready to Use: When the API starts, it loads this saved knowledge
It's like studying for a test:
- Training data = study materials
- Training = studying and memorizing
- Models folder = brain with memorized knowledge
- Classification = answering questions on the test
That's it! The system uses math (not magic) to recognize patterns and classify new errors based on what it learned from examples.
Error: FileNotFoundError: Feature union pipeline not found
Solution: Run the training script first:
python -m src.train_model

Error: Address already in use
Solution: Change the port in run.py:
uvicorn.run("src.api:app", host="127.0.0.1", port=8001)

Error: ModuleNotFoundError
Solution: Ensure virtual environment is activated and dependencies are installed:
.\venv\Scripts\Activate.ps1
pip install -r requirements.txt

If confidence scores are consistently low (<0.5):
- Add more training examples for underrepresented categories
- Review ambiguous error messages that don't fit clearly into one category
- Consider adding more specific examples to training data
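To find underrepresented categories before retraining, a quick check along these lines may help. It assumes the two-column CSV format described in this README (`error_message,error_category`); the `summarize_training_data` helper is hypothetical and runs here on an in-memory sample rather than the real file.

```python
import csv
import io
from collections import Counter

VALID_CATEGORIES = {
    "NULL_REFERENCE", "RESOURCE_EXHAUSTION", "NETWORK_CONNECTION",
    "DUPLICATE_CONFLICT", "DATABASE_ERROR", "TIMEOUT", "CONFIGURATION",
    "NOT_FOUND", "AUTHENTICATION", "VALIDATION",
}


def summarize_training_data(csv_file):
    """Count examples per category and collect rows with unknown labels."""
    counts = Counter({category: 0 for category in VALID_CATEGORIES})
    invalid_rows = []
    # start=2 because line 1 of the file is the header row
    for line_no, row in enumerate(csv.DictReader(csv_file), start=2):
        category = (row.get("error_category") or "").strip()
        if category in VALID_CATEGORIES:
            counts[category] += 1
        else:
            invalid_rows.append((line_no, category))
    return counts, invalid_rows


# In-memory sample; for the real file use open("data/training_data.csv", newline="")
sample = io.StringIO(
    "error_message,error_category\n"
    '"Custom null error in MyService",NULL_REFERENCE\n'
    '"My application out of memory",RESOURCE_EXHAUSTION\n'
    '"Something odd happened",UNKNOWN\n'
)
counts, invalid_rows = summarize_training_data(sample)
print(sorted(counts.items(), key=lambda kv: kv[1])[:3])  # least-represented first
print(invalid_rows)
```

Categories with a count of zero or only a handful of rows are the first candidates for new examples, and any row reported in `invalid_rows` has a label outside the 10 supported categories.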
The model uses several techniques for high accuracy:
- Feature Engineering: Extracts semantic indicators from error messages
- TF-IDF Optimization: 10,000 features, bigrams, sublinear scaling
- Hyperparameter Tuning: GridSearchCV with 5-fold cross-validation
- Regularization: L2 penalty to prevent overfitting
- Balanced Classes: Optional class weighting for imbalanced datasets
Expected Performance:
- Test accuracy: 90-100%
- Cross-validation F1: 0.90+
- Per-category precision/recall: 0.85+
- `scikit-learn` - Machine learning library (TF-IDF, Logistic Regression, GridSearchCV)
- `pandas` - Data manipulation
- `numpy` - Numerical operations
- `fastapi` - Web framework for API
- `uvicorn` - ASGI server
- `joblib` - Model serialization
- `pydantic` - Data validation
Endpoints renamed:
- `/classify` → `/classify-error`
- `/classify/batch` → `/classify-errors`
- `/categories` → `/error-categories`

Request/Response changes:
- Request field: `log_entry` → `error_message`
- Response field: `category` → `error_category`
- Added: `description` and `suggested_action` fields

Legacy endpoints still work:
- `/classify` redirects to `/classify-error`
- `/categories` redirects to `/error-categories`
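For client code that cannot be updated all at once, a thin compatibility shim along these lines might ease the transition. The function names are hypothetical, and only the renames listed in these migration notes are handled; the legacy batch request field is not documented here, so it is left untouched.

```python
# Field and path renames taken from the migration notes in this README.
REQUEST_FIELD_RENAMES = {"log_entry": "error_message"}
ENDPOINT_RENAMES = {
    "/classify": "/classify-error",
    "/classify/batch": "/classify-errors",
    "/categories": "/error-categories",
}


def upgrade_request(path, payload):
    """Translate a legacy (path, JSON payload) pair to the current API shape."""
    new_path = ENDPOINT_RENAMES.get(path, path)
    new_payload = {REQUEST_FIELD_RENAMES.get(k, k): v for k, v in payload.items()}
    return new_path, new_payload


def downgrade_response(response):
    """Add the legacy `category` key so older callers keep working."""
    legacy = dict(response)
    legacy["category"] = response["error_category"]
    return legacy


path, payload = upgrade_request("/classify", {"log_entry": "Connection refused"})
print(path, payload)
print(downgrade_response({"error_category": "NETWORK_CONNECTION", "confidence": 0.89}))
```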
This project is provided as-is for demonstration purposes.
