A production-ready MLOps pipeline that automatically classifies Reddit content using advanced multi-label machine learning with enterprise-level automation, experiment tracking, and continuous learning.
- Multi-Label Classification: Simultaneous analysis across 5 dimensions (Safety, Toxicity, Sentiment, Topic, Engagement).
- Automated MLOps: Weekly retraining with automated champion model selection.
- Experiment Tracking: Full integration with MLflow & DagsHub to track metrics (F1-score, Accuracy) for every run.
- Production Scaling: Handles 25,000+ posts per training cycle.
- Real-time Inference: Sub-second response times using optimized `.joblib` model serialization.
| Category | Description | Classifications |
|---|---|---|
| Safety | Content safety assessment | Safe, NSFW |
| Toxicity | Harmful content detection | Non-toxic, Toxic |
| Sentiment | Emotional tone analysis | Positive, Neutral, Negative |
| Topic | Content categorization | Technology, Gaming, Business, Health |
| Engagement | Viral potential prediction | High, Low Engagement |
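Conceptually, each prediction yields one label per dimension. A minimal pure-Python sketch of how five per-dimension outputs combine into a single result (every "model" below is a hypothetical stand-in, not the project's trained classifiers):

```python
# Hypothetical sketch: combine per-dimension predictions into one result.
# The real pipeline uses trained scikit-learn models; each "model" here
# is a toy stand-in so the multi-label output structure is clear.

DIMENSIONS = {
    "safety": lambda text: "NSFW" if "nsfw" in text.lower() else "Safe",
    "toxicity": lambda text: "Toxic" if "hate" in text.lower() else "Non-toxic",
    "sentiment": lambda text: "Positive" if "love" in text.lower() else "Neutral",
    "topic": lambda text: "Gaming" if "game" in text.lower() else "Technology",
    "engagement": lambda text: "High" if len(text) > 40 else "Low",
}

def classify(text: str) -> dict:
    """Run every dimension's classifier and collect one label each."""
    return {dim: model(text) for dim, model in DIMENSIONS.items()}

result = classify("I love this new game, best release this year by far!")
```

The key design point is that the five dimensions are independent heads over the same input, so each can be retrained or swapped without touching the others.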
MLOps Pipeline:
- Data Collection: Weekly automated Reddit data ingestion (25,000+ posts).
- Feature Engineering: TF-IDF vectorization (10k features, 1-2 grams).
- Multi-Model Training: Trains 5 different algorithms (Logistic Regression, SVM, Naive Bayes, LightGBM, MLP) in parallel.
- Experiment Tracking: Logs all parameters and metrics to DagsHub/MLflow.
- Champion Selection: Automatically compares F1-scores and selects the single best model for deployment.
- Deployment: Automated Git LFS versioning and cloud deployment.
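The champion-selection step boils down to ranking candidate runs by F1-score and keeping only the winner. A minimal stdlib sketch (model names and scores below are illustrative, not the project's actual results):

```python
# Hypothetical F1 scores for one training cycle (illustrative numbers).
candidate_runs = {
    "logistic_regression": 0.861,
    "linear_svm": 0.874,
    "naive_bayes": 0.842,
    "lightgbm": 0.883,
    "mlp": 0.869,
}

def select_champion(runs: dict) -> tuple:
    """Return (model_name, f1) of the best-scoring candidate."""
    name = max(runs, key=runs.get)
    return name, runs[name]

champion, f1 = select_champion(candidate_runs)
# Only the champion is then serialized and versioned via Git LFS.
```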
| Metric | Value | Description |
|---|---|---|
| Binary F1-Score | 88.3% | SFW/NSFW classification accuracy |
| Multi-Label Jaccard | 82.7% | Overall multi-category performance |
| Training Data | 25,000+ | Reddit posts per training cycle |
| Inference Speed | <100ms | Real-time response capability |
| Model Size | ~50MB | Optimized .joblib compression |
| Automation | Weekly | Continuous learning and updates |
Core Technologies:
- Python 3.11, Scikit-learn, LightGBM, Pandas, NumPy
- Streamlit, Plotly (Visualization)
- PRAW (Reddit API), TF-IDF (NLP)
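The feature-engineering step described above maps directly onto scikit-learn's `TfidfVectorizer` with the stated parameters (10k features, 1-2 grams). A sketch with an illustrative toy corpus standing in for real Reddit posts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Parameters mirror the pipeline description: 10k features, 1-2 grams.
vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))

# Toy corpus; the real input is 25,000+ Reddit posts per cycle.
corpus = [
    "new gpu benchmarks look impressive",
    "this game runs great on the new gpu",
    "stock market closed higher today",
]
X = vectorizer.fit_transform(corpus)  # sparse (n_docs, n_features) matrix
```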
MLOps & Infrastructure:
- MLflow (Experiment Tracking), DagsHub (Remote Storage)
- GitHub Actions (CI/CD), Git LFS (Model Versioning)
- Joblib (Efficient Model Serialization)
- Python 3.11+
- Git with Git LFS support
- Reddit API credentials (for data collection)
- DagsHub Account (for experiment tracking)
```bash
git clone https://github.com/RobinMillford/Reddit-content-classifier.git
cd Reddit-content-classifier

# Setup Git LFS for model files
git lfs install
git lfs pull

# Create virtual environment
python -m venv myenv

# Activate virtual environment
# Windows:
myenv\Scripts\activate
# macOS/Linux:
source myenv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Create a `.env` file in the project root. You need both Reddit API keys (for data) and DagsHub keys (for tracking).
```env
# Reddit API (Data Collection)
REDDIT_CLIENT_ID=your_reddit_client_id
REDDIT_CLIENT_SECRET=your_reddit_secret
REDDIT_USER_AGENT=YourAppName/1.0

# DagsHub/MLflow (Experiment Tracking)
DAGSHUB_OWNER=your_dagshub_username
DAGSHUB_REPO=your_dagshub_repo_name
DAGSHUB_TOKEN=your_dagshub_token_here
EXPERIMENT_NAME=your_experiment_name_here
```

```bash
# Start the web application
streamlit run app.py
```

🌐 Access: Application runs at http://localhost:8501
To run the pipeline manually and trigger the Automatic Champion Selection:
```bash
# 1. Collect fresh training data
python src/ingest_data.py

# 2. Train 10 models, log to DagsHub, and save the best one locally
python src/train.py
```

```
├── src/
│   ├── ingest_data.py           # Reddit data collection script
│   └── train.py                 # ML training & auto-selection logic
├── .github/workflows/           # CI/CD automation
├── app.py                       # Streamlit web application
├── best_binary_model.joblib     # The champion binary model (Git LFS)
├── best_multi_model.joblib      # The champion multi-label model (Git LFS)
├── tfidf_vectorizer.joblib      # Text preprocessing pipeline (Git LFS)
└── model_metadata.joblib        # Model labels & encoders (Git LFS)
```
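At startup, the app presumably loads these `.joblib` artifacts with `joblib.load`. A round-trip sketch with a tiny stand-in payload (the real files hold fitted scikit-learn models; the file name below only echoes the repo layout):

```python
import os
import tempfile

import joblib

# Stand-in payload; in the real repo this file holds labels and encoders.
metadata = {"labels": ["Safe", "NSFW"], "version": "demo"}

path = os.path.join(tempfile.mkdtemp(), "model_metadata.joblib")
joblib.dump(metadata, path)   # what train.py would do after champion selection
restored = joblib.load(path)  # what app.py would do at startup
```

Joblib is preferred over plain pickle here because it compresses large NumPy arrays inside fitted models efficiently, which keeps the Git LFS artifacts small.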
Business Value: Demonstrates end-to-end ML engineering capabilities with production-ready automation, remote experiment tracking, and scalable infrastructure design.
Technical Expertise: Showcases expertise in MLOps, MLflow integration, automated pipelines, multi-label classification, and cloud deployment strategies.
Results Delivered: A system reaching 88%+ binary F1-score, processing 25,000+ posts weekly with zero-downtime continuous deployment.
This project is open source and welcomes contributions from the community.
How to Contribute:
- Fork the repository
- Create a feature branch: `git checkout -b feature/enhancement`
- Make your changes with proper testing
- Submit a pull request with a detailed description
Project Repository: github.com/RobinMillford/Reddit-content-classifier
This project demonstrates production-ready MLOps implementation suitable for enterprise content moderation systems.
