Instagram Post Scraping & Event Classification Pipeline
An autonomous event management system with AI-powered content analysis, engineered by the IEEE MSIT Development Team.
This system removes the need for manual administrative work by scraping, analyzing, and classifying Instagram posts from IEEE MSIT's official account. It distinguishes events from achievements, generates structured JSON metadata with an AI model, detects duplicate posts, and exposes the results to frontend applications for dynamic content rendering.
The system operates on a daily cron schedule, ensuring continuous content synchronization without human intervention:
- Instagram Content Extraction → Advanced scraping using `instagrapi` with session persistence
- AI-Powered Classification → Google Gemini 2.5 Flash model for intelligent content analysis
- Structured Data Generation → Pydantic-validated JSON schema with comprehensive metadata
- Duplicate Detection & Prevention → Semantic similarity analysis using LangChain
- Cloud Storage Integration → Cloudinary CDN for optimized image delivery
- Database Persistence → MongoDB Atlas with Motor async driver
- Frontend API Delivery → FastAPI endpoints for real-time data consumption
The pipeline is built on the following technology stack:
| Component | Technologies |
|---|---|
| Backend Framework | FastAPI, Uvicorn |
| AI & Machine Learning | Google Gemini 2.5 Flash, LangChain Core, Structured Outputs |
| Instagram API Integration | Instagrapi (Advanced Instagram Private API) |
| Cloud Infrastructure | Cloudinary CDN, MongoDB Atlas |
| Data Processing | Pydantic, Motor (Async MongoDB), HTTPX |
| Authentication & Security | Environment-based configuration, Session management |
| Scheduling & Automation | Cron-based daily execution |
| Image Processing | Base64 encoding, Multi-format support |
The system employs sophisticated AI models to perform multi-dimensional content classification:
| Classification Feature | Implementation |
|---|---|
| Event Type Detection | Workshop, Hackathon, Seminar, Conference, Bootcamp, Webinar classification |
| Category Extraction | AI/ML, Web Development, Cybersecurity, Sustainability domain identification |
| Status Determination | Upcoming, Registration-open, Live, Completed status inference |
| Temporal Analysis | Date extraction and event timeline processing |
| Relevance Filtering | Event vs. Achievement vs. Announcement classification |
| Duplicate Prevention | Semantic similarity analysis with fuzzy matching algorithms |
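
As a point of comparison for the Status Determination row, the sketch below derives a status from event dates alone. It is a rule-based illustration, not the Gemini prompt the pipeline uses: the function name `infer_status` is hypothetical, and the `registration-open` status is left out because it cannot be inferred from dates.

```python
from datetime import date
from typing import Optional


def infer_status(start: date, end: Optional[date], today: date) -> str:
    """Return 'upcoming', 'live', or 'completed' from event dates alone.

    Hypothetical helper for illustration; the real pipeline asks the model
    to infer the status from the post content and the current date.
    """
    last_day = end or start  # single-day events carry no separate end date
    if today < start:
        return "upcoming"
    if today <= last_day:
        return "live"
    return "completed"


if __name__ == "__main__":
    # A post dated 2025-01-29 announcing an event on 2025-02-15.
    print(infer_status(date(2025, 2, 15), None, date(2025, 1, 29)))
```

A deterministic check like this could also serve as a sanity test on the status returned by the model.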
```typescript
interface EventInfo {
  title?: string;            // AI-extracted event title
  type?: string;             // Event categorization
  category?: string;         // Domain classification
  status?: string;           // Current event status
  startDate?: string;        // Temporal extraction
  endDate?: string;          // Event duration
  venue?: string;            // Location identification
  registrationType?: string; // Access level determination
  actionLinks?: string[];    // Contact and registration extraction
  prizes?: string[];         // Prize structure identification
  description?: string;      // Comprehensive event details
  isRelevant: boolean;       // AI relevance determination
  cloudinary_url?: string;   // CDN-optimized image URL
  post_date?: string;        // Original posting timestamp
}
```

```mermaid
graph TD
    A[Cron Trigger - Daily] --> B[Instagram Session Authentication]
    B --> C[Scrape Latest Posts - Max 45]
    C --> D[Image & Caption Extraction]
    D --> E[AI Content Analysis - Gemini 2.5]
    E --> F{Event Relevance Check}
    F -->|Relevant| G[Database Query - Existing Events]
    F -->|Irrelevant| H[Skip Processing]
    G --> I{Duplicate Detection}
    I -->|Unique| J[Cloudinary Upload]
    I -->|Duplicate| K[Skip Insertion]
    J --> L[MongoDB Insertion]
    L --> M[JSON Response Generation]
    M --> N[Frontend API Consumption]
```
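
The control flow of the flowchart can be sketched in plain Python. This is a minimal illustration, not the project's code: `run_pipeline` and its callable parameters (`analyse`, `is_duplicate`, `upload_image`) are hypothetical stand-ins for the Gemini, duplicate-detection, and Cloudinary steps, and comparing against events inserted earlier in the same run is an assumption.

```python
from typing import Callable, Iterable, List, Optional


def run_pipeline(
    posts: Iterable[dict],
    analyse: Callable[[dict], Optional[dict]],
    is_duplicate: Callable[[dict, List[dict]], bool],
    upload_image: Callable[[dict], str],
    existing: List[dict],
    max_posts: int = 45,  # the flowchart caps each run at 45 posts
) -> List[dict]:
    """Return the events that would be inserted for one scheduled run."""
    inserted: List[dict] = []
    for index, post in enumerate(posts):
        if index >= max_posts:
            break
        event = analyse(post)
        # Relevance check: skip posts the analyser marks as irrelevant.
        if not event or not event.get("isRelevant"):
            continue
        # Duplicate check against stored events and events from this run
        # (assumption: the real query would see earlier insertions too).
        if is_duplicate(event, existing + inserted):
            continue
        # Only unique, relevant events reach the upload and insertion steps.
        inserted.append(dict(event, cloudinary_url=upload_image(post)))
    return inserted
```

Passing the steps in as callables keeps the sketch testable without network access; each one can be replaced by a stub.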
The system implements intelligent API key rotation to handle high-volume processing:
```python
class APIKeyManager:
    """
    - Round-robin key distribution
    - Automatic failover mechanisms
    - Rate limit optimization
    - Multi-key concurrent processing
    """
```

| Feature | Description |
|---|---|
| Autonomous Session Management | Persistent Instagram authentication with automatic session recovery and regeneration |
| Intelligent Rate Limiting | Dynamic delay mechanisms (30-120s) to prevent API throttling and maintain compliance |
| Multi-Format Media Support | Comprehensive handling of images, carousels, and video thumbnails with format optimization |
| Cloud-Native Architecture | Serverless-ready design with horizontal scaling capabilities |
| Real-time Processing | Asynchronous execution pipeline with concurrent processing optimization |
| Error Recovery Systems | Comprehensive exception handling with automatic retry mechanisms |
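
The rate-limiting and retry rows above can be illustrated with a small helper. This is a sketch under assumptions: the names `polite_delay` and `call_with_retry` are hypothetical, the 30-120 second window comes from the table, and picking the delay uniformly at random is one possible reading of "dynamic delay".

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def polite_delay(min_s: float = 30.0, max_s: float = 120.0) -> float:
    """Pick a random pause length in seconds within the configured window."""
    return random.uniform(min_s, max_s)


def call_with_retry(
    action: Callable[[], T],
    attempts: int = 3,
    min_s: float = 30.0,
    max_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, pausing for a random delay between failed attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < attempts:
                sleep(polite_delay(min_s, max_s))
    assert last_error is not None
    raise last_error
```

The `sleep` parameter exists so that tests can substitute a no-op instead of waiting for real delays.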
| Feature | Description |
|---|---|
| Vision-Language Model Integration | Google Gemini 2.5 Flash for multimodal content understanding |
| Semantic Duplicate Detection | Advanced similarity analysis using LangChain for content deduplication |
| Context-Aware Classification | Temporal context integration for accurate event status determination |
| Structured Output Generation | Pydantic-enforced schema validation for consistent data formatting |
| Multi-Prompt Engineering | Specialized prompts for event classification and similarity detection |
| Confidence Scoring | Boolean relevance flag (`isRelevant`) set by the model for each post |
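
To make the duplicate-detection idea concrete, the sketch below compares event titles with fuzzy string matching from the standard library. It is a stand-in, not the project's method: the pipeline is described as using LangChain with a similarity prompt, whereas this example uses `difflib.SequenceMatcher`. The function names and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher
from typing import Iterable


def normalise(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_probable_duplicate(
    candidate: dict,
    existing: Iterable[dict],
    threshold: float = 0.85,  # assumed cut-off, tune against real data
) -> bool:
    """Flag the candidate if a stored event has a near-identical title
    and the same startDate (two missing dates also count as equal)."""
    title = normalise(candidate.get("title") or "")
    if not title:
        return False
    for event in existing:
        other = normalise(event.get("title") or "")
        same_day = candidate.get("startDate") == event.get("startDate")
        if same_day and SequenceMatcher(None, title, other).ratio() >= threshold:
            return True
    return False
```

Character-level matching catches reposts with small caption edits, but it misses paraphrases, which is where a model-based comparison is more useful.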
| Feature | Description |
|---|---|
| MongoDB Atlas Integration | Cloud-native document storage with async Motor driver for high-performance operations |
| Cloudinary CDN Management | Automated image optimization, transformation, and global content delivery |
| Async Database Operations | Non-blocking database interactions for optimal performance |
| Event Collection Management | Dedicated collections for event data with indexing optimization |
| Backup & Recovery Systems | Automated data persistence with cloud redundancy |
Ensure you have the following installed on your development machine:
- Python 3.8+ (Latest stable version recommended)
- MongoDB Atlas Account (Cloud database)
- Cloudinary Account (Image CDN)
- Google AI API Key (Gemini access)
- Clone the Repository
  ```bash
  git clone https://github.com/AneeshAhuja31/ieee-automation.git
  cd ieee-automation
  ```
- Environment Setup
  ```bash
  # Create virtual environment
  python -m venv venv

  # Activate virtual environment
  # Windows
  venv\Scripts\activate
  # macOS/Linux
  source venv/bin/activate
  ```
- Install Dependencies
  ```bash
  # Core dependencies
  pip install -r requirements.txt

  # Analyzer module dependencies
  pip install -r app/analyser/requirements.txt
  ```
- Environment Configuration
  Create a `.env` file in the root directory:
  ```bash
  # Instagram Authentication
  SESSION_FILE="session.json"
  USERNAME="[email protected]"
  PASSWORD="your_instagram_password"
  TARGET_USER="ieeemsit"

  # Google AI API Keys (Multiple for load balancing)
  GEMINI_API_KEY_1="your_gemini_api_key_1"
  GEMINI_API_KEY_2="your_gemini_api_key_2"

  # Cloudinary Configuration
  CLOUDINARY_CLOUD_NAME="your_cloudinary_cloud_name"
  CLOUDINARY_API_KEY="your_cloudinary_api_key"
  CLOUDINARY_API_SECRET="your_cloudinary_api_secret"

  # MongoDB Atlas Configuration
  MONGODB_URI="mongodb+srv://username:[email protected]/"
  MONGODB_DATABASE_NAME="ieeemsit"
  ```
- Run the Application
  ```bash
  # Start FastAPI server
  cd app
  uvicorn app:app --reload --host 0.0.0.0 --port 8000

  # Alternative: Run scraper independently
  python working.py
  ```
- API Access
  The application will be running at `http://localhost:8000`.
  - API Documentation: `http://localhost:8000/docs`
  - Event Analysis Endpoint: `POST /analyse/jsons`
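
The numbered `GEMINI_API_KEY_n` variables from the environment configuration step suggest one simple way to implement the round-robin key distribution listed for `APIKeyManager`. The sketch below is an illustration under assumptions, not the project's code: `load_gemini_keys` and `rotate` are hypothetical names, and stopping at the first missing index is an assumed convention.

```python
import itertools
import os
from typing import Iterator, List, Mapping


def load_gemini_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """Collect GEMINI_API_KEY_1, GEMINI_API_KEY_2, ... until one is missing."""
    keys: List[str] = []
    for index in itertools.count(1):
        value = env.get(f"GEMINI_API_KEY_{index}")
        if not value:
            break  # assumed convention: numbering has no gaps
        keys.append(value)
    return keys


def rotate(keys: List[str]) -> Iterator[str]:
    """Return an endless round-robin iterator over the configured keys."""
    if not keys:
        raise ValueError("no GEMINI_API_KEY_n variables are set")
    return itertools.cycle(keys)


if __name__ == "__main__":
    demo_env = {"GEMINI_API_KEY_1": "key-a", "GEMINI_API_KEY_2": "key-b"}
    key_cycle = rotate(load_gemini_keys(demo_env))
    # Each request takes the next key, wrapping around after the last one.
    print([next(key_cycle) for _ in range(3)])
```

Failover could build on the same iterator by calling `next()` again when a request fails with a rate-limit error.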
POST /analyse/jsons
Processes a batch of Instagram posts and returns classified event data.
```json
{
  "json_list": [
    {
      "Post Image": "https://instagram.com/image_url",
      "Post Caption": "Join us for our upcoming workshop...",
      "Post Date": "2025-01-29"
    }
  ]
}
```

Response:

```json
[
  {
    "title": "AI Workshop 2025",
    "type": "workshop",
    "category": "ai",
    "status": "upcoming",
    "startDate": "2025-02-15",
    "venue": "MSIT Campus",
    "registrationType": "free",
    "isRelevant": true,
    "cloudinary_url": "https://res.cloudinary.com/...",
    "description": "Comprehensive workshop details..."
  }
]
```

- Issue Management
  All development begins with issue creation. Browse Issues or create new ones using our templates.
- Branch Strategy
  Create feature branches following the `[type]/[description]` convention:
  ```bash
  git checkout -b feat/ai-model-upgrade
  git checkout -b fix/duplicate-detection-bug
  git checkout -b docs/api-documentation-update
  ```
- Development Standards
  - Write comprehensive docstrings for all functions
  - Implement error handling for all external API calls
  - Add type hints for improved code maintainability
  - Follow PEP 8 style guidelines
- Testing Requirements
  ```bash
  # Run unit tests
  pytest tests/

  # Run integration tests
  pytest tests/integration/

  # Performance testing
  pytest tests/performance/
  ```
- Pull Request Process
  - Provide detailed PR descriptions with testing evidence
  - Include performance impact analysis
  - Ensure all CI/CD checks pass
  - Request review from maintainers
- Type Safety: Full type annotation coverage
- Error Handling: Comprehensive exception management
- Documentation: Inline comments and API documentation
- Performance: Async/await patterns for I/O operations
- Security: Input validation and sanitization
Stay connected with IEEE MSIT's innovation and automation initiatives:
Contact: [email protected] | Phone: +91-11-2681-4816
Website: ieeemsit.vercel.app
Meet the engineering team behind this project:

| Name | Role |
|---|---|
| Aneesh Ahuja | PR Lead RAS |
| Rajveer Singh | Vice Chairperson - Web Dev |