A high-performance data engineering pipeline built with modern big data tools to collect, process, and prepare news data for machine learning models, specifically for sentiment analysis.
This project implements an end-to-end data pipeline that extracts two kinds of data: (1) CSV-based news text for sentiment analysis and (2) numerical financial data.
It ingests raw news data, performs distributed processing and transformation with PySpark, orchestrates workflows with Apache Airflow, and produces clean, structured datasets ready for training ML models that extract sentiment insights from news content and can generate financial signals.
- Multi-Environment Support: Run pipelines locally, on Databricks, or using Docker containers
- Scalable Architecture: Leverages PySpark for distributed processing of large news datasets
- Workflow Orchestration: Apache Airflow for scheduling and monitoring pipeline execution
- Cloud Integration: Designed to work with AWS cloud services
- Production-Ready: Includes Docker support for containerized deployment
- Apache Airflow - Workflow orchestration and scheduling
- PySpark/Spark - Distributed data processing engine
- Databricks - Cloud-based Spark platform (optional)
- AWS - Cloud infrastructure and services
- Python - Primary programming language (81.4% of codebase)
- Docker - Containerization for reproducible environments
- Shell Scripts - Automation and pipeline execution (15.5%)
```
Data_Engineering_for_News_Sentiment_Analysis/
├── Airflow/                                    # Apache Airflow DAGs and configurations
├── Data_Pipeline_Simple/                       # Simplified pipeline implementation
├── Data_pipeline_PySpark/                      # PySpark-based data processing pipelines
├── Databricks/                                 # Databricks-specific notebooks and code
├── Querying/                                   # Data querying examples and utilities
├── .gitattributes                              # Git configuration
├── .gitignore                                  # Git ignore rules
├── Dockerfile                                  # Docker container definition
├── docker-compose.yaml                         # Multi-container Docker setup
├── LICENSE                                     # Project license
├── pipeline_runner_docker.sh                   # Script to run pipeline with Docker
├── pipeline_runner_locally_without_docker.sh   # Local execution script
├── requirements.txt                            # Python dependencies
└── aggregate_example.py                        # Example aggregation operations
```
- Python 3.8+
- Apache Spark (for local execution)
- Docker and Docker Compose (for containerized deployment)
- Apache Airflow (included in Docker setup)
```shell
# Clone the repository
git clone https://github.com/imbilalbutt/Data_Engineering_for_News_Sentiment_Analysis.git
cd Data_Engineering_for_News_Sentiment_Analysis

# Install dependencies
pip install -r requirements.txt

# Run the pipeline locally
./pipeline_runner_locally_without_docker.sh
```

```shell
# Build and start containers
docker-compose up -d

# Run the pipeline using Docker
./pipeline_runner_docker.sh
```

- Upload the notebooks from the `Databricks/` directory to your Databricks workspace
- Configure a cluster with an appropriate Spark version
- Set up data sources and execute the notebooks
The data pipeline follows these key stages:
- Data Ingestion: Collect news data from various sources
- Data Processing: Clean, transform, and enrich raw data using PySpark
- Feature Engineering: Extract relevant features for sentiment analysis
- Data Storage: Store processed data in optimized formats for ML consumption
- Workflow Orchestration: Apache Airflow manages pipeline dependencies and scheduling
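The stage boundaries above can be sketched as plain Python functions (hypothetical column names and a toy in-memory record; the actual pipelines in `Data_pipeline_PySpark/` apply the equivalent steps to Spark DataFrames):

```python
import re

# Stage 2 (processing): clean a raw record; assumes a {"source", "headline"} schema
def clean_record(record):
    text = record["headline"].strip().lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)  # drop punctuation and symbols
    return {**record, "headline": text}

# Stage 3 (feature engineering): derive simple features for a sentiment model
def extract_features(record):
    tokens = record["headline"].split()
    return {**record, "tokens": tokens, "num_tokens": len(tokens)}

# Stages chained in the order the orchestrator would schedule them
raw = [{"source": "wire", "headline": "  Markets Rally!! "}]
processed = [extract_features(clean_record(r)) for r in raw]
print(processed[0]["headline"])    # "markets rally"
print(processed[0]["num_tokens"])  # 2
```

In the real pipeline each function body would be a PySpark transformation and the output would be written to an optimized format (e.g., Parquet) for ML consumption.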
```python
# Example from aggregate_example.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, avg

# Initialize Spark session
spark = SparkSession.builder.appName("NewsSentimentAnalysis").getOrCreate()

# Perform aggregations on news data
# ... aggregation logic here
```

Access the Airflow web UI (typically at http://localhost:8080) to:
- Monitor pipeline execution
- Trigger manual runs
- View task dependencies and logs
- Schedule regular pipeline execution
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the terms specified in the LICENSE file.
- Last Updated: December 16, 2025
- Recent Activity: Bug fixes and updates to Airflow DAGs, PySpark pipelines, and Docker configurations
- Total Commits: 14
- Primary Language: Python (81.4%)
Project Link: https://github.com/imbilalbutt/Data_Engineering_for_News_Sentiment_Analysis
This project demonstrates a production-ready data engineering pipeline for preparing news data for sentiment analysis ML models using industry-standard tools and practices.