
Data Engineering for News and Financial Sentiment Analysis

A high-performance data engineering pipeline built with modern big data tools to collect, process, and prepare news data for machine learning models, specifically for sentiment analysis.

📋 Project Overview

This project implements an end-to-end data pipeline that extracts two kinds of data: (1) CSV-based news text for sentiment analysis and (2) numerical financial data.

It ingests raw news data, performs distributed processing and transformation with PySpark, orchestrates workflows with Apache Airflow, and produces clean, structured datasets ready for training ML models that extract sentiment insights from news content and can generate financial signals.

Key Features:

  • Multi-Environment Support: Run pipelines locally, on Databricks, or using Docker containers
  • Scalable Architecture: Leverages PySpark for distributed processing of large news datasets
  • Workflow Orchestration: Apache Airflow for scheduling and monitoring pipeline execution
  • Cloud Integration: Designed to work with AWS cloud services
  • Production-Ready: Includes Docker support for containerized deployment

🛠️ Technology Stack

  • Apache Airflow - Workflow orchestration and scheduling
  • PySpark/Spark - Distributed data processing engine
  • Databricks - Cloud-based Spark platform (optional)
  • AWS - Cloud infrastructure and services
  • Python - Primary programming language (81.4% of codebase)
  • Docker - Containerization for reproducible environments
  • Shell Scripts - Automation and pipeline execution (15.5%)

📁 Project Structure

Data_Engineering_for_News_Sentiment_Analysis/
├── Airflow/                    # Apache Airflow DAGs and configurations
├── Data_Pipeline_Simple/       # Simplified pipeline implementation
├── Data_pipeline_PySpark/      # PySpark-based data processing pipelines
├── Databricks/                 # Databricks-specific notebooks and code
├── Querying/                   # Data querying examples and utilities
├── .gitattributes              # Git configuration
├── .gitignore                  # Git ignore rules
├── Dockerfile                  # Docker container definition
├── docker-compose.yaml         # Multi-container Docker setup
├── LICENSE                     # Project license
├── pipeline_runner_docker.sh   # Script to run pipeline with Docker
├── pipeline_runner_locally_without_docker.sh  # Local execution script
├── requirements.txt            # Python dependencies
└── aggregate_example.py        # Example aggregation operations

🚀 Getting Started

Prerequisites

  • Python 3.8+
  • Apache Spark (for local execution)
  • Docker and Docker Compose (for containerized deployment)
  • Apache Airflow (included in Docker setup)

Installation & Setup

Option 1: Local Execution (Without Docker)

# Clone the repository
git clone https://github.com/imbilalbutt/Data_Engineering_for_News_Sentiment_Analysis.git
cd Data_Engineering_for_News_Sentiment_Analysis

# Install dependencies
pip install -r requirements.txt

# Run the pipeline locally
./pipeline_runner_locally_without_docker.sh

Option 2: Docker Deployment

# Build and start containers
docker-compose up -d

# Run the pipeline using Docker
./pipeline_runner_docker.sh

Option 3: Databricks Deployment

  1. Upload the notebooks from the Databricks/ directory to your Databricks workspace
  2. Configure cluster with appropriate Spark version
  3. Set up data sources and execute notebooks

🔧 Pipeline Architecture

The data pipeline follows these key stages:

  1. Data Ingestion: Collect news data from various sources
  2. Data Processing: Clean, transform, and enrich raw data using PySpark
  3. Feature Engineering: Extract relevant features for sentiment analysis
  4. Data Storage: Store processed data in optimized formats for ML consumption
  5. Workflow Orchestration: Apache Airflow manages pipeline dependencies and scheduling
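
The cleaning and feature-engineering stages above can be sketched as follows. The snippet uses plain Python for brevity (the actual pipeline uses PySpark), and the lexicon and field names are illustrative assumptions, not taken from the repository:

```python
import re

def clean_text(raw: str) -> str:
    """Normalize a raw headline: lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return re.sub(r"\s+", " ", text).strip()

# Tiny illustrative lexicon, standing in for a real sentiment model
POSITIVE = {"gain", "surge", "beat", "record"}
NEGATIVE = {"loss", "drop", "miss", "fall"}

def sentiment_features(headline: str) -> dict:
    """Extract simple count-based features for a downstream sentiment model."""
    tokens = clean_text(headline).split()
    return {
        "n_tokens": len(tokens),
        "n_positive": sum(t in POSITIVE for t in tokens),
        "n_negative": sum(t in NEGATIVE for t in tokens),
    }

print(sentiment_features("Shares Surge to Record High, Beat Estimates!"))
# → {'n_tokens': 7, 'n_positive': 3, 'n_negative': 0}
```

In the real pipeline the same logic would be expressed as PySpark column expressions or a UDF so it runs distributed across the cluster.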

📊 Usage Examples

Running Basic Aggregations

# Example based on aggregate_example.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, avg

# Initialize Spark session
spark = SparkSession.builder.appName("NewsSentimentAnalysis").getOrCreate()

# Load processed news data (path and column names are illustrative)
news_df = spark.read.parquet("data/processed/news")

# Aggregate article counts and average sentiment score per news source
agg_df = news_df.groupBy("source").agg(
    count("*").alias("article_count"),
    avg("sentiment_score").alias("avg_sentiment"),
)
agg_df.show()

Executing Airflow DAGs

Access the Airflow web UI (typically at http://localhost:8080) to:

  • Monitor pipeline execution
  • Trigger manual runs
  • View task dependencies and logs
  • Schedule regular pipeline execution
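
For orientation, a minimal Airflow DAG for a pipeline like this might look as follows. This is a sketch only: the DAG id, task names, commands, and schedule are illustrative assumptions, not the contents of this repository's Airflow/ directory:

```python
# Illustrative DAG sketch; ids, commands, and schedule are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="news_sentiment_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest", bash_command="python ingest.py")
    process = BashOperator(task_id="process", bash_command="spark-submit process.py")

    # Ingestion must finish before distributed processing starts
    ingest >> process
```

Placing a file like this in the DAGs folder makes it appear in the Airflow web UI, where it can be triggered manually or left to run on its schedule.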

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the terms specified in the LICENSE file.

📈 Project Status

  • Last Updated: December 16, 2025
  • Recent Activity: Bug fixes and updates to Airflow DAGs, PySpark pipelines, and Docker configurations
  • Total Commits: 14
  • Primary Language: Python (81.4%)

📬 Contact

Project Link: https://github.com/imbilalbutt/Data_Engineering_for_News_Sentiment_Analysis


This project demonstrates a production-ready data engineering pipeline for preparing news data for sentiment analysis ML models using industry-standard tools and practices.
