A high-performance data engineering pipeline built with modern big data tools to collect, process, and prepare news data for machine learning models, specifically for sentiment analysis.
This project implements an end-to-end data pipeline that extracts two kinds of data: (1) CSV-based news text for sentiment analysis and (2) numerical financial data.
It ingests raw news data, performs distributed processing and transformation with PySpark, orchestrates workflows with Apache Airflow, and produces clean, structured datasets ready for training ML models that extract sentiment insights from news content and can generate financial signals.
- Multi-Environment Support: Run pipelines locally, on Databricks, or using Docker containers
- Scalable Architecture: Leverages PySpark for distributed processing of large news datasets
- Workflow Orchestration: Apache Airflow for scheduling and monitoring pipeline execution
- Cloud Integration: Designed to work with AWS cloud services
- Production-Ready: Includes Docker support for containerized deployment
- Apache Airflow - Workflow orchestration and scheduling
- PySpark/Spark - Distributed data processing engine
- Databricks - Cloud-based Spark platform (optional)
- AWS - Cloud infrastructure and services
- Python - Primary programming language (81.4% of codebase)
- Docker - Containerization for reproducible environments
- Shell Scripts - Automation and pipeline execution (15.5%)
```
Data_Engineering_for_News_Sentiment_Analysis/
├── Airflow/                                    # Apache Airflow DAGs and configurations
├── Data_Pipeline_Simple/                       # Simplified pipeline implementation
├── Data_pipeline_PySpark/                      # PySpark-based data processing pipelines
├── Databricks/                                 # Databricks-specific notebooks and code
├── Querying/                                   # Data querying examples and utilities
├── .gitattributes                              # Git configuration
├── .gitignore                                  # Git ignore rules
├── Dockerfile                                  # Docker container definition
├── docker-compose.yaml                         # Multi-container Docker setup
├── LICENSE                                     # Project license
├── pipeline_runner_docker.sh                   # Script to run pipeline with Docker
├── pipeline_runner_locally_without_docker.sh   # Local execution script
├── requirements.txt                            # Python dependencies
└── aggregate_example.py                        # Example aggregation operations
```
- Python 3.8+
- Apache Spark (for local execution)
- Docker and Docker Compose (for containerized deployment)
- Apache Airflow (included in Docker setup)
```shell
# Clone the repository
git clone https://github.com/imbilalbutt/Data_Engineering_for_News_Sentiment_Analysis.git
cd Data_Engineering_for_News_Sentiment_Analysis

# Install dependencies
pip install -r requirements.txt

# Run the pipeline locally
./pipeline_runner_locally_without_docker.sh
```

```shell
# Build and start containers
docker-compose up -d

# Run the pipeline using Docker
./pipeline_runner_docker.sh
```

- Upload the notebooks from the `Databricks/` directory to your Databricks workspace
- Configure a cluster with an appropriate Spark version
- Set up data sources and execute the notebooks
The data pipeline follows these key stages:
- Data Ingestion: Collect news data from various sources
- Data Processing: Clean, transform, and enrich raw data using PySpark
- Feature Engineering: Extract relevant features for sentiment analysis
- Data Storage: Store processed data in optimized formats for ML consumption
- Workflow Orchestration: Apache Airflow manages pipeline dependencies and scheduling
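The stage boundaries above can be sketched as plain Python functions (hypothetical column names and a toy in-memory record; the actual pipelines in `Data_pipeline_PySpark/` apply the equivalent steps to Spark DataFrames):

```python
import re

# Stage 2 (processing): clean a raw record; assumes a {"source", "headline"} schema
def clean_record(record):
    text = record["headline"].strip().lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)  # drop punctuation and symbols
    return {**record, "headline": text}

# Stage 3 (feature engineering): derive simple features for a sentiment model
def extract_features(record):
    tokens = record["headline"].split()
    return {**record, "tokens": tokens, "num_tokens": len(tokens)}

# Stages chained in the order the orchestrator would schedule them
raw = [{"source": "wire", "headline": "  Markets Rally!! "}]
processed = [extract_features(clean_record(r)) for r in raw]
print(processed[0]["headline"])    # "markets rally"
print(processed[0]["num_tokens"])  # 2
```

In the real pipeline each function body would be a PySpark transformation and the output would be written to an optimized format (e.g., Parquet) for ML consumption.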
```python
# Example from aggregate_example.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, avg

# Initialize Spark session
spark = SparkSession.builder.appName("NewsSentimentAnalysis").getOrCreate()

# Perform aggregations on news data
# ... aggregation logic here
```

Access the Airflow web UI (typically at http://localhost:8080) to:
- Monitor pipeline execution
- Trigger manual runs
- View task dependencies and logs
- Schedule regular pipeline execution
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the terms specified in the LICENSE file.
- Last Updated: December 16, 2025
- Recent Activity: Bug fixes and updates to Airflow DAGs, PySpark pipelines, and Docker configurations
- Total Commits: 14
- Primary Language: Python (81.4%)
Project Link: https://github.com/imbilalbutt/Data_Engineering_for_News_Sentiment_Analysis
This project demonstrates a production-ready data engineering pipeline for preparing news data for sentiment analysis ML models using industry-standard tools and practices.