Gagan-KM/DataStreamX-Real-time-data-streaming-powered-by-Airflow-Kafka-and-Spark.

DataStreamX

DataStreamX is a real-time data processing pipeline that covers ingestion, streaming, processing, and storage. Data enters the pipeline through an API, and the incoming data is stored in a PostgreSQL database. Apache Airflow orchestrates the workflow and manages the streaming of data from the API to Apache Kafka.

Kafka is the central hub for streaming data and distributes it to its consumers, while ZooKeeper handles Kafka's configuration management and broker coordination. A Control Centre monitors and manages the Kafka cluster, and a Schema Registry manages the data schemas across Kafka topics to keep data compatible throughout the pipeline.

Apache Spark consumes the stream from Kafka and processes it in real time across multiple worker nodes, which keeps processing distributed and scalable. The processed data is then stored in Cassandra, a distributed NoSQL database that provides high availability and keeps the results accessible for further analysis.

The entire architecture is containerized with Docker, which simplifies deployment and scaling.
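The ingestion step described above, where an Airflow task takes records from the API and streams them to Kafka, can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the field names (`id`, `first_name`, `last_name`, `email`), the topic name `events_raw`, and the helpers `format_record` and `stream_to_topic` are all hypothetical. The producer is passed in as a plain callable so the sketch runs without a broker.

```python
import json
from typing import Callable, Dict, Iterable


def format_record(raw: Dict) -> Dict:
    """Flatten one API record into the shape sent to Kafka.

    The field names here are assumptions made for illustration only.
    """
    return {
        "id": str(raw["id"]),
        "name": f'{raw["first_name"]} {raw["last_name"]}',
        "email": raw["email"],
    }


def stream_to_topic(
    records: Iterable[Dict],
    send: Callable[[str, bytes], object],
    topic: str = "events_raw",  # hypothetical topic name
) -> int:
    """Serialize each record as UTF-8 JSON and hand it to `send`.

    `send` is any callable taking (topic, payload_bytes); injecting it
    keeps the function independent of a specific Kafka client.
    Returns the number of records sent.
    """
    sent = 0
    for raw in records:
        payload = json.dumps(format_record(raw)).encode("utf-8")
        send(topic, payload)
        sent += 1
    return sent
```

In a real deployment, `send` could be the `send` method of a `kafka-python` `KafkaProducer`, and `stream_to_topic` could be wrapped in the callable of an Airflow `PythonOperator`. Both choices are assumptions here, since the source does not say which client library or operator the project uses.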
