# DataStreamX: Real-time data streaming powered by Airflow, Kafka, and Spark

This project showcases a real-time data processing pipeline that combines modern technologies for efficient data ingestion, streaming, processing, and storage:

- **Ingestion.** Data enters the pipeline through an API, and incoming records are first persisted in a PostgreSQL database (see the sketches after this list).
- **Orchestration.** Apache Airflow orchestrates the entire workflow, managing the streaming of data from the API into Apache Kafka.
- **Streaming.** Kafka serves as the central hub for streaming data, distributing it to its consumers, with ZooKeeper handling Kafka's configuration management and broker coordination. A Schema Registry manages the data schemas across Kafka topics to keep producers and consumers compatible, and a Control Center provides monitoring and management of the Kafka cluster.
- **Processing.** Apache Spark consumes the stream and processes it in real time across multiple worker nodes, giving distributed, scalable computation.
- **Storage.** Processed data is written to Cassandra, a distributed NoSQL database that provides high availability and easy access to the results for further analysis.
- **Deployment.** The entire architecture is containerized with Docker, which simplifies deployment and scaling for large-scale, real-time workloads.
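A possible shape for the ingestion step, using `psycopg2` to land raw API payloads in PostgreSQL before they are streamed onward. The table name, connection settings, and sample payload here are illustrative assumptions, not taken from the repository:

```python
# Minimal sketch: persist a raw API payload into PostgreSQL.
# Hostname, database, credentials, and table are assumed values.
import json

import psycopg2

conn = psycopg2.connect(
    host="postgres", dbname="pipeline", user="airflow", password="airflow"
)
with conn, conn.cursor() as cur:
    # Keep the raw payload as JSONB so downstream steps can reshape it freely.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (id SERIAL PRIMARY KEY, payload JSONB)"
    )
    cur.execute(
        "INSERT INTO raw_events (payload) VALUES (%s)",
        (json.dumps({"id": "1", "name": "Ada", "email": "ada@example.com"}),),
    )
conn.close()
```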
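The Airflow leg could be a small DAG whose task fetches from the API and publishes each record to Kafka with `kafka-python`. The DAG id, API URL, broker address, and topic name below are hypothetical stand-ins:

```python
# Sketch of an Airflow DAG driving the API -> Kafka leg of the pipeline.
import json
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from kafka import KafkaProducer


def stream_api_to_kafka():
    """Fetch a record from the source API and publish it to Kafka."""
    producer = KafkaProducer(
        bootstrap_servers=["broker:29092"],  # assumed broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    response = requests.get("https://api.example.com/data")  # placeholder API
    response.raise_for_status()
    producer.send("ingest_topic", response.json())  # assumed topic name
    producer.flush()


with DAG(
    dag_id="api_to_kafka_stream",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_api_to_kafka",
        python_callable=stream_api_to_kafka,
    )
```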
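To keep producers and consumers compatible, a schema for the topic's value can be registered with the Schema Registry. A minimal sketch with the `confluent-kafka` client, assuming the registry runs at `schema-registry:8081` and the subject follows the usual `<topic>-value` naming; the schema fields are illustrative:

```python
# Register an Avro schema for the topic's value with the Schema Registry.
import json

from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

client = SchemaRegistryClient({"url": "http://schema-registry:8081"})  # assumed URL

avro_schema = Schema(
    schema_str=json.dumps({
        "type": "record",
        "name": "Record",
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "name", "type": "string"},
            {"name": "email", "type": "string"},
        ],
    }),
    schema_type="AVRO",
)

# Subjects conventionally follow the "<topic>-value" naming strategy.
schema_id = client.register_schema("ingest_topic-value", avro_schema)
print(f"Registered schema id: {schema_id}")
```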
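On the processing side, a hedged sketch of a Spark Structured Streaming job that reads the Kafka topic, parses the JSON payload, and appends each micro-batch to Cassandra. It assumes the `spark-sql-kafka` and `spark-cassandra-connector` packages are on the classpath; the message schema, hostnames, and keyspace/table names are illustrative:

```python
# Sketch: Kafka -> Spark Structured Streaming -> Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("KafkaToCassandra")
    .config("spark.cassandra.connection.host", "cassandra")  # assumed hostname
    .getOrCreate()
)

# Expected shape of the JSON messages; adjust to the real payload.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("email", StringType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:29092")  # assumed broker address
    .option("subscribe", "ingest_topic")                # assumed topic name
    .option("startingOffsets", "earliest")
    .load()
)

parsed = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("data"))
    .select("data.*")
)

def write_batch(batch_df, batch_id):
    # Reuse the connector's batch writer for each micro-batch.
    (
        batch_df.write.format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="pipeline", table="records")  # assumed names
        .save()
    )

query = (
    parsed.writeStream
    .foreachBatch(write_batch)
    .option("checkpointLocation", "/tmp/checkpoints")   # assumed path
    .start()
)
query.awaitTermination()
```

`foreachBatch` is used here because it reuses the connector's batch writer, which tends to be the most portable way to sink a stream into Cassandra across connector versions.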
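Finally, the Cassandra keyspace and table the Spark job writes to might be created with the DataStax Python driver; the replication settings and column layout are assumptions matching the sketches above:

```python
# One-time setup of the Cassandra keyspace and table.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra"])  # assumed container hostname
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS pipeline
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS pipeline.records (
        id TEXT PRIMARY KEY,
        name TEXT,
        email TEXT
    )
""")
cluster.shutdown()
```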