- Project Overview
- Technologies Used
- Data Pipeline
- Repository Structure
- Dashboard
- Acknowledgments
- Conclusion
- Contacts
This project implements a real-time data pipeline using Apache Kafka, Python's psutil library for metric collection, and PostgreSQL for data storage. It collects host metrics (CPU, memory, interrupts, network I/O, disk usage) via psutil, streams them through Kafka, and stores them in PostgreSQL using SQLAlchemy. A materialized view aggregates metrics over 5-minute windows, refreshed by Airflow. Grafana provides a real-time dashboard with threshold-based alerts for monitoring.
- FastAPI: Developed internal APIs to expose data endpoints consumed by Airflow DAGs.
- Python: Utilized the psutil library for collecting metrics data and Kafka Python client for producing and consuming messages.
- Airflow: Orchestrated workflows that fetch metrics via FastAPI and process streaming data from Kafka.
- Apache Kafka: Implemented a distributed streaming platform to handle real-time data processing and communication between producers and consumers.
- Apache Zookeeper: Used for coordinating and managing Kafka brokers.
- Control Center: Provided a UI dashboard to monitor the data flow between producers, topics, and consumers.
- Postgres: Stored and managed the collected metrics data in a relational database.
- Grafana: Connected to the Postgres database to visualize real-time metrics and create the dashboard.
- Slack Webhook: Sent Airflow logs and Grafana alerts to Slack for real-time monitoring and incident response.
The data pipeline is structured as follows:

**Data Ingestion**
- Metrics are collected on the local host via a FastAPI endpoint using the `psutil` library.
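
A minimal sketch of such an endpoint; the `/metrics` route and field names are illustrative assumptions, not the project's exact schema:

```python
# Illustrative metrics endpoint; route and field names are assumptions.
import psutil
from fastapi import FastAPI

app = FastAPI()

@app.get("/metrics")
def collect_metrics() -> dict:
    """Return a one-shot snapshot of host metrics via psutil."""
    net = psutil.net_io_counters()
    return {
        "cpu_usage": psutil.cpu_percent(interval=None),   # percent since last call
        "memory_usage": psutil.virtual_memory().percent,  # percent of RAM in use
        "interrupts": psutil.cpu_stats().interrupts,      # cumulative interrupt count
        "bytes_sent": net.bytes_sent,                     # cumulative network bytes out
        "bytes_recv": net.bytes_recv,                     # cumulative network bytes in
        "disk_usage": psutil.disk_usage("/").percent,     # percent of root volume used
    }
```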

**Data Production**
- The FastAPI service serializes each metric snapshot as JSON and publishes it into the Kafka topic `Tracking` using Python's `kafka-python` client.
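
A minimal producer sketch with `kafka-python`, assuming a broker at `localhost:9092`; the snapshot dict mirrors the endpoint payload above:

```python
# Hedged sketch: publish one JSON-serialized snapshot to the "Tracking" topic.
import json

import psutil
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

snapshot = {
    "cpu_usage": psutil.cpu_percent(interval=None),
    "memory_usage": psutil.virtual_memory().percent,
}
producer.send("Tracking", value=snapshot)
producer.flush()  # block until the broker acknowledges the record
```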

**Bronze Layer (Raw Storage)**
- A Python Kafka consumer (built with SQLAlchemy) reads from `Tracking` and writes every raw JSON record into the `bronze.bronze_performance` table in PostgreSQL.
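
A consumer sketch under the same assumptions; the single-`payload` column layout of `bronze.bronze_performance` and the connection string are guesses for illustration:

```python
# Drain the "Tracking" topic into the bronze table, one row per raw record.
import json

from kafka import KafkaConsumer
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/metrics")
consumer = KafkaConsumer(
    "Tracking",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",  # replay from the start on first run
)

for message in consumer:
    with engine.begin() as conn:  # one committed transaction per record
        conn.execute(
            text("INSERT INTO bronze.bronze_performance (payload) VALUES (:payload)"),
            {"payload": json.dumps(message.value)},
        )
```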

**Silver Layer (Cleansing & Normalization)**
- Immediately after insertion, a transform routine filters out any records with null or out-of-range values, normalizes percentages (e.g. divides `cpu_usage` by 100), converts byte fields as needed, and writes the cleaned data into `silver.silver_performance`.
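
A sketch of the cleansing logic; the required fields, valid ranges, and unit conversions are assumptions standing in for the real rules:

```python
from typing import Optional

def clean(record: dict) -> Optional[dict]:
    """Return a normalized record, or None if it should be dropped."""
    required = ("cpu_usage", "memory_usage", "bytes_sent", "bytes_recv")
    if any(record.get(key) is None for key in required):
        return None  # drop records with null fields
    if not 0 <= record["cpu_usage"] <= 100:
        return None  # drop out-of-range percentages
    return {
        **record,
        "cpu_usage": record["cpu_usage"] / 100,        # percent -> fraction
        "memory_usage": record["memory_usage"] / 100,  # percent -> fraction
        "bytes_sent_mb": record["bytes_sent"] / 1_048_576,  # bytes -> MiB
        "bytes_recv_mb": record["bytes_recv"] / 1_048_576,
    }
```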

**Gold Layer (Aggregation)**
- A materialized view `gold.mv_perf_5min_summary` aggregates the silver data into 5-minute windows, computing:
  - average, max, and 95th percentile for CPU & memory
  - total bytes sent/received
  - anomaly flags when metrics exceed predefined thresholds
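
A sketch of the rollup DDL issued through SQLAlchemy; the timestamp column `recorded_at`, the metric column names, and the 0.9 anomaly threshold are assumptions:

```python
# Create the 5-minute summary view over the silver table (names assumed).
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/metrics")

ddl = """
CREATE MATERIALIZED VIEW IF NOT EXISTS gold.mv_perf_5min_summary AS
SELECT
    to_timestamp(floor(extract(epoch FROM recorded_at) / 300) * 300) AS window_start,
    AVG(cpu_usage) AS cpu_avg,
    MAX(cpu_usage) AS cpu_max,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY cpu_usage) AS cpu_p95,
    AVG(memory_usage) AS mem_avg,
    MAX(memory_usage) AS mem_max,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY memory_usage) AS mem_p95,
    SUM(bytes_sent) AS total_bytes_sent,
    SUM(bytes_recv) AS total_bytes_recv,
    BOOL_OR(cpu_usage > 0.9) AS cpu_anomaly  -- flag windows breaching the threshold
FROM silver.silver_performance
GROUP BY window_start
"""
with engine.begin() as conn:
    conn.execute(text(ddl))
```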

**Orchestration & Alerting**
- An Airflow DAG runs every 30 minutes to refresh the materialized view.
- The DAG is configured with email and Slack alerts on failure or SLA miss to ensure pipeline health.
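
A minimal DAG sketch, assuming Airflow 2.x with the Postgres provider installed; the `dag_id` and connection ID `metrics_db` are illustrative:

```python
# Refresh the gold materialized view on a 30-minute schedule.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="refresh_gold_summary",
    start_date=datetime(2024, 1, 1),
    schedule_interval=timedelta(minutes=30),  # every 30 minutes
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    PostgresOperator(
        task_id="refresh_mv",
        postgres_conn_id="metrics_db",  # assumed Airflow connection ID
        sql="REFRESH MATERIALIZED VIEW gold.mv_perf_5min_summary;",
        # Email/Slack alerting hooks in via default_args ("email") and
        # on_failure_callback, omitted here for brevity.
    )
```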

**Visualization & Monitoring**
- Grafana connects to the PostgreSQL data source, queries the `silver_performance` table for sub-minute panels and the gold materialized view for 5-minute summaries, and renders live time-series dashboards (CPU, memory, network I/O, disk) with threshold-based alerting.
├── Dockerfile
├── app
│ └── main.py
├── data
│ └── airflow
│ ├── config
│ ├── data
│ └── plugins
├── docker-compose.yaml
├── load_dataset_into_postgres
├── pipeline
│ └── dags
│ ├── pipelines.py
│ └── test.py
├── requirements.txt
└── scripts
└── pj
├── __init__.py
├── consumer.py
├── monitor.py
└── producer.pyA seamless real‐time pipeline that captures every CPU, memory, network, and disk metric, cleans and normalizes it through Bronze→Silver tables, and then rolls up 5-minute aggregates in a Gold materialized view. Airflow quietly refreshes the view on schedule, Kafka guarantees no data loss, and Grafana brings it all to life with live charts and alerts the moment anything crosses my thresholds. It’s an end-to-end solution I can trust today and extend tomorrow as my infrastructure grows.
For any information, please contact:
- Email: lecongkhanh242003@gmail.com
- LinkedIn: Here

