AIOps Copilot is an end-to-end playground for anomaly detection, root-cause analysis, and observability on time series from distributed services. The stack ships with a FastAPI backend, a Streamlit dashboard, persistence (SQLite/Postgres), scripted pipelines, Grafana assets, and test suites.
- Real-time anomaly detection: rolling-median MAD baseline and forecast + conformal prediction, with per-series calibration.
- Root Cause Analysis (RCA): graph propagation with exponential decay, configurable via YAML or OpenTelemetry Tempo traces.
- FastAPI surface: ingest, batch detect, anomaly history, RCA ranking, dependency graph CRUD, Prometheus-style metrics.
- Streamlit UI: overview metrics, service drill-down with conformal bands, interactive dependency graph + RCA table.
- Turn-key datasets: NAB benchmark CSVs, Yahoo S5 miniature set, synthetic simulators, and a pre-built `data/aiops.db`.
- Operational scripts: showcase runners, graph applier, Evidently drift reports, simulators, and dataset seeders.
- CI-ready: lint (Ruff/Black/isort), unit + integration tests (Pytest), Dockerfiles for API and Streamlit.
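The baseline detector's idea can be sketched in a few lines of Python. This is a simplified illustration, not the actual `app/detectors/baseline.py` code (the window size and threshold here are arbitrary): score each point by its robust z-score against the median and MAD of a rolling window of preceding observations.

```python
# Simplified rolling-median + MAD anomaly scoring (illustrative only).
from statistics import median

def mad_scores(values, window=30, threshold=3.5):
    """Return (score, is_anomaly) per point: a robust z-score of each
    observation against the median/MAD of the preceding window."""
    results = []
    for i, y in enumerate(values):
        hist = values[max(0, i - window):i]
        if len(hist) < 5:                       # not enough history yet
            results.append((0.0, False))
            continue
        med = median(hist)
        mad = median(abs(v - med) for v in hist) or 1e-9  # avoid div by zero
        score = 0.6745 * abs(y - med) / mad     # normal-consistent robust z
        results.append((score, score > threshold))
    return results

# A flat series with a single spike: only the spike should be flagged.
series = [10.0] * 20 + [55.0] + [10.0] * 5
flags = [flagged for _, flagged in mad_scores(series)]
```

The MAD is preferred over the standard deviation here because a single outlier in the window barely moves it, so the spike itself does not inflate the baseline.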
```mermaid
flowchart LR
    Clients[[Clients]]
    API[FastAPI API]
    DB[(Postgres / SQLite)]
    Detectors[Detectors]
    RCA[RCA Engine]
    Evidently
    Streamlit[Streamlit Dashboard]
    Grafana
    OTel[[Grafana / OTel / Tempo]]
    Clients -->|HTTP| API
    API -->|Persist| DB
    API -->|Scores| Detectors
    API -->|Propagation| RCA
    API -->|Export| Evidently
    API -->|REST| Streamlit
    Streamlit -->|Dashboards| Grafana
    Streamlit -->|Traces & Metrics| OTel
```
- Python >= 3.12
- pip/virtualenv
- Optional: Docker + Docker Compose, Tempo, Prometheus, Grafana
Clone the repository, ensure `data/aiops.db` is tracked (it contains pre-computed NAB anomalies), then:
```bash
python -m venv .venv
source .venv/bin/activate   # Windows PowerShell: .\.venv\Scripts\Activate.ps1
make setup      # install dependencies (editable mode)
make run-api    # start FastAPI with auto reload
make streamlit  # launch the dashboard (API_BASE_URL defaults to http://localhost:8000)
```
*Streamlit service drill-down with conformal prediction band and anomaly markers.*
`make setup` installs the full dev toolchain, including `pmdarima` and `statsforecast` (AutoARIMA + conformal extras). On Windows, make sure Microsoft C++ Build Tools are available before running it.

The backend loads the NAB dependency graph from `configs/data.yaml` automatically. Existing anomalies and RCA scores stored in `data/aiops.db` become immediately visible in the UI.
If you prefer a runtime-only install (`pip install -e .`), you can add the forecast extras (AutoARIMA + StatsForecast) manually; `make setup` already pulls them in through the dev extras.

```bash
pip install -e .[forecast]
```

On Windows you may need Microsoft C++ Build Tools before installing `pmdarima`.
| Command | Description |
|---|---|
| `make setup` | Upgrade pip and install the project in editable mode with dev extras |
| `make run-api` | Launch FastAPI (auto reload when running a single worker, controlled by `UVICORN_*`) |
| `make streamlit` | Open the Streamlit dashboard (`app/viz/dashboard.py`) |
| `make lint` / `make fmt` | Ruff + Black + isort (check or apply) |
| `make test` | Pytest with coverage (`app` package) |
| `make seed-nab` | Normalise NAB CSVs into `data/nab/` |
| `make nab-detect` | Ingest/detect NAB services (baseline by default) |
| `make nab-showcase` | Full NAB ingest + detect + RCA summary + optional Evidently report |
| `make full-showcase` | Start the API, apply the graph, run the showcase, optionally launch Streamlit |
| `make graph-nab` | Push `configs/graphs/nab.yaml` to the API |
| `make graph-default` | Restore the microservices graph (`configs/graphs/microservices.yaml`) |
| `make simulate` | Generate synthetic metrics and run detection |
| `make report` | Generate an Evidently drift report for `checkout` |
| `make docker-up` | Compose stack: API, Streamlit, Grafana |
| `make docker-down` | Stop and remove containers/volumes |
All CLI flags exposed in scripts are documented via `--help`. `make nab-detect` and `make nab-showcase` accept overrides for the detector (`forecast` or `baseline`), the services list, worker count, chunk size, anomaly limits, and report generation.
| Method & Path | Purpose |
|---|---|
| `GET /health` | Health probe |
| `POST /ingest/series` | Persist measurements for one service (`SeriesIngestRequest`) |
| `POST /detect/batch` | Run a detector (baseline or forecast + conformal) on multiple series |
| `GET /anomalies` | Fetch stored anomalies (filter by `service`, `since`, `limit`) |
| `GET /graph` | Retrieve the dependency graph (nodes + weighted edges) |
| `POST /graph` | Merge/normalise an incoming graph payload |
| `GET /rca/topk` | Ranked RCA scores (default `k=5`) |
| `GET /measurements` | Fetch raw measurements (`service`, optional `since`, `limit`) |
| `GET /metrics` | Prometheus exposition (ingested points, anomaly count, detect latency) |
API schemas live in `app/models/schemas.py`. Settings are controlled via `.env` (see `.env.example`) and YAML files in `configs/`.
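As a rough illustration of the ingest payload shape (field names inferred from the cURL examples below; the real `SeriesIngestRequest` in `app/models/schemas.py` is a Pydantic model and may differ in detail), the request amounts to:

```python
# Stdlib approximation of the ingest schema, for illustration only.
# The actual models are Pydantic classes with full validation.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Point:
    ts: datetime   # ISO 8601 timestamp
    y: float       # metric value

@dataclass
class SeriesIngestRequest:
    service_id: str
    points: list[Point]

req = SeriesIngestRequest(
    service_id="checkout",
    points=[Point(ts=datetime.fromisoformat("2025-01-01T12:00:00+00:00"), y=123.4)],
)
```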
```bash
curl -X POST http://localhost:8000/ingest/series \
  -H "Content-Type: application/json" \
  -d '{
    "service_id": "checkout",
    "points": [
      {"ts": "2025-01-01T12:00:00Z", "y": 123.4},
      {"ts": "2025-01-01T12:01:00Z", "y": 120.1}
    ]
  }'
```
```bash
curl -X POST http://localhost:8000/detect/batch \
  -H "Content-Type: application/json" \
  -d '{
    "series": [
      {
        "service_id": "checkout",
        "points": [
          {"ts": "2025-01-01T12:00:00Z", "y": 123.4},
          {"ts": "2025-01-01T12:01:00Z", "y": 120.1}
        ]
      }
    ],
    "detector": "forecast",
    "alpha": 0.1
  }'
```
```bash
# inspect latest measurements
curl "http://localhost:8000/measurements?service_id=checkout&limit=200"
```

- Ingestion: `/ingest/series` writes rows into `measurements` (SQLAlchemy models in `app/io/writers.py`).
- Detection:
  - Baseline: rolling median + MAD (`app/detectors/baseline.py`).
  - Forecast: AutoARIMA (optional dependency) or Holt-Winters fallback + conformal calibration (`app/detectors/forecast.py`, `app/detectors/conformal.py`).
- Persistence: detected anomalies are stored in the `anomalies` table; conformal ratios are saved in `api_state.local_scores`.
- RCA: weighted directed graph (`app/rca/graph_builder.py`) with exponential-decay ranking (`app/rca/rca_ranker.py`).
- Graph sources: `configs/data.yaml` (default), YAML files under `configs/graphs/`, or Tempo traces (`app/io/otel_tempo.py`) if enabled.
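The conformal step can be pictured as split-conformal calibration. This is a simplified stand-in for `app/detectors/conformal.py`, whose actual interface may differ: take roughly the (1 − alpha) quantile of absolute residuals on a calibration window, then flag points whose forecast error exceeds that half-width.

```python
# Minimal split-conformal interval (illustrative sketch).
import math

def conformal_half_width(calibration_residuals, alpha=0.1):
    """Half-width q such that about (1 - alpha) of calibration residuals
    fall within [-q, q], using the conservative finite-sample rank."""
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]

def is_anomaly(actual, forecast, half_width):
    """A point is anomalous when it falls outside the conformal band."""
    return abs(actual - forecast) > half_width

residuals = [0.1, -0.2, 0.15, 0.05, -0.1, 0.3, -0.25, 0.2, 0.12, -0.08]
q = conformal_half_width(residuals, alpha=0.1)
```

The appeal of the conformal wrapper is that the band's coverage guarantee holds regardless of which forecaster (AutoARIMA or the Holt-Winters fallback) produced the predictions.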
To analyse your own workloads:

1. Prepare the time series: each service metric needs an identifier (`service_id`) and a list of `{ts, y}` points with ISO 8601 timestamps. Ingest in bulk via `/ingest/series` (see cURL above), or adapt `scripts/run_nab_ingest.py` by pointing `--data-dir` to a folder of CSV files (columns `timestamp`/`value` or `ts`/`y`).
2. Run detection: call `/detect/batch`, or reuse `run_nab_ingest.py` with `--services` enumerating the CSV basenames. The `detector` flag toggles baseline vs forecast.
3. Customize the graph: adjust `configs/data.yaml` for static graphs, provide alternative YAML under `configs/graphs/`, or post new edges with `scripts/apply_graph.py --graph <file>`. If Tempo tracing is enabled (`ENABLE_OTEL=true` and `TEMPO_BASE_URL` set), the API merges live traces into the graph.
4. Tune settings: override detection batch sizes, RCA decay, database URLs, etc., through `.env` variables (see `.env.example`).
Once ingested, all anomalies/RCA scores become visible in the Streamlit dashboard and Grafana panels.
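The RCA ranking can be pictured as follows. This is a hedged sketch, not the actual `app/rca/rca_ranker.py` implementation: anomaly scores propagate from anomalous services to their dependencies along weighted edges, attenuated exponentially per hop, and services are ranked by accumulated blame.

```python
# Toy graph-propagation RCA with exponential decay (illustrative only).
from collections import deque

def rank_root_causes(edges, anomaly_scores, decay=0.5, top_k=5):
    """edges: {caller: [(dependency, weight), ...]}.
    Each anomalous service pushes its score downstream to its
    dependencies, scaled by edge weight and decay at every hop."""
    blame = {}
    for src, score in anomaly_scores.items():
        queue = deque([(src, score)])
        seen = {src}
        while queue:
            node, contrib = queue.popleft()
            blame[node] = blame.get(node, 0.0) + contrib
            for dep, weight in edges.get(node, []):
                if dep not in seen:
                    seen.add(dep)
                    queue.append((dep, contrib * weight * decay))
    return sorted(blame.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Two anomalous frontends sharing one dependency: the shared "db"
# accumulates blame from both and ranks first.
edges = {"checkout": [("db", 1.0)], "cart": [("db", 1.0)]}
ranking = rank_root_causes(edges, {"checkout": 1.0, "cart": 1.0}, decay=0.8)
```

The decay factor is what keeps distant, weakly connected services from outranking the directly implicated ones; tuning it (via `.env`, per the settings step above) trades depth of propagation against locality.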
- Overview: raw `/metrics` output, active services list.
- Service: raw measurements overlaid with predictions + conformal band, anomaly markers, key metrics (last anomaly, counts), optional residual view, adjustable history window.
- Compare: multi-service overlay constrained to the common time window (slider), plus a per-service anomaly/severity summary.
- Graph: interactive PyVis network and an expanded top-k RCA scores table.

Set `API_BASE_URL` to point to the FastAPI instance (defaults to `http://localhost:8000`).
- `scripts/seed_nab.py`: copy NAB Real Known Cause CSVs, normalise timestamps.
- `scripts/run_nab_ingest.py`: threaded ingest + detect with configurable chunking and detector selection.
- `scripts/run_nab_showcase.py`: orchestrate ingest, anomalies fetch, RCA fetch, metrics dump, optional Evidently report.
- `scripts/run_full_showcase.py`: spin up uvicorn, apply a graph (`--graph-file`), run the showcase, optionally launch Streamlit.
- `scripts/apply_graph.py`: POST a YAML graph payload to the API.
- `scripts/run_evidently_report.py`: build an Evidently drift report (`reports/latest.html`), retrieving measurements via SQL.
- `scripts/simulate_services.py`: generate synthetic multi-service data and run detection once.
- `scripts/seed_yahoo_s5.py`: ingest a small Yahoo S5 subset.
Use these scripts directly or via the Makefile wrappers.
- Environment variables: `.env` or system env (Pydantic Settings). Key options include DB URLs, detector parameters, and feature toggles.
  - SQLite (default): `USE_POSTGRES=false` keeps everything in `data/aiops.db`.
  - Postgres: set `USE_POSTGRES=true` and provide `POSTGRES_HOST`, `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`. Example: `USE_POSTGRES=true POSTGRES_HOST=localhost POSTGRES_DB=aiops POSTGRES_USER=aiops POSTGRES_PASSWORD=aiops`. Run `make docker-up` to start the bundled Postgres container and Grafana dashboards, or point to your own instance.
- YAML:
  - `configs/app.yaml`: app metadata, feature toggles, default ports, Postgres option, etc.
  - `configs/model.yaml`: detector/conformal/baseline/RCA defaults.
  - `configs/data.yaml`: default data sources and dependency graph (NAB loaded by default).
  - `configs/graphs/*.yaml`: alternative graph topologies (NAB, microservices).
- Grafana: dashboards under `grafana/dashboards/`, datasource definitions under `grafana/provisioning/`.
- `data/aiops.db`: SQLite database pre-populated via the NAB showcase. Version it if you want users to skip long runs.
- `data/nab/`: normalised NAB CSVs (produced by `make seed-nab`).
- `reports/nab_summary.json`: summary created by `run_nab_showcase.py`.
- `reports/latest.html`: Evidently drift report (optional).
- Run `make test` for the full suite, or a targeted `pytest tests/test_api.py::test_ingest_detect_and_rca_flow`.
- Lint via `make lint`; auto-format with `make fmt`.
- GitHub Actions (`.github/workflows/ci.yml`) performs lint, tests, and Docker image builds on `main`.
`docker-compose.yml` spins up the API, the Streamlit app, Grafana (with mounted dashboards), and supporting services. Use the Make targets for lifecycle management. Provide the `.env` file and optional volumes (`data/`, `reports/`) to persist state.
Prometheus and Tempo endpoints can be configured through `.env` (`ENABLE_PROMETHEUS`, `PROMETHEUS_BASE_URL`, `ENABLE_OTEL`, `TEMPO_BASE_URL`). When enabled, the API fetches metrics/traces from those systems, and Grafana dashboards (`grafana/dashboards/`) render real-time views alongside the anomalies stored locally.
- Grafana default credentials: `admin`/`admin` (prompted to change on first login).
- Set `PROMETHEUS_BASE_URL` or `TEMPO_BASE_URL` to remote instances if you want to reuse an existing observability stack.
- Update `grafana/provisioning/datasources/datasources.yaml` if your Postgres/Prometheus endpoints differ from the defaults exposed by `docker-compose`.
- Enable Tempo/OTel by default for live graph reconstruction when traces are available.
- Add deep-learning detectors (N-BEATS, PatchTST) behind optional extras.
- Package reproducible demo datasets plus pre-built dashboards for easy sharing (e.g., Grafana JSON + Streamlit presets).
- Extend Prometheus exposure (per-service counters, pipeline metrics).