AIOps Copilot - real-time anomaly detection and root-cause ranking for microservices (FastAPI backend, Streamlit demo, NAB dataset showcase).


RAPHCVR/AIOps


AIOps Copilot

AIOps Copilot is an end-to-end playground for anomaly detection, root-cause analysis, and observability on time-series coming from distributed services. The stack ships with a FastAPI backend, Streamlit dashboard, persistence (SQLite/Postgres), scripted pipelines, Grafana assets, and test suites.

Highlights

  • Real-time anomaly detection: rolling-median MAD baseline and forecast + conformal prediction, with per-series calibration.
  • Root Cause Analysis (RCA): graph propagation with exponential decay, configurable via YAML or OpenTelemetry Tempo traces.
  • FastAPI surface: ingest, batch detect, anomaly history, RCA ranking, dependency graph CRUD, Prometheus-style metrics.
  • Streamlit UI: overview metrics, service drill-down with conformal bands, interactive dependency graph + RCA table.
  • Turn-key datasets: NAB benchmark CSVs, Yahoo S5 miniature set, synthetic simulators, and a pre-built data/aiops.db.
  • Operational scripts: showcase runners, graph applier, Evidently drift reports, simulators, and dataset seeders.
  • CI-ready: lint (Ruff/Black/isort), unit + integration tests (Pytest), Dockerfiles for API and Streamlit.
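
As a rough illustration of the baseline detector's idea (the real implementation lives in app/detectors/baseline.py and is more involved), a rolling-median + MAD scorer can be sketched as:

```python
# Sketch of a rolling-median + MAD anomaly baseline, assuming a fixed trailing
# window and a robust z-score threshold; illustrative only, not the project's code.
from statistics import median

def mad_anomalies(values, window=5, threshold=3.5):
    """Flag points whose robust z-score vs. the trailing window exceeds threshold."""
    flags = []
    for i, y in enumerate(values):
        past = values[max(0, i - window):i]
        if len(past) < window:
            flags.append(False)  # not enough history to judge
            continue
        med = median(past)
        mad = median(abs(p - med) for p in past) or 1e-9  # guard against zero MAD
        score = 0.6745 * abs(y - med) / mad               # robust z-score
        flags.append(score > threshold)
    return flags

series = [10.0, 10.2, 9.9, 10.1, 10.0, 30.0, 10.1]
print(mad_anomalies(series))  # only the spike at 30.0 is flagged
```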

Architecture at a Glance

```mermaid
flowchart LR
    Clients[[Clients]]
    API[FastAPI API]
    DB[(Postgres / SQLite)]
    Detectors[Detectors]
    RCA[RCA Engine]
    Evidently
    Streamlit[Streamlit Dashboard]
    Grafana
    OTel[[Grafana / OTel / Tempo]]

    Clients -->|HTTP| API
    API -->|Persist| DB
    API -->|Scores| Detectors
    API -->|Propagation| RCA
    API -->|Export| Evidently
    API -->|REST| Streamlit
    Streamlit -->|Dashboards| Grafana
    Streamlit -->|Traces & Metrics| OTel
```

Prerequisites

  • Python >= 3.12
  • pip/virtualenv
  • Optional: Docker + Docker Compose, Tempo, Prometheus, Grafana

Quickstart (No Re-training)

Clone the repository and make sure data/aiops.db is present (it contains pre-computed NAB anomalies), then:

```shell
python -m venv .venv
source .venv/bin/activate       # Windows PowerShell: .\.venv\Scripts\Activate.ps1
make setup                      # install dependencies (editable mode)

make run-api                    # start FastAPI with auto reload
make streamlit                  # launch the dashboard (API_BASE_URL defaults to http://localhost:8000)
```

_Streamlit service drill-down with conformal prediction band and anomaly markers._

make setup installs the full dev toolchain, including pmdarima and statsforecast (AutoARIMA + conformal extras). On Windows, make sure Microsoft C++ Build Tools are available before running it.

The backend loads the NAB dependency graph from configs/data.yaml automatically. Existing anomalies and RCA scores stored in data/aiops.db become immediately visible in the UI.

Optional dependencies

If you prefer a runtime-only install (pip install -e .), you can add the forecast extras (AutoARIMA + StatsForecast) manually; make setup already pulls them in through the dev extras.

```shell
pip install -e ".[forecast]"
```

On Windows you may need Microsoft C++ Build Tools before installing pmdarima.
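
These extras power the forecast detector; conceptually, the conformal layer turns held-out forecast residuals into a calibrated band. A minimal split-conformal sketch (an illustration of the idea only, not the code in app/detectors/conformal.py):

```python
# Split-conformal band sketch: the (1 - alpha) quantile of absolute calibration
# residuals becomes a symmetric interval half-width around each forecast.
import math

def conformal_band(residuals, alpha=0.1):
    """Return the split-conformal half-width for coverage level (1 - alpha)."""
    scores = sorted(abs(r) for r in residuals)
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))  # conformal rank, clipped to n
    return scores[k - 1]

def is_anomaly(y, forecast, half_width):
    """Flag a point that falls outside the conformal band around the forecast."""
    return abs(y - forecast) > half_width

residuals = [0.1, -0.2, 0.3, -0.1, 0.2, 0.4, -0.3, 0.1, 0.2, -0.2]
print(conformal_band(residuals, alpha=0.1))  # → 0.4
```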

Make Targets Reference

| Command | Description |
| --- | --- |
| `make setup` | Upgrade pip and install the project in editable mode with dev extras |
| `make run-api` | Launch FastAPI (reloads when running a single worker; controlled by `UVICORN_*`) |
| `make streamlit` | Open the Streamlit dashboard (`app/viz/dashboard.py`) |
| `make lint` / `make fmt` | Ruff + Black + isort (check or apply) |
| `make test` | Pytest with coverage (`app` package) |
| `make seed-nab` | Normalise NAB CSVs into `data/nab/` |
| `make nab-detect` | Ingest/detect NAB services (baseline detector by default) |
| `make nab-showcase` | Full NAB ingest + detect + RCA summary + optional Evidently report |
| `make full-showcase` | Start the API, apply the graph, run the showcase, optionally launch Streamlit |
| `make graph-nab` | Push `configs/graphs/nab.yaml` to the API |
| `make graph-default` | Restore the microservices graph (`configs/graphs/microservices.yaml`) |
| `make simulate` | Generate synthetic metrics and run detection |
| `make report` | Generate an Evidently drift report for `checkout` |
| `make docker-up` | Compose stack: API, Streamlit, Grafana |
| `make docker-down` | Stop and remove containers/volumes |

All CLI flags exposed in scripts are documented via --help. make nab-detect/nab-showcase accept overrides for detector (forecast or baseline), services list, worker count, chunk size, anomaly limits, and report generation.

FastAPI Endpoints

| Method & Path | Purpose |
| --- | --- |
| `GET /health` | Health probe |
| `POST /ingest/series` | Persist measurements for one service (`SeriesIngestRequest`) |
| `POST /detect/batch` | Run a detector (baseline, or forecast + conformal) on multiple series |
| `GET /anomalies` | Fetch stored anomalies (filter by `service`, `since`, `limit`) |
| `GET /graph` | Retrieve the dependency graph (nodes + weighted edges) |
| `POST /graph` | Merge/normalise an incoming graph payload |
| `GET /rca/topk` | Ranked RCA scores (default `k=5`) |
| `GET /measurements` | Fetch raw measurements (`service`, optional `since`, `limit`) |
| `GET /metrics` | Prometheus exposition (ingested points, anomaly count, detect latency) |

API schemas live in app/models/schemas.py. Settings are controlled via .env (see .env.example) and YAML files in configs/.

Sample requests

```shell
curl -X POST http://localhost:8000/ingest/series \
  -H "Content-Type: application/json" \
  -d '{
        "service_id": "checkout",
        "points": [
          {"ts": "2025-01-01T12:00:00Z", "y": 123.4},
          {"ts": "2025-01-01T12:01:00Z", "y": 120.1}
        ]
      }'

curl -X POST http://localhost:8000/detect/batch \
  -H "Content-Type: application/json" \
  -d '{
        "series": [
          {
            "service_id": "checkout",
            "points": [
              {"ts": "2025-01-01T12:00:00Z", "y": 123.4},
              {"ts": "2025-01-01T12:01:00Z", "y": 120.1}
            ]
          }
        ],
        "detector": "forecast",
        "alpha": 0.1
      }'

# inspect latest measurements
curl "http://localhost:8000/measurements?service_id=checkout&limit=200"
```
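
The same ingest call can be made from Python with only the standard library; the payload shape mirrors the cURL example above:

```python
# Build a SeriesIngestRequest-shaped payload and POST it to /ingest/series.
import json
import urllib.request

def build_ingest_payload(service_id, points):
    """points: iterable of (iso_ts, value) pairs -> dict shaped like the cURL body."""
    return {
        "service_id": service_id,
        "points": [{"ts": ts, "y": y} for ts, y in points],
    }

def post_ingest(payload, base_url="http://localhost:8000"):
    req = urllib.request.Request(
        f"{base_url}/ingest/series",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

payload = build_ingest_payload(
    "checkout",
    [("2025-01-01T12:00:00Z", 123.4), ("2025-01-01T12:01:00Z", 120.1)],
)
# post_ingest(payload)  # requires the API to be running locally
```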

Detection & RCA Pipeline

  1. Ingestion: /ingest/series writes rows into measurements (SQLAlchemy models in app/io/writers.py).
  2. Detection:
    • Baseline: rolling median + MAD (app/detectors/baseline.py).
    • Forecast: AutoARIMA (optional dependency) or Holt-Winters fallback + conformal calibration (app/detectors/forecast.py, app/detectors/conformal.py).
  3. Persistence: detected anomalies stored in anomalies table, conformal ratios saved in api_state.local_scores.
  4. RCA: weighted directed graph (app/rca/graph_builder.py) with exponential decay ranking (app/rca/rca_ranker.py).
  5. Graph sources: configs/data.yaml (default), YAML files under configs/graphs/, or Tempo traces (app/io/otel_tempo.py) if enabled.
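
The propagation step (4) can be illustrated with a small sketch; the edge representation, decay handling, and hop limit here are assumptions for illustration, not the exact logic of app/rca/rca_ranker.py:

```python
# Blame propagation over a weighted directed dependency graph with exponential
# decay per hop: anomalous services push decayed blame onto their dependencies.
def propagate_scores(edges, local_scores, decay=0.5, hops=3):
    """edges: {service: [(dependency, weight), ...]}; returns ranked (service, blame)."""
    blame = dict(local_scores)
    frontier = dict(local_scores)
    for _ in range(hops):
        nxt = {}
        for svc, score in frontier.items():
            for dep, w in edges.get(svc, []):
                contrib = score * w * decay  # decay once per hop
                nxt[dep] = nxt.get(dep, 0.0) + contrib
                blame[dep] = blame.get(dep, 0.0) + contrib
        if not nxt:
            break
        frontier = nxt
    return sorted(blame.items(), key=lambda kv: kv[1], reverse=True)

edges = {"checkout": [("payments", 1.0)], "payments": [("db", 0.8)]}
print(propagate_scores(edges, {"checkout": 1.0}))
```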

Bring Your Own Data

To analyse your own workloads:

  1. Prepare the time series: each service metric needs an identifier (service_id) and a list of {ts, y} points with ISO8601 timestamps. You can ingest in bulk via /ingest/series (see cURL above) or adapt scripts/run_nab_ingest.py by pointing --data-dir to a folder of CSV files (column timestamp/value or ts/y).
  2. Run detection: call /detect/batch or reuse run_nab_ingest.py with --services enumerating the CSV basenames. The detector flag toggles baseline vs forecast.
  3. Customize the graph: adjust configs/data.yaml for static graphs, provide alternative YAML under configs/graphs/, or post new edges with scripts/apply_graph.py --graph <file>. If Tempo tracing is enabled (ENABLE_OTEL=true and TEMPO_BASE_URL set), the API merges live traces into the graph.
  4. Tune settings: override detection batch sizes, RCA decay, database URLs, etc., through .env variables (see .env.example).
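
For step 1, a tiny helper that turns a timestamp/value CSV into the {ts, y} point list expected by /ingest/series might look like this (column names follow the ones mentioned above):

```python
# Convert CSV rows with `timestamp,value` columns into /ingest/series points.
import csv
import io

def csv_to_points(csv_text, ts_col="timestamp", y_col="value"):
    """Parse CSV text into a list of {ts, y} dicts ready for the ingest payload."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"ts": row[ts_col], "y": float(row[y_col])} for row in reader]

sample = "timestamp,value\n2025-01-01T12:00:00Z,123.4\n2025-01-01T12:01:00Z,120.1\n"
print(csv_to_points(sample))
```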

Once ingested, all anomalies/RCA scores become visible in the Streamlit dashboard and Grafana panels.

Streamlit Dashboard

  • Overview: raw /metrics output, active services list.
  • Service: raw measurements overlaid with predictions + conformal band, anomaly markers, key metrics (last anomaly, counts), optional residual view, adjustable history window.
  • Compare: multi-service overlay constrained to the common time window (slider), plus per-service anomaly/severity summary.
  • Graph: interactive PyVis network and expanded top-k RCA scores table.

Set API_BASE_URL to point to the FastAPI instance (defaults to http://localhost:8000).

Scripts & Pipelines

  • scripts/seed_nab.py: copy NAB Real Known Cause CSVs, normalise timestamps.
  • scripts/run_nab_ingest.py: threaded ingest+detect with configurable chunking and detector selection.
  • scripts/run_nab_showcase.py: orchestrate ingest, anomalies fetch, RCA fetch, metrics dump, optional Evidently report.
  • scripts/run_full_showcase.py: spin up uvicorn, apply graph (--graph-file), run showcase, optionally launch Streamlit.
  • scripts/apply_graph.py: POST a YAML graph payload to the API.
  • scripts/run_evidently_report.py: build Evidently drift report (reports/latest.html) retrieving measurements via SQL.
  • scripts/simulate_services.py: generate synthetic multi-service data and run detection once.
  • scripts/seed_yahoo_s5.py: ingest small Yahoo S5 subset.

Use these scripts directly or via the Makefile wrappers.

Configuration

  • Environment variables: .env or system env (Pydantic Settings). Key options include DB URLs, detector parameters, feature toggles.
    • SQLite (default): USE_POSTGRES=false keeps everything in data/aiops.db.
    • Postgres: set USE_POSTGRES=true and provide POSTGRES_HOST, POSTGRES_DB, POSTGRES_USER, POSTGRES_PASSWORD. Example:

      ```
      USE_POSTGRES=true
      POSTGRES_HOST=localhost
      POSTGRES_DB=aiops
      POSTGRES_USER=aiops
      POSTGRES_PASSWORD=aiops
      ```

      Run make docker-up to start the bundled Postgres container and Grafana dashboards, or point to your own instance.
  • YAML:
    • configs/app.yaml: app metadata, feature toggles, default ports, Postgres option, etc.
    • configs/model.yaml: detector/conformal/baseline/RCA defaults.
    • configs/data.yaml: default data sources and dependency graph (NAB loaded by default).
    • configs/graphs/*.yaml: alternative graph topologies (NAB, microservices).
  • Grafana: dashboards under grafana/dashboards/, datasource definitions under grafana/provisioning/.
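
The exact schema of the graph YAML files is not spelled out in this README; a hypothetical minimal payload consistent with the API's nodes + weighted edges model (field names here are illustrative only, check configs/graphs/*.yaml for the real shape) could look like:

```yaml
# Hypothetical dependency graph for scripts/apply_graph.py -- illustrative field names.
nodes:
  - checkout
  - payments
edges:
  - source: checkout
    target: payments
    weight: 0.8
```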

Data Artifacts

  • data/aiops.db: SQLite database pre-populated via NAB showcase. Version it if you want users to skip long runs.
  • data/nab/: normalised NAB CSVs (produced by make seed-nab).
  • reports/nab_summary.json: summary created by run_nab_showcase.py.
  • reports/latest.html: Evidently drift report (optional).

Testing & Quality

  • Run make test for the full suite, or target a single test, e.g. pytest tests/test_api.py::test_ingest_detect_and_rca_flow.
  • Lint via make lint; auto-format with make fmt.
  • GitHub Actions (.github/workflows/ci.yml) performs lint, tests, and Docker image builds on main.

Docker Compose

docker-compose.yml spins up the API, Streamlit app, Grafana (with mounted dashboards), and supporting services. Use the Make targets for lifecycle management. Provide the .env file and optional volumes (data/, reports/) to persist state.

Prometheus and Tempo endpoints can be configured through .env (ENABLE_PROMETHEUS, PROMETHEUS_BASE_URL, ENABLE_OTEL, TEMPO_BASE_URL). When enabled, the API fetches metrics/traces from those systems, and Grafana dashboards (grafana/dashboards/) render real-time views alongside the anomalies stored locally.

  • Grafana default credentials: admin/admin (prompted to change on first login).
  • Set PROMETHEUS_BASE_URL or TEMPO_BASE_URL to remote instances if you want to reuse an existing observability stack.
  • Update grafana/provisioning/datasources/datasources.yaml if your Postgres/Prometheus endpoints differ from the defaults exposed by docker-compose.

Roadmap & Known Gaps

  1. Enable Tempo/OTel by default for live graph reconstruction when traces are available.
  2. Add deep-learning detectors (N-BEATS, PatchTST) behind optional extras.
  3. Package reproducible demo datasets plus pre-built dashboards for easy sharing (e.g., Grafana JSON + Streamlit presets).
  4. Extend Prometheus exposure (per-service counters, pipeline metrics).
