feat: implement observability log alertgroups #785

Merged
merged 1 commit into from
Apr 29, 2025

@h3adex (Contributor) commented Apr 17, 2025

Description

This PR introduces the ability to create log alert groups with rules that define when alerts should be triggered within our STACKIT Observability stack. The implementation is based on the API documentation available here:
https://docs.api.stackit.cloud/documentation/argus/version/v1#tag/logs/operation/v1_projects_instances_logs-alertgroups_create

Note that this PR is very similar to the alertgroups PR opened a week ago, because the two APIs are nearly identical. The only difference is that alertgroups expect PromQL expressions for metrics (Thanos), whereas logalertgroups expect LogQL expressions for logs (Loki).
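
To make the difference concrete, the two rule types might be declared side by side roughly as follows. This is only a sketch: it assumes the `stackit_observability_alertgroup` resource from the earlier PR accepts the same attributes as the log variant, and that `var.stackit_project_id` and `stackit_observability_instance.example` are defined as in the test configuration further down. The resource labels, group names, alert names, and both expressions are illustrative.

```hcl
# Metrics alert group: the expression is PromQL, evaluated against Thanos.
# Assumption: attribute names mirror the log alert group resource.
resource "stackit_observability_alertgroup" "metrics_sketch" {
  project_id  = var.stackit_project_id
  instance_id = stackit_observability_instance.example.instance_id
  name        = "MetricsSketchGroup"
  interval    = "1m"
  rules = [
    {
      alert      = "HighCpuUsage"
      expression = "sum(rate(container_cpu_usage_seconds_total{namespace=\"example\"}[5m])) > 1"
      for        = "60s"
    },
  ]
}

# Log alert group: the expression is LogQL, evaluated against Loki.
resource "stackit_observability_logalertgroup" "logs_sketch" {
  project_id  = var.stackit_project_id
  instance_id = stackit_observability_instance.example.instance_id
  name        = "LogsSketchGroup"
  interval    = "1m"
  rules = [
    {
      alert      = "TooManyErrorLines"
      expression = "sum(count_over_time({namespace=\"example\"} |= \"error\" [5m])) > 10"
      for        = "60s"
    },
  ]
}
```

Apart from the resource type, only the `expression` string changes between the two.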

Given the broad applicability of this feature and frequent customer requests for it, I have also included a guide demonstrating how to use it directly with SKE to send log alerts with the help of Promtail. This guide should also help you test it in a real-world scenario.

Terraform code to test it out:

variable "stackit_service_account_key_path" {
  type    = string
  default = "XXX"
}

variable "stackit_project_id" {
  type    = string
  default = "XXX"
}

provider "stackit" {
  default_region           = "eu01"
  service_account_key_path = var.stackit_service_account_key_path
}

provider "kubernetes" {
  host                   = yamldecode(stackit_ske_kubeconfig.example.kube_config).clusters.0.cluster.server
  client_certificate     = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).users.0.user.client-certificate-data)
  client_key             = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).users.0.user.client-key-data)
  cluster_ca_certificate = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).clusters.0.cluster.certificate-authority-data)
}

provider "helm" {
  kubernetes {
    host                   = yamldecode(stackit_ske_kubeconfig.example.kube_config).clusters.0.cluster.server
    client_certificate     = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).users.0.user.client-certificate-data)
    client_key             = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).users.0.user.client-key-data)
    cluster_ca_certificate = base64decode(yamldecode(stackit_ske_kubeconfig.example.kube_config).clusters.0.cluster.certificate-authority-data)
  }
}

resource "stackit_ske_cluster" "example" {
  project_id         = var.stackit_project_id
  name               = "example-name"
  kubernetes_version = "1.31"
  node_pools = [
    {
      name               = "standard"
      machine_type       = "c1.4"
      minimum            = 3
      maximum            = 9
      max_surge          = 3
      availability_zones = ["eu01-1", "eu01-2", "eu01-3"]
      os_version_min     = "4081.2.1"
      os_name            = "flatcar"
      volume_size        = 32
      volume_type        = "storage_premium_perf6"
    }
  ]
  maintenance = {
    enable_kubernetes_version_updates    = true
    enable_machine_image_version_updates = true
    start                                = "01:00:00Z"
    end                                  = "02:00:00Z"
  }
}

resource "stackit_ske_kubeconfig" "example" {
  project_id   = var.stackit_project_id
  cluster_name = stackit_ske_cluster.example.name
  refresh      = true
}

locals {
  alert_config = {
    route = {
      receiver        = "EmailStackit",
      repeat_interval = "1m",
      continue        = true
    }
    receivers = [
      {
        name = "EmailStackit",
        email_configs = [
          {
            to = "<your-email>"
          },
        ]
      }
    ]
  }
}

resource "stackit_observability_instance" "example" {
  project_id   = var.stackit_project_id
  name         = "example-instance"
  plan_name    = "Observability-Large-EU01"
  alert_config = local.alert_config
}

resource "stackit_observability_credential" "example" {
  instance_id = stackit_observability_instance.example.instance_id
  project_id  = var.stackit_project_id
}

resource "stackit_observability_logalertgroup" "example" {
  project_id  = var.stackit_project_id
  instance_id = stackit_observability_instance.example.instance_id
  name        = "TestLogAlertGroup"
  interval    = "1m"
  rules = [
    {
      alert      = "SimplePodLogAlertCheck"
      expression = "sum(rate({namespace=\"example\", pod=\"logger\"} |= \"Simulated error message\" [1m])) > 0"
      for        = "60s"
      labels = {
        severity = "critical"
      },
      annotations = {
        summary     = "Test Log Alert is working"
        description = "Test Log Alert"
      },
    },
  ]
}

resource "kubernetes_namespace" "monitoring" {
  metadata {
    name = "monitoring"
  }
}

resource "helm_release" "promtail" {
  name       = "promtail"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "promtail"
  namespace  = kubernetes_namespace.monitoring.metadata[0].name
  version    = "6.16.4"

  values = [
    <<-EOF
    config:
      clients:
      - url: "https://${stackit_observability_credential.example.username}:${stackit_observability_credential.example.password}@logs.<instance-id-in-portal>.argus.eu01.stackit.cloud/instances/${stackit_observability_instance.example.instance_id}/loki/api/v1/push"
    EOF
  ]
}

data "stackit_observability_logalertgroup" "test" {
  project_id  = var.stackit_project_id
  instance_id = stackit_observability_instance.example.instance_id
  name        = stackit_observability_logalertgroup.example.name
}

resource "kubernetes_namespace" "example" {
  metadata {
    name = "example"
  }
}

resource "kubernetes_pod" "logger" {
  metadata {
    name      = "logger"
    namespace = kubernetes_namespace.example.metadata[0].name
    labels = {
      app = "logger"
    }
  }

  spec {
    container {
      name  = "logger"
      image = "bash"
      command = [
        "bash",
        "-c",
        <<EOF
        while true; do
          sleep $(shuf -i 1-3 -n 1)  # Random sleep between 1 and 3 seconds
          echo "ERROR: $(date) - Simulated error message $(shuf -i 1-100 -n 1)" 1>&2
        done
        EOF
      ]
    }
  }
}
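
Once the configuration is applied, a couple of outputs can help confirm that the log alert group was created and can be read back through the data source. This is a minimal sketch, assuming the data source exposes the same `rules` attribute as the resource; the output names are made up for illustration.

```hcl
# Read the group back through the data source declared above.
output "log_alert_group_name" {
  value = data.stackit_observability_logalertgroup.test.name
}

# Assumption: the data source exposes `rules` just like the resource does.
output "log_alert_group_rules" {
  value = data.stackit_observability_logalertgroup.test.rules
}
```

Running `terraform output log_alert_group_rules` after `terraform apply` should then print the rule defined in `stackit_observability_logalertgroup.example`.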

Checklist

  • Issue was linked above
  • Code format was applied: make fmt
  • Examples were added / adjusted (see examples/ directory)
  • Docs are up-to-date: make generate-docs (will be checked by CI)
  • Unit tests got implemented or updated
  • Acceptance tests got implemented or updated (see e.g. here)
  • Unit tests are passing: make test (will be checked by CI)
  • No linter issues: make lint (will be checked by CI)

@h3adex force-pushed the feat/implement-log-alertgroups branch 3 times, most recently from dc534b7 to 2973d3a (April 17, 2025 13:51)
@h3adex force-pushed the feat/implement-log-alertgroups branch from 2973d3a to a3350e7 (April 17, 2025 13:59)
This PR was marked as stale after 7 days of inactivity and will be closed after another 7 days of further inactivity. If this PR should be kept open, just add a comment, remove the stale label or push new commits to it.

@github-actions bot added the Stale label Apr 25, 2025

@h3adex (Contributor, Author) commented Apr 25, 2025

Bump. Waiting for Holiday Season to be over :-)

@rubenhoenle removed the Stale label Apr 25, 2025
@bahkauv70 merged commit 3c20b77 into stackitcloud:main Apr 29, 2025
3 checks passed