Component(s)
target allocator
What happened?
Description
When metrics are enabled on the target allocator, the Collector comes up in a partially initialized state, causing issues such as the HPA not initializing, depending on which features are enabled.
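The setting that triggers the behavior, isolated from the full spec in the reproduction steps below, is:
targetAllocator:
  observability:
    metrics:
      enableMetrics: true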
Steps to Reproduce
Deploy Prometheus Operator CRDs for ServiceMonitor and PodMonitor
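If needed, their presence can be verified before installing the operator (these are the standard CRD names):
kubectl get crd servicemonitors.monitoring.coreos.com podmonitors.monitoring.coreos.com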
Deploy the OpenTelemetry Operator with the following command:
helm upgrade --install opentelemetry-operator open-telemetry/opentelemetry-operator --atomic --timeout 1800s --version 0.74.2 --create-namespace -n otel-operator --set manager.collectorImage.repository=otel/opentelemetry-collector-contrib --set manager.createRbacPermissions=true
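Once the release is installed, a quick sanity check that the operator is running (illustrative; the namespace matches the command above):
kubectl -n otel-operator get pods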
Create an OpenTelemetryCollector with a TargetAllocator:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
name: otel
namespace: monitoring
spec:
config:
processors:
batch: {}
tail_sampling:
policies:
- name: drop_noisy_traces_url
type: string_attribute
string_attribute:
key: http.url
values:
- \/metrics
- \/health
- \/livez
- \/readyz
- \/prometheus
- \/actuator*
- opentelemetry\.proto
- favicon\.ico
enabled_regex_matching: true
invert_match: true
k8sattributes:
extract:
annotations:
- from: pod
key: splunk.com/sourcetype
- from: namespace
key: splunk.com/exclude
tag_name: splunk.com/exclude
- from: pod
key: splunk.com/exclude
tag_name: splunk.com/exclude
- from: namespace
key: splunk.com/index
tag_name: com.splunk.index
- from: pod
key: splunk.com/index
tag_name: com.splunk.index
labels:
- key: app
metadata:
- k8s.namespace.name
- k8s.node.name
- k8s.pod.name
- k8s.pod.uid
- container.id
- container.image.name
- container.image.tag
filter:
node_from_env_var: K8S_NODE_NAME
pod_association:
- sources:
- from: resource_attribute
name: k8s.pod.uid
- sources:
- from: resource_attribute
name: k8s.pod.ip
- sources:
- from: resource_attribute
name: ip
- sources:
- from: connection
- sources:
- from: resource_attribute
name: host.name
memory_limiter:
check_interval: 5s
limit_percentage: 90
resource:
attributes:
- action: upsert
key: gke_cluster
value: ${CLUSTER_NAME}
- action: upsert
key: cluster_name
value: staging-digital
- key: cluster
value: ${CLUSTER_NAME}
action: upsert
resourcedetection:
detectors:
- env
- gcp
- system
override: true
timeout: 10s
extensions:
health_check:
endpoint: ${MY_POD_IP}:13133
k8s_observer:
auth_type: serviceAccount
node: ${K8S_NODE_NAME}
receivers:
prometheus:
config:
global:
evaluation_interval: 15s
scrape_interval: 30s
scrape_timeout: 10s
scrape_configs:
- job_name: kubernetes-apiservers
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- action: keep
regex: default;kubernetes;https
source_labels:
- __meta_kubernetes_namespace
- __meta_kubernetes_service_name
- __meta_kubernetes_endpoint_port_name
metric_relabel_configs:
- source_labels: [__name__]
regex: etcd_request_duration_seconds_bucket
action: drop
- source_labels: [__name__]
regex: apiserver_(response_sizes_bucket|request_duration_seconds_bucket|request_slo_duration_seconds_bucket)
action: drop
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- job_name: kubernetes-nodes
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$$1/proxy/metrics
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
metric_relabel_configs:
- source_labels: [__name__]
regex: storage_operation_duration_seconds_bucket|node_filesystem_device_error|
action: drop
- source_labels: [__name__]
regex: kubelet_(runtime_operations_duration_seconds_bucket|http_requests_duration_seconds_bucket)
action: drop
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_scrape
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_pod_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $$1:$$2
source_labels:
- __address__
- __meta_kubernetes_pod_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_name
target_label: pod
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: node
- action: drop
regex: Pending|Succeeded|Failed
source_labels:
- __meta_kubernetes_pod_phase
metric_relabel_configs:
- action: drop
source_labels: [__name__]
regex: istio_agent_.*|istiod_.*|istio_build|citadel_.*|galley_.*|pilot_[^p].*|envoy_cluster_[^u].*|envoy_cluster_update.*|envoy_listener_[^dh].*|envoy_server_[^mu].*|envoy_wasm_.*
- action: labeldrop
regex: chart|destination_app|destination_version|heritage|.*operator.*|istio.*|release|security_istio_io_.*|service_istio_io_.*|sidecar_istio_io_inject|source_app|source_version
- source_labels: [__name__]
regex: coredns_dns_(request|response)_(size_bytes_bucket|duration_seconds_bucket)
action: drop
- source_labels: [__name__]
regex: hystrix_latency_Execute
action: drop
- action: labeldrop
regex: source_principal|source_version|source_cluster|pod_template_hash|destination_cluster|destination_principal
scrape_interval: 30s
scrape_timeout: 5s
- job_name: cockroach-stats
kubernetes_sd_configs:
- role: endpoints
selectors:
- role: endpoints
label: "app.kubernetes.io/component=cockroachdb"
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $$1:$$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: namespace
- action: replace
source_labels:
- __meta_kubernetes_service_name
target_label: service_name
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: node
metric_relabel_configs:
- source_labels: [__name__]
regex: raft_.*
action: drop
- source_labels: [__name__]
regex: sql_(mem_distsql_max_bucket|stats_txn_stats_collection_duration_bucket|exec_latency_internal_bucket|txn_latency_bucket|service_latency_internal_bucket)
action: drop
- source_labels: [__name__]
regex: sql_(txn_latency_internal_bucket|service_latency_bucket|distsql_exec_latency_bucket|distsql_service_latency_bucket|stats_flush_duration_bucket|mem_sql_txn_max_bucket)
action: drop
- source_labels: [__name__]
regex: exec_latency_bucket|txn_durations_bucket|liveness_heartbeatlatency_bucket|admission_wait_durations_sql_sql_response_bucket|changefeed_flush_hist_nanos_bucket|admission_wait_durations_sql_kv_response_bucket
action: drop
- job_name: kube-state-metrics
kubernetes_sd_configs:
- role: endpoints
selectors:
- role: endpoints
label: "app.kubernetes.io/name=kube-state-metrics"
relabel_configs:
- action: keep
regex: true
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scrape
- action: replace
regex: (https?)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_scheme
target_label: __scheme__
- action: replace
regex: (.+)
source_labels:
- __meta_kubernetes_service_annotation_prometheus_io_path
target_label: __metrics_path__
- action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $$1:$$2
source_labels:
- __address__
- __meta_kubernetes_service_annotation_prometheus_io_port
target_label: __address__
- action: replace
source_labels:
- __meta_kubernetes_namespace
target_label: exporter_namespace
- action: replace
source_labels:
- __meta_kubernetes_pod_node_name
target_label: node
metric_relabel_configs:
- source_labels: [__name__]
regex: kube_pod_status_(reason|scheduled|ready)
action: drop
- regex: exporter_namespace
action: labeldrop
- job_name: kubernetes-nodes-cadvisor
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
honor_timestamps: true
kubernetes_sd_configs:
- role: node
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/$$1/proxy/metrics/cadvisor
source_labels:
- __meta_kubernetes_node_name
target_label: __metrics_path__
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
metric_relabel_configs:
- source_labels: [__name__]
regex: container_(tasks_state|blkio_device_usage_total|file_descriptors|sockets|threads|threads_max|processes|spec_cpu_shares|start_time_seconds|spec_memory_limit_bytes|ulimits_soft)
action: drop
- source_labels: [__name__]
regex: container_spec_memory_(reservation_limit_bytes|swap_limit_bytes|limit_bytes)
action: drop
- source_labels: [__name__]
regex: container_memory_(failures_total|mapped_file|failcnt|cache|rss)
action: drop
otlp:
protocols:
grpc:
endpoint: ${MY_POD_IP}:4317
keepalive:
enforcement_policy:
min_time: 5s
permit_without_stream: true
server_parameters:
time: 5s
timeout: 10s
http:
endpoint: ${MY_POD_IP}:4318
zipkin:
endpoint: ${MY_POD_IP}:9411
exporters:
prometheusremotewrite:
endpoint: https://mimir/api/v1/push
retry_on_failure:
enabled: true
initial_interval: 1s
max_interval: 10s
max_elapsed_time: 30s
otlp:
endpoint: otlp:4317
tls:
insecure: true
sending_queue:
enabled: true
num_consumers: 10
queue_size: 5000
service:
telemetry:
metrics:
address: "${MY_POD_IP}:8888"
level: basic
logs:
level: "warn"
extensions:
- health_check
- k8s_observer
pipelines:
traces:
receivers:
- otlp
- zipkin
processors:
- memory_limiter
- resourcedetection
- resource
- k8sattributes
- tail_sampling
- batch
exporters:
- otlp
metrics:
receivers:
- prometheus
- otlp
processors:
- memory_limiter
- batch
exporters:
- prometheusremotewrite
env:
- name: K8S_NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: K8S_POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: K8S_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: K8S_POD_UID
valueFrom:
fieldRef:
fieldPath: metadata.uid
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: CLUSTER_NAME
value: "staging-tools"
mode: statefulset
podAnnotations:
sidecar.istio.io/inject: "false"
prometheus.io/scrape: "true"
prometheus.io/port: "8888"
priorityClassName: highest-priority
autoscaler:
behavior:
scaleUp:
stabilizationWindowSeconds: 30
maxReplicas: 10
minReplicas: 3
targetCPUUtilization: 70
targetMemoryUtilization: 70
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 300m
memory: 600Mi
targetAllocator:
allocationStrategy: consistent-hashing
enabled: true
filterStrategy: relabel-config
observability:
metrics:
enableMetrics: true
prometheusCR:
enabled: true
scrapeInterval: 30s
replicas: 2
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 300m
memory: 300Mi
Expected Result
The Collector initializes with an HPA and a minimum of 3 replicas.
Actual Result
The Collector never fully comes up or reports status, scaling never happens, and an error is reported stating the ServiceMonitor could not be created.
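The partially initialized state can be observed with commands along these lines (illustrative; the names assume the spec above):
kubectl -n monitoring get opentelemetrycollector otel   # never reports a status
kubectl -n monitoring get hpa                           # the expected HPA never appears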
Kubernetes Version
1.30.5
Operator version
0.74.2
Collector version
0.113.0
Environment information
Environment
GKE
Log output
From a describe on the Collector object:
Events:
  Type     Reason  Age                  From                    Message
  ----     ------  ----                 ----                    -------
  Warning  Error   15m (x35 over 109m)  opentelemetry-operator  failed to create objects for otel: no kind is registered for the type v1.ServiceMonitor in scheme "pkg/runtime/scheme.go:100"
From the operator logs:
{"level":"ERROR","timestamp":"2024-10-02T19:50:31Z","message":"Reconciler error","controller":"opentelemetrycollector","controllerGroup":"opentelemetry.io","controllerKind":"OpenTelemetryCollector","OpenTelemetryCollector":{"name":"otel-prometheus","namespace":"monitoring-system"},"namespace":"monitoring-system","name":"otel-prometheus","reconcileID":"8db2cbd5-c12f-453d-9cec-56abfec51c0c","error":"failed to create objects for otel-prometheus: no kind is registered for the type v1.ServiceMonitor in scheme "pkg/runtime/scheme.go:100"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222"}
Additional context
Disabling metrics fixed the issue and allowed the Operator to finish setting up the Collector and TargetAllocator properly. I thought I would open a ticket with this isolated error; I've turned off metrics for now:
targetAllocator:
observability:
metrics:
enableMetrics: false