
[Problem] Unable to update grpc_server_max_recv_msg_size and grpc_server_max_send_msg_size in server stanza setting in TempoStack #4610

vsomwanshi opened this issue Jan 24, 2025 · 0 comments
We have Tempo Operator version 0.14.1-2 provided by Red Hat installed in our environment, and we have created a TempoStack instance using the configuration below.

```yaml
apiVersion: tempo.grafana.com/v1alpha1
kind: TempoStack
metadata:
  name: tempostack
  namespace: tempo
spec:
  observability:
    grafana:
      instanceSelector: {}
    metrics: {}
    tracing:
      jaeger_agent_endpoint: 'localhost:6831'
  timeout: 30s
  resources:
    total:
      limits:
        cpu: '6'
        memory: 15Gi
  search:
    defaultResultLimit: 20
    maxDuration: 0s
  managementState: Managed
  limits:
    global:
      ingestion:
        maxBytesPerTrace: 0
      query:
        maxSearchDuration: 0s
  serviceAccount: dev-tempostack
  images: {}
  template:
    compactor:
      replicas: 3
    distributor:
      component:
        replicas: 1
      tls:
        enabled: false
    gateway:
      component:
        replicas: 1
      enabled: false
      ingress:
        route: {}
    ingester:
      replicas: 1
    querier:
      replicas: 2
    queryFrontend:
      component:
        replicas: 1
      jaegerQuery:
        authentication:
          enabled: true
          sar: '{"namespace": "gtempo", "resource": "pods", "verb": "get"}'
        enabled: true
        ingress:
          route:
            termination: edge
          type: route
        monitorTab:
          enabled: false
          prometheusEndpoint: ''
        servicesQueryDuration: 72h0m0s
        tempoQuery: {}
  replicationFactor: 1
  storage:
    secret:
      name: grafana-tempo-cos
      type: s3
    tls:
      enabled: false
  storageSize: 100Gi
  hashRing:
    memberlist: {}
  retention:
    global:
    traces: 12h0m0s
```

With the above configuration, the operator has deployed the TempoStack and its dependent components.

Here, I'm looking for help with two things:

  1. In the configuration above, resource limits are set globally:

```yaml
resources:
  total:
    limits:
      cpu: '6'
      memory: 15Gi
```

Whenever we try to set resource limits individually for a component, e.g. for the compactor via spec.template.compactor.resources.limits (see the sketch after this item), the values get overwritten. As a result, the compactor pod does not get enough compute resources and keeps restarting with CrashLoopBackOff.
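For illustration, this is roughly the per-component override we are trying to apply on the TempoStack CR; the CPU/memory numbers here are only example values, not our exact settings:

```yaml
spec:
  template:
    compactor:
      replicas: 3
      resources:        # per-component resources we are trying to set
        limits:
          cpu: '2'      # example value
          memory: 4Gi   # example value
```

Every time we apply something like this, the compactor resources are reverted as described above.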

  2. We have noticed issues with the Grafana Tempo instance where some trace spans appeared to be missing (and were never linked to the parent trace).

Looking deeper, we noticed the following errors in the pods of the TempoStack instance:

Distributor pod:

```
level=error ts=2025-01-21T14:52:49.384256785Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 142414 bytes to trace 1b4722a6b98fd1da98f0370af4089927 for tenant single-tenant"
```

Ingester pod:

```
level=warn ts=2025-01-21T20:28:55.532118428Z caller=server.go:1184 method=/tempopb.Pusher/PushBytesV2 duration=80.163µs msg=gRPC err="rpc error: code = FailedPrecondition desc = TRACE_TOO_LARGE: max size of trace (5000000) exceeded while adding 177635 bytes to trace c5ca65779266a32c8dd98a5cd54357f8 for tenant single-tenant"
```

Querier pod:

```
ResourceExhausted desc = grpc: received message after decompression larger than max (4993760 vs. 4194304)
```

From some research, it seems like we need to bump up the maximum trace size. By default, that is set to 5000000 bytes (about 5 MB).
As per the ingestion limits documentation (https://grafana.com/docs/tempo/latest/configuration/#ingestion-limits), "overrides" can be used to increase this (there is a caution against going too large, however). We have added the parameter as below.

```yaml
overrides:
  defaults:
    global:
      max_bytes_per_trace: 50000000
```

The issue with the querier pod seems to be due to a gRPC message size limit between TempoStack components. As suggested in #1097, I think we need to change settings in both tempo.yaml and tempo-query-frontend.yaml to increase these limits to at least the max_bytes_per_trace size.

We are trying to change the existing server stanza settings to larger values (see the sketch below), however our changes keep getting overwritten.
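For reference, this is the kind of change we are attempting in the generated Tempo configuration (and similarly for the query frontend); the numbers below are illustrative and would need to be at least as large as max_bytes_per_trace:

```yaml
server:
  grpc_server_max_recv_msg_size: 50000000  # illustrative; should be >= max_bytes_per_trace
  grpc_server_max_send_msg_size: 50000000  # illustrative; should be >= max_bytes_per_trace
```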

Can anyone help us here and suggest how and where to change these settings?

Thanks in advance!
