Description
System Info
The HF-TGI server is running on Kubernetes. I executed text-generation-launcher --env inside the pod:
2023-07-12T12:58:48.739266Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: 31b36cca21fcd0e6b7db477a7545063e1b860156
Docker label: sha-31b36cc
nvidia-smi:
Wed Jul 12 12:58:48 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA A100 80G...  On   | 00000001:00:00.0 Off |                    0 |
   | N/A   35C    P0    71W / 300W |  46936MiB / 81920MiB |      0%      Default |
   |                               |                      |             Disabled |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
2023-07-12T12:58:48.739312Z  INFO text_generation_launcher: Args { model_id: "OpenAssistant/falcon-40b-sft-mix-1226", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 512, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "production-hf-text-generation-inference-6594cb8f5d-z4mdf", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: Some("tempo.monitoring:4317"), cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: true }
Model being used:
{
  "model_id": "OpenAssistant/falcon-40b-sft-mix-1226",
  "model_sha": "9ac6b7846fabe144646213cf1c6ee048b88272a7",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 512,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 16000,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "0.9.1",
  "sha": "31b36cca21fcd0e6b7db477a7545063e1b860156",
  "docker_label": "sha-31b36cc"
}
Hardware used (GPUs, how many, on which cloud) (nvidia-smi): see the nvidia-smi output above. Runs on an Azure Kubernetes cluster, VM spec: Standard_NC24ads_A100_v4
Deployment specificities (Kubernetes, EKS, AKS, any particular deployments): Runs on AKS and is installed through a Helm chart
The current version being used: 0.9.1
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
I installed the HF TGI server, Prometheus, Grafana, Loki, and Grafana Tempo on Kubernetes. The latter four are in the namespace monitoring, and the HF TGI server is in the namespace hf-tgi. HF-TGI is deployed with the environment variable OTLP_ENDPOINT: "tempo.monitoring:4317", i.e. it references the service tempo in the namespace monitoring on port 4317. That service is up and running.
So far this works fine: in Grafana under "Explore", I can select "Tempo", click on "Search", and run the query. It then finds many traces, mostly for the target /Health and sometimes /Decode.

Now, when I go to "Explore", select "Loki", and query the logs from the HF TGI pod, I can see the same info messages that appear in stdout on the server itself. Each message is JSON and contains a field "spans[0].trace_id". When I take the value of that field and search for it under "Explore" -> "Tempo" -> TraceQL, I get an error message saying that the trace was not found:
failed to get trace with id: XXXX Status: 404 Not Found Body: trace not found
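To check this outside the Grafana UI, the same lookup can be sketched in Python. This is only a rough sketch under several assumptions: the log line is a JSON object with a "spans" list whose first element carries "trace_id" (as described above), Tempo's trace-by-ID HTTP endpoint /api/traces/<traceID> is reachable, and the base URL http://tempo.monitoring:3200 is a guess that depends on how Tempo was installed. The function names and the sample log line are hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return spans[0].trace_id from one JSON log line, or None if it is absent."""
    try:
        record = json.loads(log_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    spans = record.get("spans")
    if not isinstance(spans, list) or not spans or not isinstance(spans[0], dict):
        return None
    return spans[0].get("trace_id")


def tempo_trace_url(base_url: str, trace_id: str) -> str:
    """Build the trace-by-ID URL; base_url is an assumption about the deployment."""
    return f"{base_url.rstrip('/')}/api/traces/{trace_id}"


def trace_exists(base_url: str, trace_id: str, timeout: float = 5.0) -> bool:
    """Return True if Tempo answers 200 for the trace, False if it answers 404."""
    url = tempo_trace_url(base_url, trace_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # other errors (auth, 5xx) are a different problem than "not found"
```

Calling trace_exists("http://tempo.monitoring:3200", extract_trace_id(line)) with a log line copied from Loki should then return False for the IDs that produce the 404 above, and True for an ID copied from the Tempo search results.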
Expected behavior
I would expect the trace IDs listed in the server's info messages to point to traces that exist in Tempo.
However, I am new to tracing (and to the Prometheus/Grafana stack), so it is also possible that I have misconfigured something. I think it is a bug because some traces are visible in Tempo, but the trace ID from the info log cannot be found.
