Description
System Info
The HF TGI server is running on Kubernetes; I executed text-generation-launcher --env inside the pod:
2023-07-12T12:58:48.739266Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.70.0
Commit sha: 31b36cca21fcd0e6b7db477a7545063e1b860156
Docker label: sha-31b36cc
nvidia-smi:
Wed Jul 12 12:58:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000001:00:00.0 Off | 0 |
| N/A 35C P0 71W / 300W | 46936MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-07-12T12:58:48.739312Z INFO text_generation_launcher: Args { model_id: "OpenAssistant/falcon-40b-sft-mix-1226", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), dtype: None, trust_remote_code: false, max_concurrent_requests: 512, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "production-hf-text-generation-inference-6594cb8f5d-z4mdf", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: Some("tempo.monitoring:4317"), cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: true }
Model being used:
{
"model_id": "OpenAssistant/falcon-40b-sft-mix-1226",
"model_sha": "9ac6b7846fabe144646213cf1c6ee048b88272a7",
"model_dtype": "torch.float16",
"model_device_type": "cuda",
"model_pipeline_tag": "text-generation",
"max_concurrent_requests": 512,
"max_best_of": 2,
"max_stop_sequences": 4,
"max_input_length": 1024,
"max_total_tokens": 2048,
"waiting_served_ratio": 1.2,
"max_batch_total_tokens": 16000,
"max_waiting_tokens": 20,
"validation_workers": 2,
"version": "0.9.1",
"sha": "31b36cca21fcd0e6b7db477a7545063e1b860156",
"docker_label": "sha-31b36cc"
}
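For reference, the JSON above is what the TGI /info endpoint returns; a minimal sketch for retrieving it from inside the cluster (the port-forward target and the local port 8080 are just examples):

```sh
# In one terminal: forward the TGI pod's port 80 to localhost:8080
# (pod name taken from the launcher output above; local port is arbitrary).
kubectl -n hf-tgi port-forward pod/production-hf-text-generation-inference-6594cb8f5d-z4mdf 8080:80

# In a second terminal: query the /info endpoint.
curl -s http://localhost:8080/info
```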
Hardware used (GPUs, how many, on which cloud) (nvidia-smi): see the nvidia-smi output above; the server runs on an Azure Kubernetes Service cluster, VM spec Standard_NC24ads_A100_v4
Deployment specificities (Kubernetes, EKS, AKS, any particular deployments): runs on AKS and is installed through a Helm chart
The current version being used: 0.9.1
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
I installed the HF TGI server, Prometheus, Grafana, Loki, and Grafana Tempo on Kubernetes. The latter four are in namespace monitoring and the HF TGI server is in namespace hf-tgi. HF TGI is created with the following environment variable set: OTLP_ENDPOINT: "tempo.monitoring:4317", i.e. it references the service tempo in namespace monitoring on port 4317. The service is up and running.
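As a sanity check, the OTLP endpoint can be verified to be reachable from the hf-tgi namespace and the variable can be confirmed inside the container; a minimal sketch (the nicolaka/netshoot debug image and the pod name are just examples):

```sh
# Check that the Tempo OTLP port is reachable from the hf-tgi namespace
# (uses the nicolaka/netshoot debug image as an example).
kubectl -n hf-tgi run otlp-check --rm -it --restart=Never \
  --image=nicolaka/netshoot -- nc -zv tempo.monitoring 4317

# Confirm the environment variable is actually set inside the TGI pod.
kubectl -n hf-tgi exec production-hf-text-generation-inference-6594cb8f5d-z4mdf \
  -- env | grep OTLP
```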
So far this works fine: in Grafana under "Explore", I can select "Tempo", click on "Search", and run the query. It then finds a lot of traces, mostly from target /Health, sometimes /Decode.
Now, when I go to "Explore", select "Loki", and query the logs from the HF TGI pod, I can see the info messages just as they appear on stdout on the server itself. Each message contains a JSON entry called "spans[0].trace_id". When I take the value from that field and search for it under "Explore" -> "Tempo" -> TraceQL, I get an error message that the trace was not found:
failed to get trace with id: XXXX Status: 404 Not Found Body: trace not found
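The same lookup can be reproduced against Tempo's HTTP API directly, bypassing the Grafana datasource; a sketch assuming Tempo's query frontend listens on its default HTTP port 3200 (replace TRACE_ID with the value taken from the Loki log line):

```sh
# In one terminal: forward the Tempo query frontend (default HTTP port 3200).
kubectl -n monitoring port-forward svc/tempo 3200:3200

# In a second terminal: ask Tempo for the trace by ID directly.
curl -s "http://localhost:3200/api/traces/TRACE_ID"
```

If this direct query also returns 404, the trace never reached Tempo, which would rule out the Grafana datasource configuration as the cause.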
Expected behavior
My expected behavior would be that trace IDs listed in the info messages on the server point to traces that actually exist in Tempo.
However, I am new to tracing (and to the Prometheus/Grafana/Loki/Tempo stack), so my question is also whether I am misconfiguring something here. I think it is a bug because I can see some traces, but the trace ID from the info log cannot be found.