Retry NVIDIA containers without display capability #4006

Open
peterschmidt85 wants to merge 1 commit into master from codex/nvidia-modeset-fallback
peterschmidt85 commented Jul 4, 2026

Contributor

Fixes #4004.

Currently, dstack-shim requests the following capability set for NVIDIA containers through Docker's container.DeviceRequest.Capabilities:

gpu,utility,compute,graphics,video,display,compat32

This breaks on headless NVIDIA hosts, where CUDA works but /dev/nvidia-modeset is absent. Requesting display makes the NVIDIA Container Runtime fail before the user command starts:

nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory

The fix keeps the current full capability set on the first start attempt. If an NVIDIA container fails specifically because /dev/nvidia-modeset is missing, the shim removes the failed container and retries with the same set minus display:

gpu,utility,compute,graphics,video,compat32
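
The fallback described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `isMissingModesetError`, `withoutCapability`, `startWithFallback`, and `headlessStart` are hypothetical names, matching on the error message text is an assumption, and the Docker calls are replaced by plain function arguments so the example runs without a Docker daemon.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fullCapabilities mirrors the capability set quoted in the description.
var fullCapabilities = []string{"gpu", "utility", "compute", "graphics", "video", "display", "compat32"}

// isMissingModesetError reports whether err looks like the runtime failure
// caused by an absent /dev/nvidia-modeset. Matching on the message text is
// an assumption of this sketch.
func isMissingModesetError(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "/dev/nvidia-modeset") &&
		strings.Contains(msg, "no such file or directory")
}

// withoutCapability returns a copy of caps with every occurrence of drop removed.
func withoutCapability(caps []string, drop string) []string {
	out := make([]string, 0, len(caps))
	for _, c := range caps {
		if c != drop {
			out = append(out, c)
		}
	}
	return out
}

// startWithFallback tries the full capability set first. Only when the first
// attempt fails with the modeset error does it call cleanup (standing in for
// removing the failed container) and retry once without "display".
func startWithFallback(start func(caps []string) error, cleanup func() error) error {
	err := start(fullCapabilities)
	if !isMissingModesetError(err) {
		return err // success, or an unrelated failure that should not be retried
	}
	if cerr := cleanup(); cerr != nil {
		return fmt.Errorf("remove failed container: %w", cerr)
	}
	return start(withoutCapability(fullCapabilities, "display"))
}

// headlessStart is a hypothetical stand-in for starting a container on a
// host without /dev/nvidia-modeset: it fails whenever "display" is requested.
func headlessStart(caps []string) error {
	for _, c := range caps {
		if c == "display" {
			return errors.New("nvidia-container-cli: mount error: stat failed: /dev/nvidia-modeset: no such file or directory")
		}
	}
	fmt.Println("started with", strings.Join(caps, ","))
	return nil
}

func main() {
	err := startWithFallback(headlessStart, func() error { return nil })
	fmt.Println("error:", err)
}
```

Retrying only on this one error keeps hosts that do have /dev/nvidia-modeset on the full capability set, and lets unrelated start failures surface unchanged.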

NVIDIA documents display as X11 display support. The documented default driver capabilities are utility,compute: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/docker-specialized.html

The fallback applies only to the normal NVIDIA Docker DeviceRequest path. AMD, Tenstorrent, Intel, and explicit GPUDevices handling are unchanged.

AI Assistance: This PR was prepared with AI assistance.

Development

Successfully merging this pull request may close these issues.

[Bug]: NVIDIA tasks fail on headless hosts without /dev/nvidia-modeset
