RTX 5070 (GB205): Xid 32 → Xid 56 → "Wait for channel idle timed out" → full hard-freeze under sustained display+compute load — 595.71.05 open module, X11, Resizable BAR enabled #1179

@lucianoengel

Summary

On an RTX 5070 (GB205) running the 595.71.05 open kernel modules on X11, the GPU's display path intermittently faults with NVRM Xid 32 and/or Xid 56 originating from desktop graphics clients (browser, gnome-shell). On 2026-06-04 the fault escalated into a complete, unrecoverable system freeze requiring a hard power-cycle (no clean shutdown, no kernel oops). The compute path (CUDA, via Ollama) was active throughout and never faulted.

This appears to be the same display-engine instability class as forum thread 365287 and issues #1132 / #1134, but with a configuration and trigger that none of those cover.

Environment

GPU: NVIDIA GeForce RTX 5070 (GB205), VBIOS 98.05.36.00.83, PCI 0000:01:00.0
Driver: 595.71.05 open kernel modules (Ubuntu nvidia-driver-595-open 595.71.05-0ubuntu0.24.04.1, DKMS)
OS: Ubuntu 24.04.4 LTS
Kernel: 6.17.0-35-generic
CPU / Board: AMD Ryzen 9 7950X / Gigabyte B650M AORUS ELITE AX, BIOS F21
Display server: X11 (Xorg + GNOME), nvidia_drm.modeset=1
Resizable BAR: Enabled (BAR0 64M, BAR1 16G, BAR3 32M)
GSP firmware: Enabled (NVreg_EnableGpuFirmware)
Power mgmt: PreserveVideoMemoryAllocations=1; nvidia-persistenced active; nvidia-suspend/resume/hibernate.service all enabled
Concurrent compute: Ollama (CUDA) serving embeddings continuously on the same GPU

Symptom / cascade

The GPU drives both the desktop and a CUDA compute workload (Ollama). All faults originate from display/graphics clients; the compute workload never appears in any Xid.

Event A — 2026-06-04 — fatal (hard freeze)

09:33:49  kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=13307, name=brave, channel 0x00000004 intr0 00040000
09:33:49  kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=13307, name=brave, channel 0x00000004 intr0 00040000
09:33:50  brave: ERROR ...shared_context_state.cc:1317] SharedContextState context lost via EXT_robustness.
                 Reset status = GL_GUILTY_CONTEXT_RESET_KHR
09:33:50  brave: GPU process exited unexpectedly: exit_code=8704
          ... ~6 min of normal operation ...
09:39:59  kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=9397, name=gnome-shell, channel 0x00000007 intr 02000000
09:40:04  gdm-x-session: (WW) NVIDIA: Wait for channel idle timed out.
09:40:11  gdm-x-session: (WW) NVIDIA: Wait for channel idle timed out.
09:40:52  <last log line; machine frozen — required hard power-cycle>

The first Xid 32 (browser channel) was survivable — Chromium reset its GPU context and continued. The second Xid 32, on gnome-shell's channel, was not: the X driver then blocked on "Wait for channel idle timed out" and the entire machine became unresponsive within ~50 s. The system journal ends mid-stream with no shutdown sequence, and /proc/sys/kernel/tainted shows only out-of-tree+unsigned module bits (no TAINT_DIE) — i.e. a hard hang, not a logged panic.
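
For anyone wanting to repeat the taint check, here is a minimal sketch of decoding the value in /proc/sys/kernel/tainted. The bit positions follow the kernel's tainted-kernels documentation; the helper names `decode_taint` and `read_taint` are made up for this example. The report does not quote the raw number, so the 12288 used below is only what "out-of-tree + unsigned module" would add up to (4096 + 8192), not a value copied from the machine.

```python
# Subset of kernel taint bits; see the kernel's tainted-kernels documentation
# for the full table. Only the bits relevant to this report are listed.
TAINT_BITS = {
    0: "P: proprietary module loaded",
    7: "D: kernel died recently (OOPS or BUG)",
    9: "W: kernel issued a warning",
    12: "O: out-of-tree module loaded",
    13: "E: unsigned module loaded",
    14: "L: soft lockup occurred",
}


def decode_taint(value):
    """Return descriptions of the known taint bits set in `value`."""
    return [desc for bit, desc in sorted(TAINT_BITS.items()) if value & (1 << bit)]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint value (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())


if __name__ == "__main__":
    # Assumed value: out-of-tree (bit 12) + unsigned module (bit 13).
    for desc in decode_taint(12288):
        print(desc)
```

If bit 7 (D) were set, `decode_taint` would also list the "kernel died recently" entry, which is the case the report says was not observed.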

Event B — 2026-06-01 — non-fatal (recovered), same week

09:54:19  kernel: NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000007 00000404 ffffffff 00000004 00800000
09:54:19  kernel: NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000007 00000000 00000000 00000001 00800000
09:54:24  gdm-x-session: (WW) NVIDIA: Wait for channel idle timed out.

Here the display engine threw Xid 56 (the CMDre 00000007 fingerprint matches forum 365287's fatal event exactly) plus the same channel-idle timeout, but the X driver recovered and the session stayed up. This suggests the same underlying display-engine fault can either recover (Xid 56, Event B) or wedge the whole machine (Xid 32 on the compositor channel, Event A).
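
To compare events like A and B across boots, the Xid lines can be pulled out of the journal mechanically. The following is a rough sketch that handles only the two line shapes quoted in this report (the `pid=..., name=...` form and the `CMDre ...` form); the regexes and the function names `parse_xid_lines` and `summarize` are assumptions for illustration, not an NVIDIA-documented log format.

```python
import re
from collections import Counter

# Matches e.g. "NVRM: Xid (PCI:0000:01:00): 32, pid=13307, name=brave, ..."
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<rest>.*)"
)
PID_NAME_RE = re.compile(r"pid=(?P<pid>\d+), name=(?P<name>[^,]+)")


def parse_xid_lines(lines):
    """Extract Xid number, PCI address and (if present) pid/name from log lines."""
    events = []
    for line in lines:
        m = XID_RE.search(line)
        if not m:
            continue  # not an Xid line (e.g. the X driver's idle-timeout warning)
        event = {"pci": m["pci"], "xid": int(m["xid"]),
                 "pid": None, "name": None, "detail": m["rest"].strip()}
        pn = PID_NAME_RE.search(m["rest"])
        if pn:
            event["pid"] = int(pn["pid"])
            event["name"] = pn["name"].strip()
        events.append(event)
    return events


def summarize(events):
    """Count events per (xid, client name); name is None for display-engine Xids."""
    return Counter((e["xid"], e["name"]) for e in events)


# Lines copied from Events A and B above.
SAMPLE = [
    "09:33:49  kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=13307, name=brave, channel 0x00000004 intr0 00040000",
    "09:39:59  kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=9397, name=gnome-shell, channel 0x00000007 intr 02000000",
    "09:40:04  gdm-x-session: (WW) NVIDIA: Wait for channel idle timed out.",
    "09:54:19  kernel: NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000007 00000404 ffffffff 00000004 00800000",
]

if __name__ == "__main__":
    for key, count in summarize(parse_xid_lines(SAMPLE)).items():
        print(key, count)
```

Grouping by client name makes it easier to see whether a given Xid landed on the browser's channel (survivable in Event A) or on the compositor's.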

Ruled out

  • S3 suspend/resume corruption (the 365287 root cause): nvidia-suspend/resume/hibernate are enabled and no suspend occurred before the crash — uptime was continuous under load.
  • Thermal / power: no HW/SW thermal or power-brake slowdown; idle/light temps (~58 °C), 39 W of 250 W.
  • CPU MCE: none.
  • OOM: no oom-killer events.
  • PCIe: no AER / corrected errors / link drops for 0000:01:00.
  • VRAM/ECC: consumer card (ECC N/A); no remapped rows / channel-repair pending.
  • Kernel panic: none recorded (taint = out-of-tree+unsigned module only).
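
The log-based items in the list above (MCE, OOM, PCIe AER) can be re-checked with a simple scan of the previous boot's kernel log. This is only a sketch: the patterns are illustrative guesses at common kernel message wording, which varies between kernel versions, and the name `scan_kernel_log` is hypothetical. An empty result is a hint that nothing was logged, not proof that nothing happened.

```python
import re
import sys

# Illustrative patterns; exact kernel wording differs across versions.
CHECKS = {
    "mce": re.compile(r"mce: \[Hardware Error\]|Machine check", re.I),
    "oom": re.compile(r"invoked oom-killer|Out of memory:", re.I),
    "pcie_aer": re.compile(r"\bAER:", re.I),
}


def scan_kernel_log(lines):
    """Return {check name: [matching lines]} for the given log lines."""
    hits = {name: [] for name in CHECKS}
    for line in lines:
        for name, pattern in CHECKS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Example: journalctl -k -b -1 -o cat | python3 scan.py
    for name, matched in scan_kernel_log(sys.stdin).items():
        print(f"{name}: {len(matched)} matching line(s)")
```

The thermal, power and VRAM items come from GPU telemetry rather than the kernel log, so they are outside what this scan covers.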

What seems to trigger it

Sustained mixed load: the 5070 simultaneously drives an X11 desktop with many GPU-accelerated clients (Chromium-based browser, gnome-shell, Electron apps) and a continuous CUDA workload (Ollama embeddings). Faults appear under this concurrency without any suspend, gaming/Proton, or BAR-undersize condition.

Request

I can attach a full nvidia-bug-report.sh log and complete journalctl -b -1 from the crashed boot on request.
