Summary
On an RTX 5070 (GB205) running the 595.71.05 open kernel modules on X11, the GPU's display path intermittently faults with NVRM Xid 32 and/or Xid 56 originating from desktop graphics clients (browser, gnome-shell). On 2026-06-04 the fault escalated into a complete, unrecoverable system freeze requiring a hard power-cycle (no clean shutdown, no kernel oops). The compute path (CUDA, via Ollama) was active throughout and never faulted.
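This appears to belong to the same display-engine instability class as forum thread 365287 and issues #1132 / #1134, but with a configuration and trigger that none of those cover:
- #1132 / #1134: those reports involve Resizable BAR disabled or BAR1 undersized (the __nv_drm_gem_nvkms_map range crossing the BAR1↔BAR3 boundary). Here Resizable BAR is enabled (BAR1 = 16 GiB), so the BAR-overflow mechanism does not apply.
- Forum 365287: same Xid 56, CMDre 00000007 fingerprint, but that report's root cause was GPU state lost across an S3 resume with nvidia-suspend/nvidia-resume disabled. Here those services are enabled and no suspend/resume occurred before the crash; the machine was awake under continuous load.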
Environment
| Item | Value |
|---|---|
| GPU | NVIDIA GeForce RTX 5070 (GB205), VBIOS 98.05.36.00.83, PCI 0000:01:00.0 |
| Driver | 595.71.05 open kernel modules (Ubuntu nvidia-driver-595-open 595.71.05-0ubuntu0.24.04.1, DKMS) |
| OS | Ubuntu 24.04.4 LTS |
| Kernel | 6.17.0-35-generic |
| CPU / Board | AMD Ryzen 9 7950X / Gigabyte B650M AORUS ELITE AX, BIOS F21 |
| Display server | X11 (Xorg + GNOME), nvidia_drm.modeset=1 |
| Resizable BAR | Enabled: BAR0 64M, BAR1 16G, BAR3 32M |
| GSP firmware | Enabled (NVreg_EnableGpuFirmware) |
| Power mgmt | PreserveVideoMemoryAllocations=1; nvidia-persistenced active; nvidia-suspend/resume/hibernate.service all enabled |
| Concurrent compute | Ollama (CUDA) serving embeddings continuously on the same GPU |
Symptom / cascade
The GPU drives both the desktop and a CUDA compute workload (Ollama). All faults originate from display/graphics clients; the compute workload never appears in any Xid.
Event A — 2026-06-04 — fatal (hard freeze)
09:33:49 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=13307, name=brave, channel 0x00000004 intr0 00040000
09:33:49 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=13307, name=brave, channel 0x00000004 intr0 00040000
09:33:50 brave: ERROR ...shared_context_state.cc:1317] SharedContextState context lost via EXT_robustness.
Reset status = GL_GUILTY_CONTEXT_RESET_KHR
09:33:50 brave: GPU process exited unexpectedly: exit_code=8704
... ~6 min of normal operation ...
09:39:59 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=9397, name=gnome-shell, channel 0x00000007 intr 02000000
09:40:04 gdm-x-session: (WW) NVIDIA: Wait for channel idle timed out.
09:40:11 gdm-x-session: (WW) NVIDIA: Wait for channel idle timed out.
09:40:52 <last log line; machine frozen — required hard power-cycle>
The first Xid 32 (browser channel) was survivable — Chromium reset its GPU context and continued. The second Xid 32, on gnome-shell's channel, was not: the X driver then blocked on "Wait for channel idle timed out" and the entire machine became unresponsive within ~50 s. The system journal ends mid-stream with no shutdown sequence, and /proc/sys/kernel/tainted shows only out-of-tree+unsigned module bits (no TAINT_DIE) — i.e. a hard hang, not a logged panic.
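For anyone wanting to repeat the taint check, here is a minimal sketch of decoding the /proc/sys/kernel/tainted bitmask. The bit positions follow the kernel's tainted-kernels documentation; the helper names and the example value 12288 (bits 12 and 13 only) are illustrative assumptions, not values copied from this machine.

```python
# Sketch: decode the kernel taint bitmask (subset of flags relevant here).
# Bit positions per the kernel's "Tainted kernels" documentation.
TAINT_FLAGS = {
    0: "P: proprietary module loaded",
    7: "D: kernel died (oops or BUG)",
    9: "W: kernel issued a warning",
    12: "O: out-of-tree module loaded",
    13: "E: unsigned module loaded",
    14: "L: soft lockup occurred",
}


def decode_taint(value: int) -> list[str]:
    """Return descriptions of the known taint bits set in value, lowest bit first."""
    return [desc for bit, desc in sorted(TAINT_FLAGS.items()) if (value >> bit) & 1]


def died(value: int) -> bool:
    """True if the D bit (TAINT_DIE, bit 7) is set, i.e. an oops or BUG was logged."""
    return bool((value >> 7) & 1)


def read_taint(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the taint value of the currently running kernel."""
    with open(path) as f:
        return int(f.read().strip())


# Illustrative value: out-of-tree (4096) + unsigned (8192) module bits only.
EXAMPLE = (1 << 12) | (1 << 13)
```

With only the O and E bits set, `died(EXAMPLE)` is false, which is the "no TAINT_DIE" condition described above.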
Event B — 2026-06-01 — non-fatal (recovered), same week
09:54:19 kernel: NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000007 00000404 ffffffff 00000004 00800000
09:54:19 kernel: NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000007 00000000 00000000 00000001 00800000
09:54:24 gdm-x-session: (WW) NVIDIA: Wait for channel idle timed out.
Here the display engine threw Xid 56 (the CMDre 00000007 fingerprint matches forum 365287's fatal event exactly) plus the same channel-idle timeout, but the X driver recovered and the session stayed up. This suggests the same underlying display-engine fault can either recover (Xid 56, Event B) or wedge the whole machine (Xid 32 on the compositor channel, Event A).
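To make the two events easier to compare, the following sketch pulls the Xid number and, where present, the originating pid and process name out of NVRM log lines. The regular expressions are an assumption based only on the line shapes quoted in this report, not on a documented log format; `parse_xid` and `SAMPLE` are hypothetical names.

```python
import re

# Matches the "NVRM: Xid (PCI:...): <number>, <rest>" shape seen in this report.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<rest>.*)"
)
# Only client-attributed Xids (Xid 32 here) carry pid/name fields.
CLIENT_RE = re.compile(r"pid=(?P<pid>\d+), name=(?P<name>[^,]+)")


def parse_xid(line: str):
    """Return a dict for an NVRM Xid line, or None if the line is not one."""
    m = XID_RE.search(line)
    if m is None:
        return None
    record = {"pci": m["pci"], "xid": int(m["xid"]), "pid": None, "name": None}
    client = CLIENT_RE.search(m["rest"])
    if client is not None:
        record["pid"] = int(client["pid"])
        record["name"] = client["name"]
    return record


# Lines copied from Events A and B above.
SAMPLE = [
    "09:33:49 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=13307, name=brave, channel 0x00000004 intr0 00040000",
    "09:39:59 kernel: NVRM: Xid (PCI:0000:01:00): 32, pid=9397, name=gnome-shell, channel 0x00000007 intr 02000000",
    "09:54:19 kernel: NVRM: Xid (PCI:0000:01:00): 56, CMDre 00000007 00000404 ffffffff 00000004 00800000",
    "09:40:04 gdm-x-session: (WW) NVIDIA: Wait for channel idle timed out.",
]

records = [parse_xid(line) for line in SAMPLE]
```

On these samples the Xid 32 lines resolve to brave and gnome-shell, while the Xid 56 line has no client attribution and the gdm-x-session warning is not an Xid line at all.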
Ruled out
- S3 suspend/resume corruption (the 365287 root cause): nvidia-suspend/resume/hibernate.service are enabled and no suspend occurred before the crash; uptime was continuous under load.
- Thermal / power: no HW/SW thermal or power-brake slowdown; idle/light temps (~58 °C), 39 W of 250 W.
- CPU MCE: none.
- OOM: no oom-killer events.
- PCIe: no AER / corrected errors / link drops for 0000:01:00.
- VRAM/ECC: consumer card (ECC N/A); no remapped rows / channel-repair pending.
- Kernel panic: none recorded (taint = out-of-tree+unsigned module only).
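Several of the checks above can be repeated against a saved kernel log with a small scan like the one below. This is a rough sketch: the signature strings are assumptions about typical kernel message wording (they may vary between kernel versions), and `SIGNATURES` and `scan` are hypothetical names. It does not cover the thermal, power, or row-remap checks, which come from other tools.

```python
import re

# Assumed message signatures; adjust to the kernel in use.
SIGNATURES = {
    "xid": re.compile(r"NVRM: Xid"),
    "mce": re.compile(r"mce: \[Hardware Error\]"),
    "oom": re.compile(r"invoked oom-killer|Out of memory: Killed process"),
    "aer": re.compile(r"\bAER:"),
}


def scan(lines):
    """Count how many lines match each signature."""
    counts = {key: 0 for key in SIGNATURES}
    for line in lines:
        for key, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[key] += 1
    return counts
```

One way to use it, assuming the crashed boot's kernel messages were saved to a file first (for example with journalctl -k -b -1), is `scan(open("crashed-boot.log"))`, where the file name is a placeholder. A result with a nonzero "xid" count and zeros elsewhere would match the picture described in the list above.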
What seems to trigger it
Sustained mixed load: the 5070 simultaneously drives an X11 desktop with many GPU-accelerated clients (Chromium-based browser, gnome-shell, Electron apps) and a continuous CUDA workload (Ollama embeddings). Faults appear under this concurrency without any suspend, gaming/Proton, or BAR-undersize condition.
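Request
- Please confirm whether this Xid 32 → Xid 56 (CMDre 00000007) → channel-idle timeout display-engine failure on GB205 / 595.71.05 is tracked, and whether it is distinct from the BAR-undersize (#1132 / #1134) and S3-resume (forum 365287) cases, given that Resizable BAR is enabled and no suspend is involved here.
- nvidia-bug-report.sh can't run post-hang; I'm happy to capture it live at the first Xid if there's a recommended trigger.
I can attach a full nvidia-bug-report.sh log and complete journalctl -b -1 from the crashed boot on request.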
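Because a hard hang leaves no opportunity to collect diagnostics afterwards, one option I could run is a watcher that triggers nvidia-bug-report.sh as soon as the first Xid appears. The sketch below is my own assumption of how that might look, not an NVIDIA-recommended procedure: it follows the kernel journal with standard journalctl options and runs the report script with no arguments. The function names are hypothetical, and the script needs root.

```python
import subprocess

XID_MARKER = "NVRM: Xid"


def first_xid(lines):
    """Return the first line containing an NVRM Xid, or None if there is none."""
    for line in lines:
        if XID_MARKER in line:
            return line.rstrip("\n")
    return None


def watch_and_capture(workdir: str = "."):
    """Follow the kernel journal and run nvidia-bug-report.sh at the first Xid."""
    # -k: kernel messages only, -f: follow, -n 0: skip the existing backlog.
    journal = subprocess.Popen(
        ["journalctl", "-k", "-f", "-n", "0", "-o", "short-precise"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        hit = first_xid(journal.stdout)  # blocks until an Xid line arrives
        if hit is not None:
            print("first Xid seen:", hit)
            # Run with no options; the report archive is written into workdir.
            subprocess.run(["nvidia-bug-report.sh"], cwd=workdir, check=False)
    finally:
        journal.terminate()
    return hit
```

Calling `watch_and_capture()` from a root shell at session start would leave a report taken while the GPU is still responsive, for example after a survivable browser-channel Xid 32 and before a later compositor-channel fault. Whether the script can complete in that window is untested.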