Full system hard-lock on BAR1 VA-space exhaustion (Turing, 256 MiB BAR1) from a browser WebGL workload — RC watchdog "GPU is probably locked", 595.71.05 #1187

@Virgil-Bulens

Summary

A userspace GL/WebGL client (a Chromium-based browser rendering a WebGL-heavy page) can exhaust the GPU's 256 MiB BAR1 aperture. The driver then emits a continuous flood of dmaAllocMapping_GM107: can't alloc VA space for mapping / NV_ERR_NO_MEMORY messages, followed by krcWatchdog: GPU is probably locked!, and the entire machine hard-locks: no clean shutdown, no SysRq response, and a power cycle is required.

Expected behavior: BAR1/VA-space exhaustion should surface to the client as an allocation failure (the renderer/tab dies, which is what happens on the first occurrence), without locking the GPU engine or hanging the host kernel.

Observed behavior: when the workload is repeated, the driver fails to contain the exhaustion, and the GPU and host deadlock.

This is reproducible and not load-spike related — it builds over a few minutes while the page is open.

Environment

  • GPU: NVIDIA GeForce RTX 2070 SUPER (Turing, TU104)
  • VBIOS: 90.04.95.00.58
  • BAR1: 256 MiB (Resizable BAR not supported on Turing, so this is fixed)
  • Driver: 595.71.05 (open kernel modules). Also reproduced on 580.159.03.
  • Kernel: 6.17.0-35-generic, Ubuntu 24.04 (x86_64)
  • CPU/board: AMD Ryzen 7 3700X, Gigabyte (AMD platform)

Reproduction

  1. Open a WebGL/canvas-heavy site (in our case ui.com / UniFi UI) in a GPU-accelerated Chromium-based browser.
  2. Leave it rendering for ~1–7 minutes.
  3. BAR1 VA space is exhausted; the kernel log fills with can't alloc VA space messages; the renderer crashes once ("Aw, Snap").
  4. Reload the page → GPU RC watchdog reports the GPU locked → full system hang (hard reset required).

Reproduced 3/3 times. Disabling browser GPU acceleration avoids it, which is consistent with the BAR1 mapping path being the trigger.
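To make the build-up in step 2 visible, BAR1 usage can be sampled while the page is open. The following is a minimal sketch, not part of the original report: it assumes that nvidia-smi -q -d MEMORY prints a "BAR1 Memory Usage" section with Total/Used/Free lines in MiB, and the helper names parse_bar1 and watch are hypothetical.

```python
import subprocess
import time
import re

# Matches lines such as "        Used        : 5 MiB" in the MEMORY report.
_FIELD = re.compile(r"^\s*(Total|Used|Free)\s*:\s*(\d+)\s*MiB\s*$")


def parse_bar1(report: str) -> dict:
    """Return BAR1 Total/Used/Free in MiB for the first GPU found in the report.

    Assumes the section layout printed by `nvidia-smi -q -d MEMORY`;
    returns an empty dict if no BAR1 section is present.
    """
    values = {}
    in_bar1 = False
    for line in report.splitlines():
        if "BAR1 Memory Usage" in line:
            in_bar1 = True
            continue
        if in_bar1:
            match = _FIELD.match(line)
            if match is None:
                break  # first non-field line ends the BAR1 section
            values[match.group(1)] = int(match.group(2))
    return values


def watch(interval_s: float = 5.0) -> None:
    """Print a timestamped BAR1 sample every interval_s seconds until interrupted."""
    while True:
        report = subprocess.run(
            ["nvidia-smi", "-q", "-d", "MEMORY"],
            capture_output=True, text=True, check=True,
        ).stdout
        print(time.strftime("%H:%M:%S"), parse_bar1(report), flush=True)
        time.sleep(interval_s)
```

Calling watch() from a second terminal with its output redirected to a file on disk should show the Used figure approaching the 256 MiB total before the first renderer crash. Because the machine hard-locks on the repeat, the last few samples may not reach disk.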

Key kernel log sequence (excerpt)

NVRM: dmaAllocMapping_GM107: can't alloc VA space for mapping.        (×hundreds, in bursts)
NVRM: nvAssertOkFailedNoLog: Assertion failed: Out of memory [NV_ERR_NO_MEMORY] (0x00000051)
      ... @ mapping_reuse.c:273
NVRM: ... @ kern_bus_gm107.c:3141
[drm] [nvidia-drm] [GPU ID 0x00000700] Failed to ioremap_wc NvKmsKapiMemory ...
[drm:__nv_drm_gem_nvkms_map [nvidia_drm]] *ERROR* Failed to map NvKmsKapiMemory ...
NVRM: krcWatchdog_IMPL: RC watchdog: GPU is probably locked!  Notify Timeout Seconds: 7
NVRM: nvAssertFailedNoLog: Assertion failed: GPPut < WATCHDOG_GPFIFO_ENTRIES @ kernel_rc_watchdog.c:1549

(Full curated kernel sequence and nvidia-bug-report.log.gz available on request / attached.)
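For anyone comparing their own logs against this sequence, a small script can count each signature and record where it first appears. This is a sketch only: the substrings are copied from the excerpt above, while the labels and the summarize helper are our own hypothetical naming, not anything defined by the driver.

```python
from collections import Counter

# Substrings taken from the excerpt above; the labels are illustrative names.
SIGNATURES = {
    "va_space_exhausted": "dmaAllocMapping_GM107: can't alloc VA space",
    "no_memory_assert": "NV_ERR_NO_MEMORY",
    "ioremap_failed": "Failed to ioremap_wc NvKmsKapiMemory",
    "gem_map_failed": "Failed to map NvKmsKapiMemory",
    "watchdog_locked": "RC watchdog: GPU is probably locked",
    "watchdog_gpput_assert": "GPPut < WATCHDOG_GPFIFO_ENTRIES",
}


def summarize(log_text: str):
    """Return (counts, first_seen) for each signature found in log_text.

    counts maps label -> number of matching lines;
    first_seen maps label -> 1-based line number of the first match.
    """
    counts = Counter()
    first_seen = {}
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
                first_seen.setdefault(label, lineno)
    return counts, first_seen
```

Since the host needs a power cycle, the relevant messages are in the previous boot's kernel log; assuming a persistent journal, the output of journalctl -k -b -1 can be passed to summarize. The first_seen ordering then shows whether the VA-space failures precede the watchdog message, as they do in the excerpt.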

Notes / impact

  • The single-renderer-crash path works (allocation failure is returned). The escalation to a GPU engine lock + unrecoverable host hang on repeat is the bug.
  • On a small-BAR1 (256 MiB) Turing part with no Resizable BAR, this aperture is easy for a modern WebGL workload to exhaust, so robust handling of BAR1 exhaustion matters here.
