Description
Per https://github.com/google/gvisor/blob/master/g3doc/proposals/nvidia_driver_proxy.md, nvproxy is currently incompatible with platform/kvm for two reasons:
1. platform/kvm configures all guest page table entries to enable write-back caching, which is not generally correct for device mappings. On Intel CPUs, KVM mostly mitigates this by mapping uncacheable and write-combining pages as UC in the EPT [1]; the hardware takes roughly the intersection of the EPT and guest page table memory types [2]. On AMD CPUs, the hardware has the same capability [3], but KVM does not configure it [4], so guest page table entries must set the memory type correctly to obtain correct behavior. I don't think plumbing memory type through the sentry should be too difficult, but matching the kernel driver's memory types will be tricky; we will probably have to approximate conservatively in some cases, which may degrade performance.
2. Mappings of any given file offset in /dev/nvidia-uvm must be at the matching virtual address. In KVM, providing mappings of /dev/nvidia-uvm to the guest requires mapping it into the KVM-using process' (the sentry's) address space and then forwarding it into the guest using a KVM memslot. Thus any UVM mapping at an offset that conflicts with an existing sentry mapping is unimplementable. AFAIK, this only affects CUDA - Vulkan and NVENC/DEC do not use UVM - so this should no longer block use of nvproxy with platform/kvm in all cases, but it is still a problem for CUDA.
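To make the second constraint concrete, here is a minimal Go sketch of the forwarding step. The helper name, the parameter plumbing, and the assumption that guest physical addresses mirror sentry virtual addresses (as I understand platform/kvm's physical map) are all illustrative rather than actual gVisor code; only the KVM_SET_USER_MEMORY_REGION ioctl and its struct come from the Linux KVM UAPI.

```go
// Hypothetical sketch; not actual gVisor code.
package uvmsketch

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// userspaceMemoryRegion mirrors struct kvm_userspace_memory_region from the
// Linux KVM UAPI (include/uapi/linux/kvm.h).
type userspaceMemoryRegion struct {
	slot          uint32
	flags         uint32
	guestPhysAddr uint64
	memorySize    uint64
	userspaceAddr uint64
}

// kvmSetUserMemoryRegion is _IOW(KVMIO, 0x46, struct kvm_userspace_memory_region).
const kvmSetUserMemoryRegion = 0x4020ae46

// forwardToGuest registers an existing sentry mapping of /dev/nvidia-uvm as a
// KVM memslot so the guest can use it. sentryAddr is the sentry virtual
// address of that mapping, which - because of the "offset must equal virtual
// address" rule - also has to equal the UVM file offset; if an unrelated
// sentry mapping already occupies that address, this slot cannot be provided.
func forwardToGuest(vmFD int, slot uint32, sentryAddr, length uint64) error {
	region := userspaceMemoryRegion{
		slot:          slot,
		guestPhysAddr: sentryAddr, // assumes guest PA mirrors sentry VA (IIUC, as in platform/kvm)
		memorySize:    length,
		userspaceAddr: sentryAddr,
	}
	_, _, errno := unix.Syscall(unix.SYS_IOCTL, uintptr(vmFD),
		uintptr(kvmSetUserMemoryRegion), uintptr(unsafe.Pointer(&region)))
	if errno != 0 {
		return errno
	}
	return nil
}
```

A real implementation would also need slot flags, slot deletion, and error handling for overlapping slots, which are omitted here.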
Experimentally:

- CUDA-using binaries unconditionally map 2 MB of /dev/nvidia-uvm at a fixed address (in cuda_malloc_test this happens to be 0x205000000, but I'm unsure if this is consistent between binaries). This happens to work (at least in my minimal testing) if nvproxy.uvmFDMemmapFile.MapInternal() is changed to attempt a mapping using MAP_FIXED_NOREPLACE.
- cudaMallocManaged() reserves application address space using mmap(MAP_PRIVATE|MAP_ANONYMOUS) and then overwrites the reservation mapping with a MAP_FIXED mapping of /dev/nvidia-uvm, which sometimes collides with an existing sentry mapping, depending on ASLR for both the sentry and the application.
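For illustration, the reserve-then-overwrite pattern in the second bullet looks roughly like the sketch below, transliterated into Go (the CUDA runtime of course does this in C inside the application; the function, its parameters, the PROT_NONE reservation protection, and MAP_SHARED for the device mapping are guesses). The point is that the kernel, subject to mmap ASLR, chooses where the reservation lands, and MAP_FIXED then forces the UVM mapping to that address, which may already be occupied on the sentry side.

```go
// Hypothetical sketch; not what the CUDA runtime literally executes.
package uvmsketch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// reserveThenMapUVM mimics the cudaMallocManaged() pattern described above:
// reserve address space anonymously, then replace the reservation with a
// MAP_FIXED mapping of the UVM fd at the same address.
func reserveThenMapUVM(uvmFD int, length uintptr) (uintptr, error) {
	// Step 1: anonymous reservation. addr hint is 0, so the kernel (and
	// therefore ASLR) decides where this lands; the application has no say.
	resAddr, _, errno := unix.Syscall6(unix.SYS_MMAP,
		0, length,
		uintptr(unix.PROT_NONE), // reservation protection is a guess
		uintptr(unix.MAP_PRIVATE|unix.MAP_ANONYMOUS),
		^uintptr(0), 0)
	if errno != 0 {
		return 0, fmt.Errorf("reservation mmap failed: %v", errno)
	}

	// Step 2: overwrite the reservation with the device mapping. MAP_FIXED
	// silently replaces whatever is mapped there in *this* process; but when
	// nvproxy has to mirror the mapping into the sentry's address space, the
	// same address may already be taken by an unrelated sentry mapping.
	_, _, errno = unix.Syscall6(unix.SYS_MMAP,
		resAddr, length,
		uintptr(unix.PROT_READ|unix.PROT_WRITE),
		uintptr(unix.MAP_SHARED|unix.MAP_FIXED), // MAP_SHARED assumed
		uintptr(uvmFD),
		resAddr) // file offset == virtual address, per the UVM rule above
	if errno != 0 {
		return 0, fmt.Errorf("MAP_FIXED mmap of the UVM fd failed: %v", errno)
	}
	return resAddr, nil
}
```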
Options:

1. Keep the existing implementation of uvmFDMemmapFile.MapInternal(), which will unconditionally cause CUDA binaries to fail on platform/kvm.
2. Try to use MAP_FIXED_NOREPLACE for the sentry mapping (sketched after this list). This should cause CUDA binaries that don't use cudaMallocManaged() to succeed (good), but will cause CUDA binaries that do to flake (bad). This is probably unacceptable for use cases that run arbitrary GPU workloads, but might be fine for others; AFAIK, use of cudaMallocManaged() is uncommon for performance reasons.
3. Use MAP_FIXED_NOREPLACE, and use a custom ELF loader to load the sentry in an address range that is less likely to collide with application addresses. I'm not sure this would actually work, since IIUC the interfering sentry mappings are from runtime mmaps.
4. Use MAP_FIXED_NOREPLACE, and also try to avoid returning application mappings that would collide with existing sentry mappings. This is racy and leaks information about the sentry to applications, so it's probably undesirable.
5. Fail at startup if platform/kvm is in use, nvproxy is enabled, and NVIDIA_DRIVER_CAPABILITIES contains compute. I mention this option mostly to rule it out: some containers in practice specify NVIDIA_DRIVER_CAPABILITIES=all even if only e.g. graphics support is required, and in fact NVIDIA's Vulkan support requires libnvidia-gpucomp.so, which libnvidia-container only mounts when --compute is specified [5], so compute ends up being necessary even for graphics-only workloads.
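As a concrete version of option 2 (and of the MapInternal() experiment in the first bullet under "Experimentally"), the sentry-side mapping attempt might look like the sketch below. This is not the actual nvproxy code; the helper and its parameters are placeholders, and only the mmap flags and the EEXIST behavior of MAP_FIXED_NOREPLACE come from the kernel's documented semantics.

```go
// Hypothetical sketch; not the actual nvproxy.uvmFDMemmapFile.MapInternal().
package uvmsketch

import "golang.org/x/sys/unix"

// mapUVMNoReplace tries to establish the sentry's internal mapping of
// /dev/nvidia-uvm at exactly the required address (equal to the file offset)
// without clobbering anything already mapped there.
func mapUVMNoReplace(uvmFD int, offset, length uintptr) (uintptr, error) {
	addr, _, errno := unix.Syscall6(unix.SYS_MMAP,
		offset, // required address: must equal the file offset
		length,
		uintptr(unix.PROT_READ|unix.PROT_WRITE),
		uintptr(unix.MAP_SHARED|unix.MAP_FIXED_NOREPLACE),
		uintptr(uvmFD),
		offset)
	if errno == unix.EEXIST {
		// Some sentry mapping already occupies [offset, offset+length). This
		// is the collision that makes cudaMallocManaged() workloads flake.
		return 0, errno
	}
	if errno != 0 {
		return 0, errno
	}
	return addr, nil
}
```

If the EEXIST case only arises for cudaMallocManaged()'s ASLR-dependent addresses, this matches the "most CUDA binaries succeed, managed-memory binaries flake" behavior described in option 2.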
Is this feature related to a specific bug?

No response
Do you have a specific solution in mind?
No response
Footnotes
1. Linux: arch/x86/kvm/mmu/spte.c:make_spte() => kvm_is_mmio_pfn(), arch/x86/kvm/vmx/vmx.c:vmx_get_mt_mask()
2. Intel 64 and IA-32 Software Developer's Manual, Vol. 3, Sec. 30.3.7.2, "Memory Type Used for Translated Guest-Physical Addresses"
3. AMD64 Architecture Programmer's Manual, Vol. 2, Sec. 15.25.8, "Combining Memory Types, MTRRs"
4. SVM does not set shadow_memtype_mask or implement the get_mt_mask static call; see also fc07e76ac7ff ('Revert "KVM: SVM: use NPT page attributes"')
5. https://github.com/NVIDIA/libnvidia-container/blob/95d3e86522976061e856724867ebcaf75c4e9b60/src/nvc_info.c#L85