Description
Per https://github.com/google/gvisor/blob/master/g3doc/proposals/nvidia_driver_proxy.md, nvproxy is currently incompatible with platform/kvm for two reasons:
1. platform/kvm configures all guest page table entries to enable write-back caching, which is not generally correct for device mappings. On Intel CPUs, KVM mostly mitigates this by mapping uncacheable and write-combining pages as UC in the EPT [1]; the hardware takes roughly the intersection of the EPT and guest page table memory types [2]. On AMD CPUs, the hardware has the same capability [3], but KVM does not configure it [4], so guest page table entries must set the memory type correctly to obtain correct behavior. I don't think plumbing memory type through the sentry should be too difficult, but matching the kernel driver's memory types will be tricky; we will probably have to approximate conservatively in some cases, which may degrade performance.
2. Mappings of any given file offset in /dev/nvidia-uvm must be at the matching virtual address. In KVM, providing mappings of /dev/nvidia-uvm to the guest requires mapping it into the KVM-using process' (the sentry's) address space and then forwarding it into the guest using a KVM memslot. Thus any UVM mapping at an offset that conflicts with an existing sentry mapping is unimplementable. AFAIK, this only affects CUDA - Vulkan and NVENC/DEC do not use UVM - so this should no longer block use of nvproxy with platform/kvm in all cases, but it is still a problem for CUDA.
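To make the second constraint concrete, here is a minimal Go sketch of the forwarding step. The helper name, the parameter plumbing, and the assumption that guest physical addresses mirror sentry virtual addresses (as I understand platform/kvm's physical map) are all illustrative rather than actual gVisor code; only the KVM_SET_USER_MEMORY_REGION ioctl and its struct come from the Linux KVM UAPI.

```go
// Hypothetical sketch; not actual gVisor code.
package uvmsketch

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

// userspaceMemoryRegion mirrors struct kvm_userspace_memory_region from the
// Linux KVM UAPI (include/uapi/linux/kvm.h).
type userspaceMemoryRegion struct {
	slot          uint32
	flags         uint32
	guestPhysAddr uint64
	memorySize    uint64
	userspaceAddr uint64
}

// kvmSetUserMemoryRegion is _IOW(KVMIO, 0x46, struct kvm_userspace_memory_region).
const kvmSetUserMemoryRegion = 0x4020ae46

// forwardToGuest registers an existing sentry mapping of /dev/nvidia-uvm as a
// KVM memslot so the guest can use it. sentryAddr is the sentry virtual
// address of that mapping, which - because of the "offset must equal virtual
// address" rule - also has to equal the UVM file offset; if an unrelated
// sentry mapping already occupies that address, this slot cannot be provided.
func forwardToGuest(vmFD int, slot uint32, sentryAddr, length uint64) error {
	region := userspaceMemoryRegion{
		slot:          slot,
		guestPhysAddr: sentryAddr, // assumes guest PA mirrors sentry VA (IIUC, as in platform/kvm)
		memorySize:    length,
		userspaceAddr: sentryAddr,
	}
	_, _, errno := unix.Syscall(unix.SYS_IOCTL, uintptr(vmFD),
		uintptr(kvmSetUserMemoryRegion), uintptr(unsafe.Pointer(&region)))
	if errno != 0 {
		return errno
	}
	return nil
}
```

A real implementation would also need slot flags, slot deletion, and error handling for overlapping slots, which are omitted here.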
Experimentally:

- CUDA-using binaries unconditionally map 2 MB of /dev/nvidia-uvm at a fixed address (in cuda_malloc_test this happens to be 0x205000000, but I'm unsure if this is consistent between binaries). This happens to work (at least in my minimal testing) if nvproxy.uvmFDMemmapFile.MapInternal() is changed to attempt a mapping using MAP_FIXED_NOREPLACE.
- cudaMallocManaged() reserves application address space using mmap(MAP_PRIVATE|MAP_ANONYMOUS) and then overwrites the reservation mapping with a MAP_FIXED mapping of /dev/nvidia-uvm, which sometimes collides with an existing sentry mapping, depending on ASLR for both the sentry and the application.
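For illustration, the reserve-then-overwrite pattern in the second bullet looks roughly like the sketch below, transliterated into Go (the CUDA runtime of course does this in C inside the application; the function, its parameters, the PROT_NONE reservation protection, and MAP_SHARED for the device mapping are guesses). The point is that the kernel, subject to mmap ASLR, chooses where the reservation lands, and MAP_FIXED then forces the UVM mapping to that address, which may already be occupied on the sentry side.

```go
// Hypothetical sketch; not what the CUDA runtime literally executes.
package uvmsketch

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// reserveThenMapUVM mimics the cudaMallocManaged() pattern described above:
// reserve address space anonymously, then replace the reservation with a
// MAP_FIXED mapping of the UVM fd at the same address.
func reserveThenMapUVM(uvmFD int, length uintptr) (uintptr, error) {
	// Step 1: anonymous reservation. addr hint is 0, so the kernel (and
	// therefore ASLR) decides where this lands; the application has no say.
	resAddr, _, errno := unix.Syscall6(unix.SYS_MMAP,
		0, length,
		uintptr(unix.PROT_NONE), // reservation protection is a guess
		uintptr(unix.MAP_PRIVATE|unix.MAP_ANONYMOUS),
		^uintptr(0), 0)
	if errno != 0 {
		return 0, fmt.Errorf("reservation mmap failed: %v", errno)
	}

	// Step 2: overwrite the reservation with the device mapping. MAP_FIXED
	// silently replaces whatever is mapped there in *this* process; but when
	// nvproxy has to mirror the mapping into the sentry's address space, the
	// same address may already be taken by an unrelated sentry mapping.
	_, _, errno = unix.Syscall6(unix.SYS_MMAP,
		resAddr, length,
		uintptr(unix.PROT_READ|unix.PROT_WRITE),
		uintptr(unix.MAP_SHARED|unix.MAP_FIXED), // MAP_SHARED assumed
		uintptr(uvmFD),
		resAddr) // file offset == virtual address, per the UVM rule above
	if errno != 0 {
		return 0, fmt.Errorf("MAP_FIXED mmap of the UVM fd failed: %v", errno)
	}
	return resAddr, nil
}
```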
Options:

1. Keep the existing implementation of uvmFDMemmapFile.MapInternal(), which will unconditionally cause CUDA binaries to fail on platform/kvm.
2. Try to use MAP_FIXED_NOREPLACE for the sentry mapping (sketched after this list). This should cause CUDA binaries that don't use cudaMallocManaged() to succeed (good), but will cause CUDA binaries that do to flake (bad). This is probably unacceptable for use cases that run arbitrary GPU workloads, but might be fine for others; AFAIK, use of cudaMallocManaged() is uncommon for performance reasons.
3. Use MAP_FIXED_NOREPLACE, and use a custom ELF loader to load the sentry in an address range that is less likely to collide with application addresses. I'm not sure this would actually work, since IIUC the interfering sentry mappings are from runtime mmaps.
4. Use MAP_FIXED_NOREPLACE, and also try to avoid returning application mappings that would collide with existing sentry mappings. This is racy and leaks information about the sentry to applications, so it's probably undesirable.
5. Fail at startup if platform/kvm is in use, nvproxy is enabled, and NVIDIA_DRIVER_CAPABILITIES contains compute. I mention this option mostly to rule it out: some containers in practice specify NVIDIA_DRIVER_CAPABILITIES=all even if only e.g. graphics support is required, and in fact NVIDIA's Vulkan support requires libnvidia-gpucomp.so, which libnvidia-container only mounts when --compute is specified [5], so compute ends up being necessary even for graphics-only workloads.
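As a concrete version of option 2 (and of the MapInternal() experiment in the first bullet under "Experimentally"), the sentry-side mapping attempt might look like the sketch below. This is not the actual nvproxy code; the helper and its parameters are placeholders, and only the mmap flags and the EEXIST behavior of MAP_FIXED_NOREPLACE come from the kernel's documented semantics.

```go
// Hypothetical sketch; not the actual nvproxy.uvmFDMemmapFile.MapInternal().
package uvmsketch

import "golang.org/x/sys/unix"

// mapUVMNoReplace tries to establish the sentry's internal mapping of
// /dev/nvidia-uvm at exactly the required address (equal to the file offset)
// without clobbering anything already mapped there.
func mapUVMNoReplace(uvmFD int, offset, length uintptr) (uintptr, error) {
	addr, _, errno := unix.Syscall6(unix.SYS_MMAP,
		offset, // required address: must equal the file offset
		length,
		uintptr(unix.PROT_READ|unix.PROT_WRITE),
		uintptr(unix.MAP_SHARED|unix.MAP_FIXED_NOREPLACE),
		uintptr(uvmFD),
		offset)
	if errno == unix.EEXIST {
		// Some sentry mapping already occupies [offset, offset+length). This
		// is the collision that makes cudaMallocManaged() workloads flake.
		return 0, errno
	}
	if errno != 0 {
		return 0, errno
	}
	return addr, nil
}
```

If the EEXIST case only arises for cudaMallocManaged()'s ASLR-dependent addresses, this matches the "most CUDA binaries succeed, managed-memory binaries flake" behavior described in option 2.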
Is this feature related to a specific bug?

No response
Do you have a specific solution in mind?
No response
Footnotes
1. Linux: arch/x86/kvm/mmu/spte.c:make_spte() => kvm_is_mmio_pfn(), arch/x86/kvm/vmx/vmx.c:vmx_get_mt_mask()
2. Intel 64 and IA-32 Software Developer's Manual, Vol. 3, Sec. 30.3.7.2, "Memory Type Used for Translated Guest-Physical Addresses"
3. AMD64 Architecture Programmer's Manual, Vol. 2, Sec. 15.25.8, "Combining Memory Types, MTRRs"
4. SVM does not set shadow_memtype_mask or implement the get_mt_mask static call; see also fc07e76ac7ff ('Revert "KVM: SVM: use NPT page attributes"')
5. https://github.com/NVIDIA/libnvidia-container/blob/95d3e86522976061e856724867ebcaf75c4e9b60/src/nvc_info.c#L85