Commit 44e35bf

ankita-nv authored and gregkh committed
vfio/nvgrace-gpu: Expose the blackwell device PF BAR1 to the VM
[ Upstream commit 6a9eb2d ]

There is a HW defect on Grace Hopper (GH) in supporting the Multi-Instance GPU (MIG) feature [1] that necessitated the presence of a 1G region carved out from the device memory and mapped as uncached. The 1G region is shown as a fake BAR (comprising region 2 and 3) to work around the issue.

The Grace Blackwell systems (GB) differ from GH systems in the following aspects:
1. The aforementioned HW defect is fixed on GB systems.
2. There is a usable BAR1 (region 2 and 3) on GB systems for the GPUdirect RDMA feature [2].

This patch accommodates those GB changes by showing the 64b physical device BAR1 (region 2 and 3) to the VM instead of the fake one. This takes care of both the differences. Moreover, the entire device memory is exposed on GB as cacheable to the VM as there is no carveout required.

Link: https://www.nvidia.com/en-in/technologies/multi-instance-gpu/ [1]
Link: https://docs.nvidia.com/cuda/gpudirect-rdma/ [2]

Cc: Kevin Tian <[email protected]>
CC: Jason Gunthorpe <[email protected]>
Suggested-by: Alex Williamson <[email protected]>
Signed-off-by: Ankit Agrawal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
1 parent 18457b6 commit 44e35bf

drivers/vfio/pci/nvgrace-gpu/main.c

Lines changed: 45 additions & 22 deletions
@@ -17,9 +17,6 @@
 #define RESMEM_REGION_INDEX VFIO_PCI_BAR2_REGION_INDEX
 #define USEMEM_REGION_INDEX VFIO_PCI_BAR4_REGION_INDEX
 
-/* Memory size expected as non cached and reserved by the VM driver */
-#define RESMEM_SIZE SZ_1G
-
 /* A hardwired and constant ABI value between the GPU FW and VFIO driver. */
 #define MEMBLK_SIZE SZ_512M
 
@@ -72,7 +69,7 @@ nvgrace_gpu_memregion(int index,
 	if (index == USEMEM_REGION_INDEX)
 		return &nvdev->usemem;
 
-	if (index == RESMEM_REGION_INDEX)
+	if (nvdev->resmem.memlength && index == RESMEM_REGION_INDEX)
 		return &nvdev->resmem;
 
 	return NULL;
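
The effect of the added `resmem.memlength` check can be pictured with a small userspace model. This is only a sketch under stated assumptions: `struct mem_region_model`, `struct nvdev_model`, and `memregion_model()` are hypothetical stand-ins rather than the driver's types, and the region index values 2 and 4 mirror the VFIO uapi enum. When `resmem.memlength` is 0 (the Grace Blackwell case), a lookup for region 2 returns NULL, which presumably lets the caller fall through to handling the physical BAR instead of the fake one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for VFIO_PCI_BAR2_REGION_INDEX / VFIO_PCI_BAR4_REGION_INDEX. */
#define RESMEM_REGION_INDEX 2
#define USEMEM_REGION_INDEX 4

/* Hypothetical, simplified model of a device memory region. */
struct mem_region_model {
	uint64_t memphys;
	uint64_t memlength;
};

/* Hypothetical, simplified model of the per-device state. */
struct nvdev_model {
	struct mem_region_model usemem;
	struct mem_region_model resmem;
};

/*
 * Mirrors the lookup in the hunk above: resmem is only reported when it
 * has a non-zero length, i.e. when the device memory was actually split.
 */
static struct mem_region_model *memregion_model(int index,
						struct nvdev_model *nvdev)
{
	assert(nvdev != NULL);

	if (index == USEMEM_REGION_INDEX)
		return &nvdev->usemem;

	if (nvdev->resmem.memlength && index == RESMEM_REGION_INDEX)
		return &nvdev->resmem;

	/* No emulated region at this index. */
	return NULL;
}
```

With a model whose `resmem.memlength` is 0, `memregion_model(RESMEM_REGION_INDEX, ...)` yields NULL while the usemem lookup is unaffected; with a non-zero length it yields the resmem region as before the patch.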
@@ -757,40 +754,67 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
 			      u64 memphys, u64 memlength)
 {
 	int ret = 0;
+	u64 resmem_size = 0;
 
 	/*
-	 * The VM GPU device driver needs a non-cacheable region to support
-	 * the MIG feature. Since the device memory is mapped as NORMAL cached,
-	 * carve out a region from the end with a different NORMAL_NC
-	 * property (called as reserved memory and represented as resmem). This
-	 * region then is exposed as a 64b BAR (region 2 and 3) to the VM, while
-	 * exposing the rest (termed as usable memory and represented using usemem)
-	 * as cacheable 64b BAR (region 4 and 5).
+	 * On Grace Hopper systems, the VM GPU device driver needs a non-cacheable
+	 * region to support the MIG feature owing to a hardware bug. Since the
+	 * device memory is mapped as NORMAL cached, carve out a region from the end
+	 * with a different NORMAL_NC property (called as reserved memory and
+	 * represented as resmem). This region then is exposed as a 64b BAR
+	 * (region 2 and 3) to the VM, while exposing the rest (termed as usable
+	 * memory and represented using usemem) as cacheable 64b BAR (region 4 and 5).
 	 *
 	 *               devmem (memlength)
 	 * |-------------------------------------------------|
 	 * |                                           |
 	 * usemem.memphys                              resmem.memphys
+	 *
+	 * This hardware bug is fixed on the Grace Blackwell platforms and the
+	 * presence of the bug can be determined through nvdev->has_mig_hw_bug.
+	 * Thus on systems with the hardware fix, there is no need to partition
+	 * the GPU device memory and the entire memory is usable and mapped as
+	 * NORMAL cached (i.e. resmem size is 0).
 	 */
+	if (nvdev->has_mig_hw_bug)
+		resmem_size = SZ_1G;
+
 	nvdev->usemem.memphys = memphys;
 
 	/*
 	 * The device memory exposed to the VM is added to the kernel by the
-	 * VM driver module in chunks of memory block size. Only the usable
-	 * memory (usemem) is added to the kernel for usage by the VM
-	 * workloads. Make the usable memory size memblock aligned.
+	 * VM driver module in chunks of memory block size. Note that only the
+	 * usable memory (usemem) is added to the kernel for usage by the VM
+	 * workloads.
 	 */
-	if (check_sub_overflow(memlength, RESMEM_SIZE,
+	if (check_sub_overflow(memlength, resmem_size,
 			       &nvdev->usemem.memlength)) {
 		ret = -EOVERFLOW;
 		goto done;
 	}
 
 	/*
-	 * The USEMEM part of the device memory has to be MEMBLK_SIZE
-	 * aligned. This is a hardwired ABI value between the GPU FW and
-	 * VFIO driver. The VM device driver is also aware of it and make
-	 * use of the value for its calculation to determine USEMEM size.
+	 * The usemem region is exposed as a 64B Bar composed of region 4 and 5.
+	 * Calculate and save the BAR size for the region.
+	 */
+	nvdev->usemem.bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
+
+	/*
+	 * If the hardware has the fix for MIG, there is no requirement
+	 * for splitting the device memory to create RESMEM. The entire
+	 * device memory is usable and will be USEMEM. Return here for
+	 * such case.
+	 */
+	if (!nvdev->has_mig_hw_bug)
+		goto done;
+
+	/*
+	 * When the device memory is split to workaround the MIG bug on
+	 * Grace Hopper, the USEMEM part of the device memory has to be
+	 * MEMBLK_SIZE aligned. This is a hardwired ABI value between the
+	 * GPU FW and VFIO driver. The VM device driver is also aware of it
+	 * and make use of the value for its calculation to determine USEMEM
+	 * size. Note that the device memory may not be 512M aligned.
 	 */
 	nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
 					     MEMBLK_SIZE);
@@ -809,10 +833,9 @@ nvgrace_gpu_init_nvdev_struct(struct pci_dev *pdev,
 	}
 
 	/*
-	 * The memory regions are exposed as BARs. Calculate and save
-	 * the BAR size for them.
+	 * The resmem region is exposed as a 64b BAR composed of region 2 and 3
+	 * for Grace Hopper. Calculate and save the BAR size for the region.
 	 */
-	nvdev->usemem.bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
 	nvdev->resmem.bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
 done:
 	return ret;
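
The size arithmetic in `nvgrace_gpu_init_nvdev_struct()` can be followed with a standalone userspace sketch. It is an illustration under stated assumptions, not the driver code: `struct region_model`, `struct layout_model`, `roundup_pow2()`, and `split_devmem()` are hypothetical names, plain comparisons replace the kernel's `check_sub_overflow()`, and the zero-length guards are additions of the sketch. The lines between the last two hunks are not shown in this diff, so the sketch assumes that resmem receives whatever remains after the 512M-aligned usemem, as the comment's diagram suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_512M     (UINT64_C(1) << 29)
#define SZ_1G       (UINT64_C(1) << 30)
#define MEMBLK_SIZE SZ_512M

/* Hypothetical, simplified model of one exposed memory region. */
struct region_model {
	uint64_t memphys;
	uint64_t memlength;
	uint64_t bar_size;
};

struct layout_model {
	struct region_model usemem;
	struct region_model resmem;
};

/* Smallest power of two >= n, valid for 0 < n <= 2^63. */
static uint64_t roundup_pow2(uint64_t n)
{
	uint64_t p = 1;

	assert(n != 0 && n <= (UINT64_C(1) << 63));
	while (p < n)
		p <<= 1;
	return p;
}

/* Returns 0 on success, -1 if the memory is too small to lay out. */
static int split_devmem(uint64_t memphys, uint64_t memlength,
			bool has_mig_hw_bug, struct layout_model *out)
{
	/* 1G carve-out only when the MIG hardware bug is present. */
	uint64_t resmem_size = has_mig_hw_bug ? SZ_1G : 0;

	*out = (struct layout_model){0};

	/* Sketch-only guard standing in for the overflow check. */
	if (memlength <= resmem_size)
		return -1;

	out->usemem.memphys = memphys;
	out->usemem.memlength = memlength - resmem_size;

	/* As in the patch, the usemem BAR size is taken before alignment. */
	out->usemem.bar_size = roundup_pow2(out->usemem.memlength);

	/* Hardware fix present: everything is usemem, resmem stays empty. */
	if (!has_mig_hw_bug)
		return 0;

	/* Round usemem down to a MEMBLK_SIZE multiple. */
	out->usemem.memlength -= out->usemem.memlength % MEMBLK_SIZE;
	if (out->usemem.memlength == 0)
		return -1;

	/* ASSUMPTION: resmem is the remainder at the end of device memory. */
	out->resmem.memphys = memphys + out->usemem.memlength;
	out->resmem.memlength = memlength - out->usemem.memlength;
	out->resmem.bar_size = roundup_pow2(out->resmem.memlength);
	return 0;
}
```

For a hypothetical 96 GiB device without the bug, the sketch reports a 96 GiB usemem behind a 128 GiB BAR and an empty resmem. With the bug and a length of 96 GiB + 256 MiB, usemem is rounded down to 95 GiB and, under the assumption above, resmem becomes 1 GiB + 256 MiB behind a 2 GiB BAR.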
