InferX Snapshot and restore configuration
An InferX snapshot is a memory snapshot of an inference container. It includes 3 major parts:
- GPU data: This is the GPU memory snapshot. For the vLLM inference framework, it includes both the model data and the allocated KV cache vRAM. Because vLLM fills the available vRAM with KV cache, the GPU data size is close to the full GPU vRAM size minus the CUDA context overhead. In the restore phase, the data is copied back to the GPU through a Host-to-Device (H2D) memory copy, so its restore latency is bounded by the H2D PCIe bus bandwidth. For example, an H100 80GB GPU has 80GB of vRAM and about 50GB/s effective PCIe Gen5 bandwidth (64GB/s in theory), so its GPU data restore latency is about 80 GB / 50 GB/s ≈ 1.6 sec (see the latency sketch after this list).
- Pageable data: This is the inference container's OS memory snapshot. Because this data is consumed by the Linux OS through the mmap system call, it can be restored either with a file-backed mmap followed by on-demand loading, or by copying it fully into memory and then mmap-ing it with mlock.
- Pinned data: This is the CPU memory allocated with cudaHostAlloc. Because it is pinned memory, it cannot be restored with a file-backed mmap and on-demand loading; it must be copied into memory in full.
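The restore latency of each part can be estimated from its snapshot size and the effective bandwidth of the path used to restore it. The sketch below is a rough back-of-envelope estimate; apart from the H100 H2D figure quoted above, the sizes and bandwidth numbers are illustrative assumptions, not measured values.

```python
# Back-of-envelope restore latency estimate.
# Apart from the H100 H2D figure, all sizes and bandwidths below are
# illustrative assumptions, not measured values.

def restore_latency_sec(size_gb: float, bandwidth_gb_per_sec: float) -> float:
    """Time to move size_gb of snapshot data over a path with the given
    effective bandwidth (GB/s)."""
    return size_gb / bandwidth_gb_per_sec

# GPU data: bounded by the Host-to-Device PCIe copy (H100 example from the text).
gpu_data_gb = 80.0        # near the full vRAM size, since vLLM fills it with KV cache
h2d_gb_per_sec = 50.0     # effective PCIe Gen5 bandwidth (64 GB/s theoretical)
print(f"GPU data:          {restore_latency_sec(gpu_data_gb, h2d_gb_per_sec):.2f} s")

# Pageable + pinned data: bounded by the read bandwidth of the chosen store.
cpu_snapshot_gb = 20.0    # hypothetical container CPU memory snapshot size
store_gb_per_sec = {      # assumed effective read bandwidths per store
    "Mem": 25.0,          # already resident in host memory after pre-warm
    "File": 6.0,          # Gen4 SSD via GPU Direct Storage (figure from the text)
    "Blob": 50.0,         # multiple NVMe SSDs aggregated through SPDK (assumed)
}
for store, bw in store_gb_per_sec.items():
    print(f"CPU data via {store:<5}: {restore_latency_sec(cpu_snapshot_gb, bw):.2f} s")
```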
To minimize the snapshot restore latency, InferX needs to load the snapshot from a high-bandwidth store. So far there are 3 options:
- Memory: The data is stored in a Linux OS file. In the container restore pre-warm phase, the data is loaded from the OS file into memory. During restore, the GPU data is copied directly from memory to the GPU, the pageable data is mmap-ed with mlock, and the pinned data is registered with the GPU. This option has the lowest restore latency, but it consumes host memory, so it has low deployment density and a longer pre-warm latency.
- File: The data is stored in a Linux OS file. During restore, InferX copies it to GPU memory or CPU memory. Because there is limited bandwidth for reading a Linux OS file, this leads to high restore latency. InferX uses GPU Direct Storage to restore the data, which doubles the file read bandwidth over the Linux OS file system: for a Gen4 SSD, GPU Direct Storage provides about 6GB/s of bandwidth while the Linux OS file system only reaches about 3GB/s.
- Blob: This is a virtual RAID file system built on multiple NVMe SSDs. It is based on the SPDK disk driver and bypasses the Linux kernel file system driver. To minimize the restore latency, InferX relies on multiple SSDs to provide at least the aggregate bandwidth of the GPUs. For example, a system with 2 H100 GPUs, each on a PCIe Gen5 interface, needs at least 8 Gen5 SSDs or 16 Gen4 SSDs (see the sizing sketch after this list).
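The SSD count for the Blob store follows from matching the aggregate SSD read bandwidth to the GPUs' aggregate H2D bandwidth. A minimal sizing sketch, assuming effective per-SSD read bandwidths of about 12.5GB/s for Gen5 and 6.25GB/s for Gen4 (assumed figures, not taken from InferX documentation):

```python
import math

# Size the Blob store so its aggregate read bandwidth can feed every GPU at
# full Host-to-Device speed. Per-SSD figures are assumptions, not InferX specs.

def ssds_needed(num_gpus: int, per_gpu_h2d_gb_per_sec: float,
                per_ssd_gb_per_sec: float) -> int:
    """Minimum SSD count whose combined read bandwidth matches the GPUs'
    combined Host-to-Device bandwidth."""
    return math.ceil(num_gpus * per_gpu_h2d_gb_per_sec / per_ssd_gb_per_sec)

num_gpus = 2              # e.g. 2x H100, each on a PCIe Gen5 interface
per_gpu_h2d = 50.0        # effective GB/s per GPU (64 GB/s theoretical)

print("Gen5 SSDs needed:", ssds_needed(num_gpus, per_gpu_h2d, 12.5))   # -> 8
print("Gen4 SSDs needed:", ssds_needed(num_gpus, per_gpu_h2d, 6.25))   # -> 16
```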
Users can configure the restore through the "standby" section of the model configuration file, as below.
"standby": {
"gpu": "Mem",
"pageable": "File",
"pinned": "Blob"
}
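The snippet below is an illustrative helper (not part of the InferX codebase) that builds a standby section programmatically, checks that each field names one of the three supported stores, and prints it as JSON in the shape shown above.

```python
import json

# Illustrative helper (not part of InferX) that builds the "standby" section
# of a model configuration and validates the store choices.

VALID_STORES = {"Mem", "File", "Blob"}

def make_standby(gpu: str, pageable: str, pinned: str) -> dict:
    """Return a standby section, checking that every field names a supported
    snapshot store."""
    standby = {"gpu": gpu, "pageable": pageable, "pinned": pinned}
    for field, store in standby.items():
        if store not in VALID_STORES:
            raise ValueError(f"{field}: unsupported store {store!r}, "
                             f"expected one of {sorted(VALID_STORES)}")
    return {"standby": standby}

# Example: GPU data restored from memory, pageable data from file, pinned data from the blob store.
print(json.dumps(make_standby(gpu="Mem", pageable="File", pinned="Blob"), indent=2))
```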
Because the Blob store needs more SSDs and SPDK, users can enable/disable it through "enableBlobStore" in the node configuration. In the inferx Makefile https://github.com/inferx-net/inferx/blob/main/Makefile, users have 2 options to start/stop the inferx service:
- Without Blob: make run/make stop
- With Blob: make runblob/make stopblob