# InferX: Advanced GPU-Based Serverless Inference Platform
InferX is an Inference Function as a Service (FaaS) platform engineered to optimize GPU-based serverless inference. It addresses critical challenges in GPU utilization and multitenant resource sharing through the following mechanisms:
- On-Demand GPU Provisioning: InferX dynamically allocates GPU resources upon incoming requests, during failovers, or when scaling out, eliminating the need for pre-provisioned GPUs.
- Ultra-Fast Cold Start: InferX achieves cold start times of under 5 seconds; demonstrations have shown cold starts of under 2 seconds for a 12B-parameter model spanning two GPUs.
- Cost Efficiency: Optimized scheduling drives GPU utilization rates of up to 90%, reducing inference costs by up to 80%.
- Advanced Isolation Mechanisms: Beyond traditional CPU-side isolation such as cgroups and Virtual Private Clouds (VPCs), InferX implements GPU-specific isolation techniques, including virtual GPU memory (vRAM) isolation and NCCL (NVIDIA Collective Communications Library) isolation (see the sketch after this list).
- Performance Integrity: These isolation strategies ensure that one tenant's workload does not interfere with another's, maintaining consistent performance across multitenant environments.
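
A common building block for this kind of per-tenant GPU partitioning is restricting each worker process to an explicit device set before any CUDA context is created. The sketch below is illustrative only and does not reflect InferX internals, which are not detailed here; `tenant_gpu_map` and `worker.py` are hypothetical names used for the example.

```python
import os
import subprocess

# Hypothetical per-tenant GPU assignment; InferX's real scheduler is not shown here.
tenant_gpu_map = {
    "tenant-a": "0",    # tenant A may only see GPU 0
    "tenant-b": "1,2",  # tenant B may only see GPUs 1 and 2
}

def launch_tenant_worker(tenant_id: str, model_path: str) -> subprocess.Popen:
    """Start an inference worker whose CUDA runtime can only enumerate
    the GPUs assigned to this tenant."""
    env = os.environ.copy()
    # CUDA_VISIBLE_DEVICES must be set before the worker creates its CUDA
    # context; the process then cannot enumerate any other tenant's GPUs.
    env["CUDA_VISIBLE_DEVICES"] = tenant_gpu_map[tenant_id]
    return subprocess.Popen(
        ["python", "worker.py", "--model", model_path],  # worker.py is hypothetical
        env=env,
    )
```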
By integrating these features, InferX sets a new standard in GPU-based serverless inference, delivering high performance, secure multitenancy, and cost-effective AI services.
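
To make the FaaS workflow concrete, here is a minimal client-side sketch of what invoking a serverless inference function could look like under this model. The endpoint URL, request fields, and response schema are assumptions for illustration, not InferX's documented API.

```python
import requests

# Hypothetical endpoint; InferX's actual API surface is not documented here.
INFERX_URL = "https://inferx.example.com/v1/functions/my-model/invoke"

def infer(prompt: str) -> str:
    """Send one inference request; in a FaaS model like this, the platform
    cold-starts a GPU instance on demand if no warm instance is available."""
    resp = requests.post(INFERX_URL, json={"prompt": prompt, "max_tokens": 128})
    resp.raise_for_status()
    return resp.json()["text"]

if __name__ == "__main__":
    print(infer("Explain serverless GPU inference in one sentence."))
```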