
InferX: Advanced GPU-Based Serverless Inference Platform


InferX is an inference Function-as-a-Service (FaaS) platform engineered to optimize GPU-based serverless inference. It addresses two critical challenges: GPU utilization and multitenant resource sharing.

Key Features

1. High GPU Utilization (90%+)

  • On-Demand GPU Provisioning: InferX allocates GPU resources dynamically when a request arrives, a failover occurs, or the service scales out, eliminating the need for pre-provisioned GPUs.

  • Ultra-Fast Cold Start: InferX keeps cold start times under 5 seconds; demonstrations have shown cold starts under 2 seconds for a 12B-parameter model spanning two GPUs (see the timing sketch after this list).

  • Cost Efficiency: By driving GPU utilization up to 90%, InferX can reduce inference costs by up to 80%.
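
The cold start overhead can be observed from the client side with any HTTP client. The sketch below is a minimal illustration, assuming a hypothetical endpoint URL and model name (`INFERX_URL` and `llama-12b` are placeholders, not part of InferX's documented API): the first request after an idle period triggers on-demand GPU provisioning (a cold start), the second hits the already-warm instance, and the difference approximates the cold start cost.

```python
import time
import requests

# Placeholders for illustration -- not InferX's documented endpoint or model names.
INFERX_URL = "http://localhost:4000/v1/completions"
PAYLOAD = {"model": "llama-12b", "prompt": "Hello", "max_tokens": 16}

def timed_request() -> float:
    """Send one inference request and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    resp = requests.post(INFERX_URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

# First request after idle forces on-demand GPU provisioning (cold start).
cold = timed_request()
# Second request reuses the now-warm instance.
warm = timed_request()

print(f"cold start: {cold:.2f}s, warm: {warm:.2f}s, overhead: {cold - warm:.2f}s")
```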

2. Multitenant GPU Sharing

  • Advanced Isolation Mechanisms: Beyond traditional CPU-oriented isolation methods such as cgroups and Virtual Private Clouds (VPCs), InferX implements GPU-specific isolation techniques, including virtual GPU memory (vRAM) isolation and NCCL (NVIDIA Collective Communications Library) isolation (a configuration sketch follows this list).

  • Performance Integrity: These isolation strategies ensure that one tenant's workload does not interfere with another's, maintaining consistent performance across multitenant environments.
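
As a rough illustration of how per-tenant vRAM budgets might be expressed and checked, the sketch below defines a tenant spec and rejects placements that would oversubscribe a GPU. All names here (`TenantSpec`, `vram_gb`, the model names) are assumptions for illustration only, not InferX's actual configuration schema.

```python
from dataclasses import dataclass

GPU_VRAM_GB = 80  # assumed device capacity, e.g. one 80 GB GPU

@dataclass
class TenantSpec:
    """Hypothetical per-tenant resource spec -- illustrative, not InferX's schema."""
    name: str
    model: str
    vram_gb: int  # hard vRAM budget the platform would enforce for this tenant
    gpus: int = 1

def validate(tenants: list[TenantSpec]) -> None:
    """Reject placements whose combined vRAM budgets exceed one GPU's capacity."""
    total = sum(t.vram_gb for t in tenants)
    if total > GPU_VRAM_GB:
        raise ValueError(
            f"vRAM oversubscribed: {total} GB requested, {GPU_VRAM_GB} GB available"
        )

# Two tenants sharing one GPU, each confined to its own vRAM budget.
validate([
    TenantSpec(name="tenant-a", model="llama-12b", vram_gb=40),
    TenantSpec(name="tenant-b", model="mistral-7b", vram_gb=24),
])
```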

By integrating these features, InferX sets a new standard in GPU-based serverless inference, delivering high performance, secure multitenancy, and cost-effective AI services.