InferX Demo Overview

The InferX platform showcases its capabilities through a comprehensive demonstration that highlights its GPU utilization efficiency and rapid model deployment.

Environment Setup

  • Machine Model: Dell Precision 7960 Tower
  • CPU: Intel Xeon W5-3423
  • Memory: 256 GB
  • GPUs: 2 × NVIDIA RTX A4000, each with 16 GB VRAM

Deployment Density

  • Model Deployment: Over 40 models are deployed on a single node.
  • GPU Requirements: Each model utilizes 1 or 2 GPUs. Conventionally, running these models with dedicated GPUs would require a total of 70 GPUs.
  • Achieved Density: With only 2 physical GPUs, InferX achieves a deployment density of 3,500% (70 GPUs' worth of models served by 2), far surpassing the roughly 80–90% density typical of traditional inference platforms; see the worked calculation after this list.
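
The 3,500% figure follows directly from the ratio of GPUs that dedicated deployment would require to GPUs actually present. A minimal back-of-the-envelope check, using only the numbers quoted above (none of this is InferX code):

```python
# Deployment density = GPUs a dedicated-GPU deployment would need,
# divided by GPUs physically available, expressed as a percentage.
# Figures are taken from the demo description above.

gpus_required_dedicated = 70  # 40+ models, each needing 1 or 2 dedicated GPUs
gpus_available = 2            # physical GPUs on the demo node

density = gpus_required_dedicated / gpus_available * 100
print(f"Deployment density: {density:.0f}%")  # -> Deployment density: 3500%
```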

Ultra-Low Cold Start Latency

  • Resource Management: The 40+ models share the node's 2 GPUs. When a request arrives for a specific model, InferX first checks for an existing warm instance; if none is available, it identifies idle GPUs and cold starts the model to serve the request (a sketch of this flow follows this list).
  • Performance: InferX can cold start a 12B-parameter model in under 2 seconds, ensuring rapid response times even when the model is not already resident on a GPU.
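
To make the dispatch flow concrete, here is a minimal sketch of the warm-instance check and cold-start path described above. All names (`Scheduler`, `Instance`, `dispatch`) are hypothetical illustrations, not InferX's actual API, and eviction of warm instances (which the demo description does not detail) is left unhandled:

```python
# Hypothetical sketch of the request-dispatch flow described above.
# None of these names come from InferX's codebase.

from dataclasses import dataclass, field


@dataclass
class Instance:
    model: str
    gpu_ids: list[int]


@dataclass
class Scheduler:
    total_gpus: int = 2
    warm: dict[str, Instance] = field(default_factory=dict)  # model -> warm instance
    busy_gpus: set[int] = field(default_factory=set)

    def dispatch(self, model: str, gpus_needed: int) -> Instance:
        # 1. Prefer an existing warm instance of the requested model.
        if model in self.warm:
            return self.warm[model]
        # 2. Otherwise, find idle GPUs and cold start the model onto them.
        idle = [g for g in range(self.total_gpus) if g not in self.busy_gpus]
        if len(idle) < gpus_needed:
            # Assumption: a real scheduler would evict a warm instance here.
            raise RuntimeError("no idle GPUs available")
        chosen = idle[:gpus_needed]
        self.busy_gpus.update(chosen)
        inst = Instance(model=model, gpu_ids=chosen)  # cold start happens here
        self.warm[model] = inst
        return inst
```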

This demonstration underscores InferX's ability to maximize GPU resource utilization and deliver swift inference services.