InferX Demo Overview
This demonstration showcases inferX's efficiency in GPU utilization and rapid model deployment on the following setup:
- Machine Model: Dell Precision 7960 Tower
- CPU: Intel Xeon W5-3423
- Memory: 256 GB
- GPUs: 2 × NVIDIA RTX A4000, each with 16 GB VRAM
- Model Deployment: Over 40 models are deployed on a single node.
- GPU Requirements: Each model utilizes 1 or 2 GPUs. With conventional dedicated-GPU allocation, serving all of these models would require a total of 70 GPUs.
- Achieved Density: With only 2 physical GPUs available, inferX achieves a deployment density of 3500% (see the calculation after this list), far surpassing the approximately 80–90% density of traditional inference platforms.
- Resource Management: The 40+ models share the 2 GPUs. When a request arrives for a specific model, inferX first checks for an existing warm instance; if none is available, it identifies idle GPUs and cold starts the model to serve the request (a minimal sketch of this path appears at the end of this page).
- Performance: The system can cold start a 12B model in under 2 seconds, ensuring rapid response times.
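
The density figure is simply the ratio of the GPUs that dedicated allocation would require to the GPUs physically present on the node:

$$\text{deployment density} = \frac{\text{GPUs required with dedicated allocation}}{\text{GPUs available}} = \frac{70}{2} = 3500\%$$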
This demonstration underscores inferX's ability to maximize GPU resource utilization and deliver swift inference services.
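
For concreteness, here is a minimal sketch of the request path described above, assuming a simple single-node scheduler. `Instance`, `Scheduler`, and `serve` are illustrative names, not the actual inferX API, and the oldest-first eviction policy is an assumption the demo does not specify:

```python
# Minimal sketch of the warm/cold request routing described above. All
# names here are illustrative assumptions, not the actual inferX API.
from dataclasses import dataclass, field


@dataclass
class Instance:
    model: str
    gpu_ids: list[int]  # GPUs this warm instance currently holds


@dataclass
class Scheduler:
    total_gpus: int  # 2 on the demo machine
    warm: list[Instance] = field(default_factory=list)

    def _idle_gpus(self) -> list[int]:
        """GPUs not held by any warm instance."""
        busy = {g for inst in self.warm for g in inst.gpu_ids}
        return [g for g in range(self.total_gpus) if g not in busy]

    def serve(self, model: str, gpus_needed: int) -> Instance:
        if gpus_needed > self.total_gpus:
            raise ValueError("model needs more GPUs than the node has")
        # 1. Reuse an existing warm instance of this model if one exists.
        for inst in self.warm:
            if inst.model == model:
                return inst
        # 2. Otherwise free GPUs by evicting warm instances until enough
        #    are idle (oldest-first here; the real policy is unspecified).
        while len(self._idle_gpus()) < gpus_needed:
            self.warm.pop(0)
        # 3. Cold start the model on the idle GPUs; the demo reports this
        #    takes under 2 seconds for a 12B model.
        inst = Instance(model=model, gpu_ids=self._idle_gpus()[:gpus_needed])
        self.warm.append(inst)
        return inst
```

For example, `Scheduler(total_gpus=2).serve("model-a", 1)` would trigger a cold start, while a second identical call returns the warm instance immediately; this is how 40+ models can time-share 2 physical GPUs.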