When Triton servers are deployed on interLink virtual nodes (in Slurm jobs), a single-server setup works fine. However, when the number of clients is high enough to trigger autoscaling, the client-server connections break and the clients fail.
My main suspicion is that servers started on interLink nodes take too long to become ready (establishing the wstunnel, pulling the Singularity image, and possibly other steps).
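One way to check this suspicion is to poll Triton's standard readiness endpoints from inside the cluster and log how long a freshly scheduled replica takes to report ready. A minimal sketch with the Triton Python client; `triton-service:8000` and `my_model` are placeholders for the actual service address and model name:

```python
"""Poll a freshly scheduled Triton replica and log how long it takes to become ready.

TRITON_URL and MODEL_NAME are placeholders; point them at the actual service/model.
"""
import time

import tritonclient.http as httpclient

TRITON_URL = "triton-service:8000"  # assumed in-cluster service name and HTTP port
MODEL_NAME = "my_model"             # assumed model name
POLL_INTERVAL_S = 5
TIMEOUT_S = 900

client = httpclient.InferenceServerClient(url=TRITON_URL)
start = time.monotonic()

while True:
    elapsed = time.monotonic() - start
    if elapsed > TIMEOUT_S:
        raise TimeoutError(f"Triton not ready after {TIMEOUT_S}s")
    try:
        live = client.is_server_live()
        ready = client.is_server_ready()
        model_ready = client.is_model_ready(MODEL_NAME)
    except Exception as exc:
        # Connection errors are expected while wstunnel setup / Singularity pull are still running.
        print(f"[{elapsed:6.1f}s] no connection yet: {exc}")
    else:
        print(f"[{elapsed:6.1f}s] live={live} ready={ready} model_ready={model_ready}")
        if ready and model_ready:
            break
    time.sleep(POLL_INTERVAL_S)

print(f"Replica became ready after {time.monotonic() - start:.1f}s")
```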
Possible ways to fix:
- tweak the readiness probes for the Triton servers (see the probe sketch after this list)
- it would be even better if the Triton server itself appeared ready a bit later, eliminating the need to fine-tune the Kubernetes probes
- debugging this is currently inconvenient; better aggregation of the server logs might help
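For the probe tweak, a sketch of a more tolerant readiness probe on Triton's standard `/v2/health/ready` endpoint, applied here with the Kubernetes Python client; the deployment, namespace, and container names as well as the timing values are assumptions to be tuned against the measured startup times:

```python
"""Relax the readiness probe of the Triton deployment so slow interLink startup
(wstunnel, Singularity image pull) does not mark the pod unready too early.

Deployment, namespace, and container names below are assumptions; adjust them.
"""
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
apps = client.AppsV1Api()

# Triton exposes its readiness endpoint on the HTTP port (8000 by default).
readiness_probe = {
    "httpGet": {"path": "/v2/health/ready", "port": 8000},
    "initialDelaySeconds": 120,  # allow time for wstunnel + image pull
    "periodSeconds": 10,
    "timeoutSeconds": 5,
    "failureThreshold": 30,      # keep probing for several more minutes before giving up
}

# Strategic merge patch: containers are merged by name, so only the probe changes.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "triton", "readinessProbe": readiness_probe}
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="triton-server", namespace="default", body=patch)
```

Alternatively, a `startupProbe` with a large `failureThreshold` would cover the slow startup without loosening the steady-state readiness check.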