Connection breaks when scaling Triton servers deployed via interLink #78

@kondratyevd

Description

When Triton servers are deployed on interLink virtual nodes (in Slurm jobs), a single-server setup works fine. However, when the number of clients is high enough to trigger autoscaling, the client-server connections break and the clients fail.

My main suspicion is that servers started on interLink nodes take too long to start up (time to establish the wstunnel connection, pull the Singularity image, and possibly something else).

Possible ways to fix:

  • tweak the readiness probes for the Triton servers
  • even better, have the Triton server itself report ready a bit later, eliminating the need to fine-tune Kubernetes probes
  • debugging is currently inconvenient; better log aggregation might help
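
As a rough illustration of the readiness idea, here is a minimal client-side sketch that polls Triton's HTTP readiness endpoint (`/v2/health/ready`) with exponential backoff before sending any requests. It is not part of this repository: the function names `triton_is_ready` and `wait_until_ready`, the default timings, and the example URL are all assumptions chosen for illustration, and it has not been tested against servers running behind interLink.

```python
import http.client
import time
import urllib.request
from typing import Callable


def triton_is_ready(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/v2/health/ready answers with HTTP 200.

    Hypothetical helper: base_url is whatever address the client uses to
    reach the server (e.g. "http://localhost:8000").
    """
    try:
        url = f"{base_url}/v2/health/ready"
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, non-2xx status, or a broken response:
        # treat all of them as "not ready yet".
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait_s: float = 300.0,      # assumed upper bound on server startup time
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() with exponential backoff until it succeeds or time runs out."""
    deadline = clock() + max_wait_s
    delay = initial_delay_s
    while True:
        if probe():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

A client could call `wait_until_ready(lambda: triton_is_ready("http://localhost:8000"))` before opening its inference connection. This only works around slow startup on the client side; it does not replace tuning the Kubernetes readiness probe or delaying the server's own ready signal.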

Metadata

Labels: bug (Something isn't working)