
Conversation

@LukeAVanDrie
Contributor

Reuses aiohttp.ClientSession across requests in openAIModelServerClient to reduce connection overhead. This change improves client-side throughput and latency.

Additional improvements:

  • Refines error handling to distinguish between network errors (like aiohttp.ClientError), non-200 HTTP status codes, and errors during response processing.
  • Ensures non-200 responses with text bodies are captured.
  • Guarantees response body is always consumed to release connections.
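A minimal sketch of the pattern described above, assuming a simplified client; the class and method names here are illustrative, not the actual openAIModelServerClient API:

```python
import aiohttp

class ExampleClient:
    """Illustrative only: one shared ClientSession instead of one per request."""

    def __init__(self, max_tcp_connections: int = 100):
        # A single session reuses connections via the connector's pool.
        self.session = aiohttp.ClientSession(
            connector=aiohttp.TCPConnector(limit=max_tcp_connections)
        )

    async def post_json(self, url: str, payload: dict) -> dict:
        try:
            async with self.session.post(url, json=payload) as resp:
                if resp.status != 200:
                    # Capture the text body of non-200 responses for diagnostics.
                    body = await resp.text()
                    raise RuntimeError(f"HTTP {resp.status}: {body}")
                # Reading the body inside the context manager ensures the
                # connection is released back to the pool.
                return await resp.json()
        except aiohttp.ClientError as exc:
            # Network-level failures (DNS, resets, timeouts) are reported
            # separately from HTTP status and response-processing errors.
            raise RuntimeError(f"network error: {exc}") from exc

    async def close(self) -> None:
        await self.session.close()
```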

@k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 7, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign achandrasekar for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 7, 2025
)
)

end_time = time.perf_counter()
Collaborator


can move to a finally block
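A minimal sketch of the suggestion; `send_request` and `record_latency` are hypothetical stand-ins for the surrounding code:

```python
import time

async def timed_request(payload):
    start_time = time.perf_counter()
    try:
        # send_request is a hypothetical stand-in for the actual request call.
        return await send_request(payload)
    finally:
        # Runs on success and on exceptions alike, so latency is always recorded.
        end_time = time.perf_counter()
        record_latency(end_time - start_time)  # hypothetical metrics hook
```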

Collaborator

@jjk-g left a comment


Thanks for adding! One nit

/lgtm

@k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 7, 2025
Contributor

@achandrasekar left a comment


Can you add how the change was tested? If you have any numbers on the improvements, that'd be great too.

@achandrasekar
Contributor

Please address the linting and type-check issues above

@k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jjk-g
Collaborator

jjk-g commented Oct 23, 2025

@LukeAVanDrie friendly ping for linting and type check errors

elif not tokenizer_config:
    tokenizer_config = CustomTokenizerConfig(pretrained_model_name_or_path=self.model_name)
self.tokenizer = CustomTokenizer(tokenizer_config)
self.session = aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=self.max_tcp_connections))


Please correct me if I'm wrong, but isn't openAIModelServerClient shared across multiple asyncio event loops because of multiprocessing? Creating a single ClientSession here might cause issues if the same instance is also shared across all the multiprocessing workers.

Relevant link: https://stackoverflow.com/questions/62707369/one-aiohttp-clientsession-per-thread
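One possible way to address this concern, sketched here purely as an illustration (the class and attribute names are assumptions, not the project's actual code), is to create the session lazily inside the worker's own event loop:

```python
import asyncio
import aiohttp

class ExampleClient:
    """Illustrative only: one ClientSession per event loop, created lazily."""

    def __init__(self, max_tcp_connections: int = 100):
        self.max_tcp_connections = max_tcp_connections
        self._sessions: dict[int, aiohttp.ClientSession] = {}

    def _get_session(self) -> aiohttp.ClientSession:
        # Keyed by the running loop, so each worker (and therefore each event
        # loop) gets its own session instead of sharing one created elsewhere.
        # Must be called from inside a running event loop.
        loop_id = id(asyncio.get_running_loop())
        session = self._sessions.get(loop_id)
        if session is None or session.closed:
            session = aiohttp.ClientSession(
                connector=aiohttp.TCPConnector(limit=self.max_tcp_connections)
            )
            self._sessions[loop_id] = session
        return session
```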

diamondburned added a commit to diamondburned/inference-perf that referenced this pull request Nov 11, 2025
Slightly refactor `openAIModelServerClient` to accept a custom
`aiohttp.ClientSession` per request, which allows us to use exactly 1
client session per worker.

Prior to this commit, a new `aiohttp.ClientSession` was created for each
request. This is not only inefficient, lowering throughput; in certain
environments it also leads to inotify watch issues:

    aiodns - WARNING - Failed to create DNS resolver channel with
    automatic monitoring of resolver configuration changes. This usually
    means the system ran out of inotify watches. Falling back to socket
    state callback. Consider increasing the system inotify watch limit:
    Failed to initialize c-ares channel

Indeed, because a new DNS resolver is created for each `ClientSession`,
creating large numbers of `ClientSession`s eventually exhausts the inotify
watch limit. Sharing `ClientSession`s solves this issue.

Relevant links:

- https://docs.aiohttp.org/en/stable/http_request_lifecycle.html
- https://stackoverflow.com/questions/62707369/one-aiohttp-clientsession-per-thread
- home-assistant/core#144457 (comment)

Relevant PR: kubernetes-sigs#247
(doesn't address the issue of worker sharing).
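A minimal sketch of the per-worker-session approach the commit message describes; `run_worker` and `client.send` are hypothetical names, not the actual inference-perf API:

```python
import aiohttp

async def run_worker(client, payloads):
    # Each worker owns exactly one ClientSession for its lifetime and passes
    # it to the client for every request, so only one DNS resolver channel is
    # created per worker.
    async with aiohttp.ClientSession() as session:
        for payload in payloads:
            await client.send(payload, session=session)  # hypothetical client API
```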
