Skip to content

fix(serverless): use executionTimeoutMs as runsync client timeout#272

Merged
deanq merged 5 commits intomainfrom
fix/AE-2375-runsync-client-timeout
Apr 1, 2026
Merged

fix(serverless): use executionTimeoutMs as runsync client timeout#272
deanq merged 5 commits intomainfrom
fix/AE-2375-runsync-client-timeout

Conversation

@deanq
Copy link
Copy Markdown
Member

@deanq deanq commented Mar 15, 2026

Summary

  • ServerlessResource.runsync() hardcoded a 60s HTTP client timeout regardless of executionTimeoutMs, causing GPU inference jobs exceeding 1 minute to always fail with a client-side timeout
  • Now uses executionTimeoutMs / 1000 as the client timeout when set (non-zero, positive), falling back to 60s default

Changes

File Change
core/resources/serverless.py Derive client timeout from self.executionTimeoutMs with explicit is not None and > 0 guard
tests/unit/resources/test_serverless.py 3 tests: custom timeout, zero fallback, and None fallback

Test plan

  • test_runsync_uses_execution_timeout — verifies executionTimeoutMs=300000 results in timeout=300
  • test_runsync_default_timeout_when_zero — verifies executionTimeoutMs=0 falls back to timeout=60
  • test_runsync_default_timeout_when_none — verifies executionTimeoutMs=None falls back to timeout=60
  • Full make quality-check passes (85.40% coverage)

Closes AE-2375

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes ServerlessResource.runsync() always using a hardcoded 60s client timeout by deriving the HTTP client timeout from executionTimeoutMs (when configured), preventing long-running GPU inference jobs from failing client-side.

Changes:

  • Add a module-level default runsync timeout constant (60s).
  • Compute the runsync HTTP client timeout from executionTimeoutMs / 1000 when non-zero; otherwise fall back to the default.
  • Add unit tests covering configured timeout and default fallbacks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/runpod_flash/core/resources/serverless.py Derives runsync client timeout from executionTimeoutMs with a 60s fallback constant.
tests/unit/resources/test_serverless.py Adds async unit tests asserting the timeout passed to rp_client.post() for set/zero/None executionTimeoutMs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@deanq deanq requested a review from runpod-Henrik March 15, 2026 04:32
Copy link
Copy Markdown
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. The fix — correct

Clean implementation. The guard is not None and > 0 correctly rejects None, 0, and negative values, all falling back to the 60s default. No issues with the core logic.


2. Question: no buffer between server limit and client timeout

executionTimeoutMs=300_000 → client timeout of exactly 300.0s. The RunPod /runsync endpoint will kill the job at 300_000ms server-side, then return a response. If there's any network overhead between the server killing the job and the response arriving at the client, the client HTTP timeout fires first — same failure mode as before, just pushed out by however long the job runs.

A small buffer is conventional here:

timeout_s: float = (
    self.executionTimeoutMs / 1000 + 10   # +10s for network round-trip
    if self.executionTimeoutMs is not None and self.executionTimeoutMs > 0
    else DEFAULT_RUNSYNC_TIMEOUT_S
)

Not blocking — the fix is still a massive improvement over the hardcoded 60s — but worth a decision either way. If intentionally no buffer, a comment explaining why would close the question.


3. Test gap: negative executionTimeoutMs not covered

The guard rejects executionTimeoutMs < 0 (falls back to 60s), and commit edc425c explicitly called this out, but there's no test for it. A test would lock in that behavior:

async def test_runsync_default_timeout_when_negative(self):
    resource = ServerlessResource(name="test-neg", executionTimeoutMs=-1)
    resource.id = "ep-neg"
    # ... assert timeout=60

4. Scope note: run() polling path unaffected

This fix is specific to runsync. The async run() → polling path uses a different timeout model (RunPod SDK's await job.result() with its own polling interval). Not a regression here, just worth noting if AE-2375 was originally reported against both paths.


5. Known gap (pre-existing, not introduced here)

From the known-bugs register: execution_timeout_ms is not enforced server-side — workers ignore the configured timeout. This PR fixes the client-side mismatch only. If a job runs past executionTimeoutMs, the server doesn't kill it, so the client will now correctly wait the full configured duration. The server-side enforcement gap remains open separately.


Verdict

PASS. The fix resolves the stated bug correctly. One ask: decide on the buffer question (#2) and either add +10 or add a comment explaining the no-buffer choice.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@deanq deanq force-pushed the fix/AE-2375-runsync-client-timeout branch from edc425c to a3d5e7d Compare March 17, 2026 20:00
Copy link
Copy Markdown
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up on prior review

The explicit is not None and > 0 guard and DEFAULT_RUNSYNC_TIMEOUT_S = 60 constant are clean. The three test cases cover the main branches correctly.

The buffer question from the prior review is still open — executionTimeoutMs / 1000 gives exactly the server-side limit with no room for network round-trip. Still non-blocking, but worth noting.

Two minor gaps the tests don't cover:

  • Negative value (executionTimeoutMs=-100) — the > 0 guard correctly falls back to 60s but there's no test for it
  • Very small value (executionTimeoutMs=500) → timeout=0.5s — passes the guard and would cause an immediate client timeout on any real request; no validation prevents it

Verdict: PASS WITH NITS

Core fix is correct. The gaps above are low-risk in practice.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

@deanq deanq force-pushed the fix/AE-2375-runsync-client-timeout branch from a3d5e7d to 6a22440 Compare March 19, 2026 18:04
@promptless
Copy link
Copy Markdown

promptless bot commented Mar 23, 2026

📝 Documentation updates detected!

New suggestion: Document runsync client timeout behavior


Tip: Leave inline comments with @Promptless on suggestion diffs in the Promptless dashboard for targeted refinements 💬

@deanq deanq force-pushed the fix/AE-2375-runsync-client-timeout branch 2 times, most recently from 38be4fa to ccfde45 Compare March 25, 2026 23:06
@deanq deanq requested review from Copilot and runpod-Henrik March 26, 2026 09:13
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes ServerlessResource.runsync() to respect executionTimeoutMs when determining the underlying HTTP client timeout, preventing long-running inference jobs from failing due to a hardcoded 60s client-side timeout.

Changes:

  • Compute runsync HTTP timeout from executionTimeoutMs / 1000 when executionTimeoutMs is set and positive; otherwise fall back to a 60s default.
  • Add/adjust unit tests to validate the timeout behavior for custom, zero, and None execution timeouts.
  • Remove a large block of unrelated unit tests from tests/unit/resources/test_serverless.py (not mentioned in the PR description).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/runpod_flash/core/resources/serverless.py Derives runsync client timeout from executionTimeoutMs with a 60s default constant.
tests/unit/resources/test_serverless.py Adds runsync timeout tests, but also deletes substantial existing coverage for template env injection and update() behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1752 to 1754
class TestServerlessRunsyncTimeout:
"""Test that runsync uses executionTimeoutMs as client timeout."""

Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff removes a large set of existing unit tests (e.g. for _inject_template_env, _build_template_update_payload, and update() env preservation/injection behavior) and I can't find equivalent coverage elsewhere in tests/. Since the corresponding production code still exists in serverless.py, this looks like an unintended deletion that significantly reduces regression coverage. Please restore these tests (or move them to a new test module) and keep this PR focused on the runsync-timeout change, or explicitly document/justify the removal if it’s intentional.

Copilot uses AI. Check for mistakes.
deanq added 5 commits April 1, 2026 09:27
runsync() hardcoded a 60s HTTP client timeout regardless of
executionTimeoutMs, causing GPU inference jobs exceeding 1 minute to
fail with a client-side timeout even when the server allowed more time.

Now uses executionTimeoutMs / 1000 when set, falling back to 60s default.

Closes AE-2375
- Move DEFAULT_RUNSYNC_TIMEOUT_S to module-level constant
- Add test for executionTimeoutMs=None fallback
Replace truthiness check with `is not None and > 0` to reject negative
values. Annotate timeout_s as float for clarity.
The update test needs get_template mocked to return a real dict,
otherwise the AsyncMock auto-attribute returns a coroutine from
.get("env") which is not iterable.
…ayload

These test classes were accidentally deleted during an earlier rebase.
They cover template env injection, API key injection, and update()
env handling — unrelated to the runsync timeout fix.

Also refreshes uv.lock to resolve stale runpod-python git commit.
@deanq deanq force-pushed the fix/AE-2375-runsync-client-timeout branch from b3463d2 to 058ca36 Compare April 1, 2026 18:49
@deanq deanq merged commit 49451d4 into main Apr 1, 2026
4 checks passed
@deanq deanq deleted the fix/AE-2375-runsync-client-timeout branch April 1, 2026 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants