[PECOBLR-1735] Fix #729 and #731: Telemetry lifecycle management #734

msrathore-db · 2026-01-29T12:05:47Z

What type of PR is this?

Refactor
Feature
Bug Fix
Other

Description

Issue #729: Connection failures hang CICD jobs for 15+ minutes, cannot be cancelled

Telemetry forced on even with enable_telemetry=False
900-second timeout causes long hangs
Blocking shutdown prevents process exit

Issue #731: Race condition causes AttributeError: 'NoneType' object has no attribute 'request'

Async telemetry executes after _http_client is garbage collected
No null check before calling pool_manager.request()

Changes

Issue #729 (3 fixes)

1. Respect enable_telemetry parameter (telemetry_client.py:722-729, client.py:321)
- Added parameter to connection_failure_log() with early return if disabled
- Connection passes user's preference through
2. Reduce timeout to 30s (telemetry_client.py:299)
- Changed from 900s to 30s for faster failure
3. Non-blocking shutdown (telemetry_client.py:707)
- Changed executor.shutdown(wait=True) to wait=False
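
A minimal sketch of the two mechanisms that make the failure path return quickly, assuming a simplified stand-in for connection_failure_log() (the real signature in telemetry_client.py may differ). The function body, the sink argument, and slow_send are hypothetical and exist only for illustration; ThreadPoolExecutor.shutdown(wait=False) is the standard-library call named above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def connection_failure_log(error_name, enable_telemetry=True, sink=None):
    """Hypothetical stand-in: record a failure only when telemetry is enabled."""
    if not enable_telemetry:
        # Early return: no client is created and nothing is sent.
        return False
    if sink is not None:
        sink.append(error_name)
    return True


def slow_send():
    """Pretend telemetry upload that takes a while."""
    time.sleep(0.5)
    return "sent"


executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(slow_send)

start = time.monotonic()
# wait=False returns immediately instead of blocking until slow_send finishes.
executor.shutdown(wait=False)
shutdown_elapsed = time.monotonic() - start
```

Work that was already submitted still runs to completion in the background; only the caller of shutdown() stops waiting for it.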

Issue #731 (3 fixes)

1. Null-safety check (unified_http_client.py:296-298)
- Check if pool_manager is None before calling .request()
- Raise RequestError with clear message
2. Add __del__ cleanup (telemetry_client.py:439-451)
- Close _http_client during garbage collection
- Ensures eventual resource cleanup
3. Document lifecycle (telemetry_client.py:430-435)
- Added docstring explaining why _http_client isn't closed in close()
- Import RequestError from databricks.sql.exc
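
The guard and the deferred cleanup can be sketched roughly as follows. This is not the connector's code: RequestError here is a local stand-in for databricks.sql.exc.RequestError, and SketchHttpClient, SketchTelemetryClient, and the error message are hypothetical names chosen for illustration. The reason given in the close() comment is an assumption based on the race described in Issue #731.

```python
class RequestError(Exception):
    """Local stand-in for databricks.sql.exc.RequestError."""


class SketchHttpClient:
    def __init__(self):
        self._pool_manager = object()  # placeholder for a real connection pool
        self.closed = False

    def request(self, method, url):
        # Guard: fail with a clear error instead of AttributeError on None.
        if self._pool_manager is None:
            raise RequestError("HTTP client is closed; cannot send request")
        return (method, url)

    def close(self):
        self._pool_manager = None
        self.closed = True


class SketchTelemetryClient:
    def __init__(self, http_client):
        self._http_client = http_client

    def close(self):
        # Assumption: the HTTP client is left open here because async
        # telemetry sends may still be in flight after close() returns.
        pass

    def __del__(self):
        # Eventual cleanup once nothing references this client any more.
        try:
            if self._http_client is not None:
                self._http_client.close()
        except Exception:
            pass  # never raise from a finalizer
```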

Impact

Connection failures exit in <1s instead of 15+ minutes
CICD jobs no longer hang
No more race condition AttributeError
enable_telemetry=False now respected in all scenarios

Backward Compatibility

Fully backward compatible. No API changes, only behavior improvements.

Closes #729
Closes #731

How is this tested?

Unit tests
E2E Tests
Manually
N/A

Related Tickets & Documents

github-actions · 2026-01-29T12:06:00Z

Thanks for your contribution! To satisfy the DCO policy in our contributing guide every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

samikshya-db

Thanks for the fix!

src/databricks/sql/telemetry/telemetry_client.py

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

Per reviewer feedback on PR #734:

1. Revert timeout from 30s back to 900s (line 299)
- Reviewer noted that with wait=False, timeout is not critical
- The async nature and wait=False handle the exit speed
2. Revert telemetry_enabled parameter back to True (line 734)
- Reviewer noted this is redundant given the early return
- If enable_telemetry=False, we return early (line 729)
- Line 734 only executes when enable_telemetry=True
- Therefore using the parameter here is unnecessary

These changes address the reviewer's valid technical concerns while keeping the core fixes intact:
- wait=False for non-blocking shutdown (critical for Issue #729)
- Early return when enable_telemetry=False (critical for Issue #729)
- All Issue #731 fixes (null-safety, __del__, documentation)

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

Apply Black formatting to files modified in previous commits:
- src/databricks/sql/common/unified_http_client.py
- src/databricks/sql/telemetry/telemetry_client.py

Changes are purely cosmetic (quote style consistency).

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

Add @pytest.mark.xdist_group to telemetry test classes to ensure they run sequentially on the same worker when using pytest-xdist (-n auto).

Root cause: Tests marked @pytest.mark.serial were still being parallelized in CI because pytest-xdist doesn't respect custom markers by default. With host-level telemetry batching (PR #718), tests running in parallel would share the same TelemetryClient and interfere with each other's event counting, causing test_concurrent_queries_sends_telemetry to see 88 events instead of the expected 60.

The xdist_group marker ensures all tests in the "serial_telemetry" group run on the same worker sequentially, preventing state interference.

Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

Modified telemetry_setup_teardown fixtures to clean up TelemetryClientFactory state both BEFORE and AFTER each test, not just after. This prevents leftover state from previous tests (pending events, active executors) from interfering with the current test.

Root cause: In CI with sequential execution on the same worker, if a previous test left pending telemetry events in the executor, those events could be captured by the next test's mock, causing inflated event counts (88 instead of 60).

Now ensures complete isolation between tests by resetting all shared state before each test starts.

Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>
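
The before-and-after reset described in this commit can be illustrated with a small standard-library sketch. FakeFactory and its reset() method are hypothetical stand-ins for the shared TelemetryClientFactory state, and contextlib.contextmanager is used here only so the example runs without pytest; a yield-style pytest fixture follows the same shape.

```python
from contextlib import contextmanager


class FakeFactory:
    """Hypothetical stand-in for shared, process-wide telemetry state."""

    pending_events = []

    @classmethod
    def reset(cls):
        cls.pending_events = []


@contextmanager
def telemetry_setup_teardown():
    FakeFactory.reset()  # BEFORE: drop anything a previous test left behind
    try:
        yield FakeFactory
    finally:
        FakeFactory.reset()  # AFTER: leave nothing for the next test


# Simulate leftover state from an earlier test.
FakeFactory.pending_events.append("stale-event")

with telemetry_setup_teardown() as factory:
    seen_at_start = list(factory.pending_events)
    factory.pending_events.append("new-event")
    seen_during = list(factory.pending_events)

seen_after = list(FakeFactory.pending_events)
```

With teardown-only cleanup, the stale event would still be present when the block starts, which is the kind of inflated count the commit message describes.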

The _flush_event threading.Event was never cleared after stopping the flush thread, remaining in "set" state. This caused timing issues in subsequent tests where the Event was already signaled, triggering unexpected flush behavior and causing extra telemetry events to be captured (88 instead of 60).

Now explicitly clear the _flush_event flag in both setup (before test) and teardown (after test) to ensure clean state isolation between tests.

This explains why CI consistently got 88 events - the flush_event from previous tests triggered additional flushes during test execution.

Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>
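
To see why a stale Event matters, here is a small sketch using only the standard library. The variable flush_event is a hypothetical stand-in for the _flush_event attribute mentioned in this commit; the point is that wait() on an already-set Event returns immediately until clear() is called.

```python
import threading

flush_event = threading.Event()

# A previous test signalled the event and never cleared it.
flush_event.set()

# Any waiter now wakes up at once instead of sleeping for the interval.
woke_immediately = flush_event.wait(timeout=0.01)

# Resetting the flag restores the expected blocking behavior.
flush_event.clear()
woke_after_clear = flush_event.wait(timeout=0.01)
```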

1. Created new workflow 'test-telemetry-only.yml' that runs only the failing telemetry test with -n auto, mimicking real CI but much faster
2. Added debug output to test showing:
- Client-side captured events
- Number of futures/batches
- Number of server responses
- Server-reported successful events

This will help identify why CI gets 88 events vs local 60 events.

Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

Changes across multiple workflows:

1. integration.yml:
- Add krb5 system dependency to telemetry job
- Fixes: krb5-config command not found error during poetry install
2. code-coverage.yml:
- Add krb5 system dependency
- Split telemetry tests into separate step for isolation
- Maintains coverage accumulation with --cov-append
3. publish-test.yml:
- Add krb5 system dependency for consistent builds
4. test_concurrent_telemetry.py:
- Remove debug print statements
5. Delete test-telemetry-only.yml:
- Remove temporary debug workflow

All workflows now have proper telemetry test isolation and required system dependencies for kerberos packages.

Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

Poetry 2.3.2 installation fails with Python 3.9: Installing Poetry (2.3.2): An error occurred. Other workflows use Python 3.10 and work fine. Updating to match ensures consistency and avoids Poetry installation issues. Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

…tests
- Remove --dist=loadgroup from non-telemetry job (only needed for telemetry)
- Remove test_telemetry_e2e.py from telemetry job (was skipped before)
- This should fix test_uc_volume_life_cycle failure caused by changed test distribution

samikshya-db approved these changes Feb 4, 2026

samikshya-db reviewed Feb 4, 2026

msrathore-db added 2 commits February 5, 2026 16:35

Fix #729 and #731: Telemetry lifecycle management

bcdc8a5

Signed-off-by: Madhavendra Rathore <madhavendra.rathore@databricks.com>

msrathore-db force-pushed the fix/telemetry-lifecycle-issues-729-731 branch from d4f9054 to 471a551 Compare February 5, 2026 11:30

msrathore-db changed the title ~~Fix #729 and #731: Telemetry lifecycle management~~ [PECOBLR-1735] Fix #729 and #731: Telemetry lifecycle management Feb 5, 2026

Fix workflow: Add krb5 system dependency

c558fae

The workflow was failing during poetry install due to missing krb5 system libraries needed for kerberos dependencies. Signed-off-by: Claude Sonnet 4.5 <noreply@anthropic.com>

msrathore-db added 3 commits February 6, 2026 18:10

Fix code-coverage workflow: Remove test_telemetry_e2e.py from coverag…

a514ca2

…e tests
- Only run test_concurrent_telemetry.py in isolated telemetry step
- test_telemetry_e2e.py was excluded in original workflow, keep it excluded

Fix publish-test workflow: Remove cache conditional

0c01ba9

- Always run poetry install (not just on cache miss)
- Ensures fresh install with system dependencies (krb5)
- Matches pattern used in integration.yml

Merge remote-tracking branch 'origin/main' into fix/telemetry-lifecyc…

74ea9cf

…le-issues-729-731

Fix publish-test.yml: Remove duplicate krb5 install, restore cache co…

649a41d

…nditional
- Remove duplicate system dependencies step
- Restore cache conditional to match main branch
- Keep Python 3.10 (our change from 3.9)

Fix code-coverage: Remove serial tests step

162302e

- All serial tests are telemetry tests (test_concurrent_telemetry.py and test_telemetry_e2e.py)
- They're already run in the isolated telemetry step
- Running -m serial with --ignore on both files results in 0 tests (exit code 5)

msrathore-db merged commit 61f8029 into main Feb 6, 2026
36 of 37 checks passed
