Fix flaky Cosmos DB tests and critical NullPointerException bugs in CI by kushagraThapar · Pull Request #48064 · Azure/azure-sdk-for-java

kushagraThapar · 2026-02-21T22:18:07Z

Description

Fix additional flaky Cosmos DB tests and add transient error retry to TestSuiteBase

Summary

Fix additional flaky Cosmos DB tests beyond PR #48025, and add systematic transient error retry handling to TestSuiteBase resource creation methods. Based on analysis of 15 consecutive main branch CI builds (all failed, ~21 test failures each, ~3h 15m average runtime).

Changes (5 files, +63/-15)

Commit 1: Fix additional flaky tests beyond PR #48025

TestSuiteBase.truncateCollection: Add null guards for collection and altLink to prevent NPE when @BeforeSuite initialization fails. This cascading failure affected ~15+ tests across multiple matrix jobs (RequestHeadersSpyWireTest, DocumentQuerySpyWireContentTest, SessionTest, ConsistencyTests).
ClientMetricsTest: Increase timeOut from TIMEOUT (40s) to TIMEOUT * 2 (80s) for effectiveMetricCategoriesForDefault and effectiveMetricCategoriesForAllLatebound — these methods call createCollection() internally which can exceed 40s under CI load.
ClientRetryPolicyE2ETests: Relax duration assertion from 5s to 10s for dataPlaneRequestHitsLeaseNotFoundInFirstPreferredRegion — CI scheduling jitter causes actual durations of 5.04-5.49s.
OrderbyDocumentQueryTest: Add Retry.fixedDelay(3, 5s) for transient 408/429/503 errors during container creation in @BeforeClass setup.

Commit 2: Fix ReproTest and additional timeouts

ReproTest.runICM497415681OriginalReproTest: Use isGreaterThanOrEqualTo(1000) instead of isEqualTo(1000) since the test uses getSharedSinglePartitionCosmosContainer() which may contain leftover docs from previous test runs, causing assertions like expected 1000 but was 1004.
ClientRetryPolicyE2ETests: Increase timeOut from TIMEOUT to TIMEOUT * 2 for both dataPlaneRequestHitsLeaseNotFoundInFirstPreferredRegion and dataPlaneRequestHitsLeaseNotFoundAndResourceThrottleFirstPreferredRegion to prevent ThreadTimeoutException.

Commit 3: Add transient error retry to TestSuiteBase create methods

Add isTransientCreateFailure helper that checks for CosmosException with status codes 408 (RequestTimeout), 429 (TooManyRequests), or 503 (ServiceUnavailable).
Add Retry.fixedDelay(3, Duration.ofSeconds(5)) to all resource creation methods:
- createCollection (3 overloads)
- safeCreateDatabase
- createDatabase
- createDatabaseIfNotExists
These methods are called from @BeforeClass/@BeforeSuite of most test classes. Transient failures during resource creation previously cascaded into dozens of test failures when the setup method failed without retry.

Root Causes Addressed

#	Root Cause	Severity	Tests Affected
1	NPE cascade in `truncateCollection` when `altLink` is null	High	~15+ tests across 6 matrix jobs
2	Timeout too low (40s) for CI environments under load	Medium	ClientMetricsTest, ClientRetryPolicyE2ETests
3	Duration assertions too tight (5s gets 5.04-5.49s in CI)	Low	ClientRetryPolicyE2ETests
4	Transient 408/429/503 in setup methods with no retry	High	All 443 test classes extending TestSuiteBase
5	Shared container leftover data breaks exact-count assertion	Low	ReproTest

CI Impact

Fixes affect tests that fail in 6+ matrix jobs per build (StrongSession, Session, ConsistentPrefix, BoundedStaleness, MultiMaster)
The TestSuiteBase retry changes protect all 443 test classes that extend it from transient setup failures
Based on 15 recent builds: ~38,700 tests pass, ~21 fail, ~3,500 not executed per build

Testing

These are test infrastructure changes — they will be validated by the CI pipeline itself.

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

- ClientMetricsTest.readItem: Increased timeout from TIMEOUT (40s) to SETUP_TIMEOUT (60s) to handle collection creation delays in TestState initialization - PerPartitionCircuitBreakerE2ETests.miscellaneousDocumentOperationHitsTerminalExceptionAcrossKRegionsGateway: Increased timeout from 4*TIMEOUT (160s) to 5*TIMEOUT (200s) and added FlakyTestRetryAnalyzer to handle transient circuit breaker failures Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

- ContainerCreateDeleteWithSameNameTest.bulk: Add 500ms delay after bulk operations to allow indexing to complete before querying - PointWriterITest upsert if not modified: Add 100ms delay after flushAndClose to allow metrics aggregation to complete Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

- Add lazy initialization helpers getWriteRegionsForDataProvider() and getReadRegionsForDataProvider() - Replace all this.writeRegions and this.readRegions calls in data providers with helper methods - Fix missing readRegions initialization in beforeClass() - Add null check in ClientRetryPolicyE2ETests for preferredRegions.subList() Data providers execute before @BeforeClass, causing NPE when accessing uninitialized region lists. Lazy init ensures regions are available when data providers need them. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

- SessionTest: Increase TIMEOUT from 20s to 60s for sessionTokenNotRequired test - ClientMetricsTest.maxValueExceedingDefinedLimitStillWorksWithoutException: TIMEOUT -> SETUP_TIMEOUT - FaultInjectionServerErrorRuleOnDirectTests: Increase address refresh validation retry from 5s to 10s - NonStreamingOrderByQueryVectorSearchTest: Increase SETUP_TIMEOUT from 20s to 60s - IncrementalChangeFeedProcessorTest: Add FlakyTestRetryAnalyzer to 5 tests that fail due to transient network errors during setup Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

- PerPartitionCircuitBreakerE2ETests: Replace remaining 5 occurrences of this.readRegions.subList() in data providers with getReadRegionsForDataProvider().subList() - ClientRetryPolicyE2ETests: Use SkipException instead of silently skipping validation when preferredRegions is null or has <2 elements - ContainerCreateDeleteWithSameNameTest: Restore interrupt flag before throwing RuntimeException for InterruptedException - ExcludeRegionTests: Separate InterruptedException handling to restore interrupt flag and fail fast; add descriptive error message Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

…rification In merge scenarios where the same lease is reused: 1. First addOrUpdateLease calls acquire() and schedules worker 2. Worker encounters FeedRangeGoneException 3. handleFeedRangeGone calls addOrUpdateLease again with same lease 4. Second call may invoke acquire() (if worker stopped) or updateProperties() (if still running) This is a race condition - the timing varies in CI. Changed verification from times(1) to atLeast(1)/atMost(2) to accept both outcomes. Increased wait time from 500ms to 2000ms for async operation chains to complete. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

…count Test fails intermittently with transient network errors: - CosmosException 410/0 (Gone) - channel closed with pending requests - CosmosException 408/10002 (Request Timeout) - address resolution timeout Root cause: maxRetryCount = 0 means no retries on transient failures Fix: Increased maxRetryCount from 0 to 3 (consistent with other PointWriter tests) This allows the test to retry on transient network issues instead of failing immediately. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

…counts CosmosItemWriteRetriesTest.createItem: - Added FlakyTestRetryAnalyzer to handle transient 409 conflicts - When fault injection delays (5s each) cause channel closures (410/20001), retries with tracking IDs can complete out of order - One retry succeeds while others eventually get 409 CONFLICT after 4 retries - Retry analyzer handles this timing variation (up to 2 retries of entire test) PointWriterSubpartitionITest - "can create item with duplicates": - Increased maxRetryCount from 0 to 3 - Test fails intermittently with CosmosException 410/0 (channel closed) and 408/0 (timeout) - Consistent with PointWriterITest fix and other Spark tests Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

…back Test fails with "0 did not equal 1" for recordsWrittenSnapshot. Root cause: Race condition between Spark internal metrics completion and onTaskEnd callback execution: 1. Write completes and metricValues computed 2. Test's eventually block succeeds (metricValues != null) 3. onTaskEnd callback fires asynchronously to update snapshot variables 4. Assertion runs before callback updates recordsWrittenSnapshot (still 0) Fix: Added eventually block to wait for recordsWrittenSnapshot > 0 before asserting exact value. This ensures onTaskEnd callback has completed before validation. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

…lay to 1000ms Test still fails intermittently with 8/10 items despite previous 500ms delay. Root cause: Indexing lag in CI can exceed 500ms for bulk operations on high-throughput containers (10100 RU/s). Fix: Increased delay from 500ms to 1000ms to provide adequate time for indexing to complete before querying. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

…instead of fixed delay Test still fails intermittently with 9999 vs 10000 despite 100ms delay. Root cause analysis: - Metrics are updated synchronously in write operations before futures complete - flushAndClose() waits for all futures, so metrics should be complete - However, 100ms fixed delay is insufficient and doesn't guarantee completion Better solution: Replace Thread.sleep(100) with eventually block (10s timeout, 100ms polling): - Polls until metrics >= expected count - Handles timing variations robustly - Times out with clear message if metrics never reach expected value - Consistent with SparkE2EWriteITest fix (commit 1954acc) This provides a more reliable solution than fixed delays. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

Error: "cannot be applied to (org.scalatest.matchers.Matcher[Int])" at line 313 Root cause: metricsPublisher.getRecordsWrittenSnapshot() returns Long, but (2 * items.size) is Int. The matcher `be >= (2 * items.size)` creates Matcher[Int], causing type mismatch when applied to Long. Fix: Convert comparison value to Long with .toLong Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

…ion for race condition Test now fails on partitionSupervisorFactory.create being called 2 times instead of 1. This is the same race condition as acquire, but manifesting differently: 1. First addOrUpdateLease -> acquire -> create (line 75) -> schedules worker 2. Worker hits FeedRangeGoneException -> handleFeedRangeGone 3. Second addOrUpdateLease with same lease 4. If worker stopped and removed from currentlyOwnedPartitions, the check at line 73 (checkTask == null) passes 5. This causes create to be called again Fix: Relax verification for create from times(1) to atLeast(1)/atMost(2), matching the acquire verification pattern. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

…tion for race condition Test now fails on leaseManager.release being called 2 times instead of 1. This is the same race condition affecting acquire and create: 1. First addOrUpdateLease -> worker starts -> FeedRangeGoneException -> removeLease -> release (call #1) 2. handleFeedRangeGone returns same lease -> second addOrUpdateLease 3. If timing causes second worker to also hit exception quickly -> removeLease -> release (call Azure#2) Fix: Relax verification for release from times(1) to atLeast(1)/atMost(2), matching acquire and create patterns. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

- TestSuiteBase.truncateCollection: Add null guards for collection and altLink to prevent NPE when @BeforeSuite initialization fails - ClientMetricsTest: Increase timeout from 40s to 80s for effectiveMetricCategoriesForDefault and effectiveMetricCategoriesForAllLatebound - ClientRetryPolicyE2ETests: Relax duration assertions from 5s to 10s for dataPlaneRequestHitsLeaseNotFoundInFirstPreferredRegion to accommodate CI latency - OrderbyDocumentQueryTest: Add retry logic with 3 retries for transient 408/429/503 errors during container creation in @BeforeClass setup Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- ReproTest: Use isGreaterThanOrEqualTo(1000) instead of isEqualTo(1000) since the test uses a shared container that may have leftover docs - ClientRetryPolicyE2ETests: Increase timeOut from TIMEOUT to TIMEOUT*2 for dataPlaneRequestHitsLeaseNotFoundInFirstPreferredRegion and dataPlaneRequestHitsLeaseNotFoundAndResourceThrottleFirstPreferredRegion to prevent ThreadTimeoutException in CI Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add retry with fixedDelay(3, 5s) for transient 408/429/503 errors to: - createCollection (3 overloads) - safeCreateDatabase - createDatabase - createDatabaseIfNotExists These methods are called from @BeforeClass/@BeforeSuite of most test classes. Transient failures during resource creation cascade into dozens of test failures when the setup method fails without retry. The isTransientCreateFailure helper checks for CosmosException with status codes 408 (RequestTimeout), 429 (TooManyRequests), or 503 (ServiceUnavailable). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1. ConsistencyTests1.validateSessionContainerAfterCollectionCreateReplace: - Added missing altLink to SHARED_DATABASE_INTERNAL initialization - BridgeInternal.getAltLink(createdDatabase) returned null causing IllegalArgumentException - altLink should be "dbs/{databaseId}" matching selfLink format 2. ResourceTokenTest.readDocumentFromResouceToken: - Added FlakyTestRetryAnalyzer for transient ServiceUnavailableException 503 errors - Resource token operations can fail transiently in CI due to service load 3. ReproTest.runICM497415681OriginalReproTest: - Added FlakyTestRetryAnalyzer for off-by-one failures (1000 vs 1001) - Uses shared container without cleanup, leftover documents from previous tests cause count mismatches - Retry analyzer handles transient data contamination Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

… verification Test expects updateProperties to be called exactly once, but it's never called in the race condition scenario. Root cause analysis: - updateProperties is only called when second addOrUpdateLease finds worker still running (checkTask != null) - If worker has stopped (checkTask == null), acquire is called instead - In CI, timing often results in worker stopping before second addOrUpdateLease - This produces: 2×acquire, 2×release, 0×updateProperties (not 1×updateProperties) Fix: Changed verification from times(1) to atMost(1) to accept both outcomes: - 0 calls (worker stopped, took acquire path both times) - 1 call (worker still running on second addOrUpdateLease, took updateProperties path) This completes the handleMerge race condition fix across all lease manager operations. Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

FabianMeiswinkel · 2026-02-22T08:50:56Z

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/OrderbyDocumentQueryTest.java

+                .filter(throwable -> {
+                    if (throwable instanceof CosmosException) {
+                        int statusCode = ((CosmosException) throwable).getStatusCode();
+                        return statusCode == 408 || statusCode == 429 || statusCode == 503;


503 would very likely be that region does not have enough resources/capacity? That would fail all the time. Not sure whether it would really ever help. Keeping 408/429 is probably fine.

I thought the same but noticed through our design docs that we retry on 503s internally in our SDKs, so thought it would be fair to retry on setup methods as well. I will confirm if it really helps or not by verifying the live tests and see if we observe 503s during those and if this helps.

FabianMeiswinkel · 2026-02-22T08:51:36Z

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java

+    private static boolean isTransientCreateFailure(Throwable t) {
+        if (t instanceof CosmosException) {
+            int statusCode = ((CosmosException) t).getStatusCode();
+            return statusCode == 408 || statusCode == 429 || statusCode == 503;


FabianMeiswinkel

LGTM except for 503 retry on CreateContainer

…ure-sdk-for-java into kushagrathapar/fix-additional-flaky-cosmos-tests

Copilot

Pull request overview

This PR addresses flakiness in Cosmos DB tests by adding retry analyzers, adjusting timeouts, and implementing systematic transient error handling for resource creation.

Changes:

Added FlakyTestRetryAnalyzer to multiple tests prone to timing-related failures in CI
Increased timeouts for setup operations and tests that create collections (20s→60s, 40s→80s)
Added retry logic with 3 attempts and 5-second delays for transient errors (408/429/503) in TestSuiteBase resource creation methods
Relaxed timing assertions and added null guards to prevent cascading failures

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
IncrementalChangeFeedProcessorTest.java	Added `FlakyTestRetryAnalyzer` to 5 tests and improved comments with timing explanations
TestSuiteBase.java	Added transient error retry to all resource creation methods and null guards in `truncateCollection`
ResourceTokenTest.java	Added `FlakyTestRetryAnalyzer` to `readDocumentFromResouceToken` test
OrderbyDocumentQueryTest.java	Added retry logic for container creation in `@BeforeClass` setup
NonStreamingOrderByQueryVectorSearchTest.java	Increased `SETUP_TIMEOUT` from 20s to 60s
ContainerCreateDeleteWithSameNameTest.java	Added 1s delay after bulk operations to ensure indexing completes
ClientRetryPolicyE2ETests.java	Increased timeouts and relaxed duration assertions from 5s to 10s
PartitionControllerImplTests.java	Relaxed verify expectations to handle race conditions in merge scenarios
SessionTest.java	Increased `TIMEOUT` from 20s to 60s
FaultInjectionServerErrorRuleOnDirectTests.java	Increased address refresh deadline from 5s to 10s
ReproTest.java	Changed assertions to `isGreaterThanOrEqualTo` and added `FlakyTestRetryAnalyzer`
PerPartitionCircuitBreakerE2ETests.java	Added lazy initialization helpers for data providers and added `FlakyTestRetryAnalyzer`
ExcludeRegionTests.java	Replaced fixed sleeps with retry-based replication verification
CosmosItemWriteRetriesTest.java	Added `FlakyTestRetryAnalyzer` to `createItem` test
CosmosDiagnosticsTest.java	Added 100ms delay after item creation to ensure replication
ClientMetricsTest.java	Increased timeouts from `TIMEOUT` to `SETUP_TIMEOUT` or `TIMEOUT * 2`
SparkE2EWriteITest.scala	Added `eventually` block for async metric updates
PointWriterSubpartitionITest.scala	Changed `maxRetryCount` from 0 to 3
PointWriterITest.scala	Changed `maxRetryCount` from 0 to 3 and added `eventually` block for metrics

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/TestSuiteBase.java

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/rx/ClientRetryPolicyE2ETests.java

...java/com/azure/cosmos/implementation/changefeed/epkversion/PartitionControllerImplTests.java

...cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/cris/querystuckrepro/ReproTest.java

...re-cosmos-tests/src/test/java/com/azure/cosmos/rx/ContainerCreateDeleteWithSameNameTest.java

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/ExcludeRegionTests.java

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/CosmosDiagnosticsTest.java

…x/TestSuiteBase.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…x/ClientRetryPolicyE2ETests.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…mplementation/changefeed/epkversion/PartitionControllerImplTests.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…ris/querystuckrepro/ReproTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…xcludeRegionTests.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- ContainerCreateDeleteWithSameNameTest: Replace 1000ms fixed sleep with polling loop that queries until all bulk items are indexed (up to 10 retries with 500ms intervals) - CosmosDiagnosticsTest: Replace 100ms fixed sleep with retry-based read verification to confirm item creation is propagated before testing with wrong partition key Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…mplTests Fixes compilation error: cannot find symbol at line 215 where timeout(2000) was used without the corresponding static import. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add timeout(2000) to release() and handlePartitionGone() verifications so they wait for the async worker to complete instead of failing immediately when the operations haven't executed yet. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kushagraThapar · 2026-02-23T01:28:52Z

/azp run java - cosmos - tests

azure-pipelines · 2026-02-23T01:29:20Z

Azure Pipelines successfully started running 1 pipeline(s).

- ReproTest: Add testRunId field to documents and filter query to isolate from other tests sharing the same container (root cause: SELECT * FROM c returns data from concurrent tests, inflating count from 1000 to 3005) - CosmosNotFoundTests: Add retryAnalyzer and increase container deletion wait from 5s to 15s for cache propagation (sub-status 0 vs 1003) - FaultInjectionServerErrorRuleOnDirectTests: Add retryAnalyzer for LeaseNotFound test (address refresh race condition in diagnostics) - ClientRetryPolicyE2ETests: Add retryAnalyzer for LeaseNotFound test (transient 503 ServiceUnavailableException) - ClientMetricsTest: Add SuperFlakyTestRetryAnalyzer to endpointMetricsAreDurable (40s timeout flakiness) - StoredProcedureUpsertReplaceTest: Add retryAnalyzer to executeStoredProcedure (40s timeout) - TriggerUpsertReplaceTest: Increase setup timeout from SETUP_TIMEOUT to 2*SETUP_TIMEOUT for cleanUpContainer (60s insufficient under load) - WorkflowTest: Add retry loop for collection creation in setup (408 ReadTimeout during createCollection) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kushagraThapar · 2026-02-23T04:30:17Z

/azp run java - cosmos - tests

azure-pipelines · 2026-02-23T04:30:52Z

Azure Pipelines successfully started running 1 pipeline(s).

Use eventually block to poll readAllItems() until all 5000 items are indexed and visible via query, instead of asserting immediately after flushAndClose(). This handles the case where indexing has not completed for all items when the query executes (4999 vs 5000). Consistent with the pattern used for metrics polling in the same test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- ExcludeRegionTests: Fix IllegalArgumentException by changing OperationType.Head to OperationType.Read in replication check. performDocumentOperation does not handle Head, causing all 28 parameterized variants to fail deterministically. - ClientMetricsTest.replaceItem: Add SuperFlakyTestRetryAnalyzer (40s timeout) - DocumentQuerySpyWireContentTest: Double setup timeout for 429 throttling - QueryValidationTests: Add retryAnalyzer to queryOptionNullValidation and queryLargePartitionKeyOn100BPKCollection (40s timeouts) - FITests_queryAfterCreation already has retryAnalyzer (transient 408) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI and others added 20 commits February 17, 2026 22:39

Initial plan

54def19

Fix flaky tests - improve timing and assertions

a63af21

Co-authored-by: kushagraThapar <14034156+kushagraThapar@users.noreply.github.com>

Merge branch 'main' into copilot/fix-flaky-tests-java-ci

5736061

github-actions bot added the Cosmos label Feb 21, 2026

Copilot AI and others added 2 commits February 22, 2026 02:26

FabianMeiswinkel reviewed Feb 22, 2026

View reviewed changes

FabianMeiswinkel approved these changes Feb 22, 2026

View reviewed changes

kushagraThapar changed the title ~~Kushagrathapar/fix additional flaky cosmos tests~~ Fix flaky Cosmos DB tests and critical NullPointerException bugs in CI Feb 22, 2026

Merge branch 'copilot/fix-flaky-tests-java-ci' of github.com:Azure/az…

9f99207

…ure-sdk-for-java into kushagrathapar/fix-additional-flaky-cosmos-tests

kushagraThapar mentioned this pull request Feb 22, 2026

Fix flaky Cosmos DB tests and critical NullPointerException bugs in CI #48025

Closed

kushagraThapar marked this pull request as ready for review February 22, 2026 17:13

kushagraThapar requested a review from kirankumarkolli as a code owner February 22, 2026 17:13

Copilot AI review requested due to automatic review settings February 22, 2026 17:13

kushagraThapar requested review from a team as code owners February 22, 2026 17:13

Copilot started reviewing on behalf of kushagraThapar February 22, 2026 17:14 View session

Copilot AI reviewed Feb 22, 2026

View reviewed changes

kushagraThapar and others added 8 commits February 22, 2026 14:47

Update sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/r…

a6cc421

…x/TestSuiteBase.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/r…

cb6394e

…x/ClientRetryPolicyE2ETests.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/i…

085e502

…mplementation/changefeed/epkversion/PartitionControllerImplTests.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/c…

b227221

…ris/querystuckrepro/ReproTest.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/E…

e01f179

…xcludeRegionTests.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Add missing static import for Mockito.timeout in PartitionControllerI…

2a5ecf9

…mplTests Fixes compilation error: cannot find symbol at line 215 where timeout(2000) was used without the corresponding static import. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kushagraThapar and others added 2 commits February 23, 2026 08:28

Comments

Conversation

kushagraThapar commented Feb 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Fix additional flaky Cosmos DB tests and add transient error retry to TestSuiteBase

Summary

Changes (5 files, +63/-15)

Commit 1: Fix additional flaky tests beyond PR #48025

Commit 2: Fix ReproTest and additional timeouts

Commit 3: Add transient error retry to TestSuiteBase create methods

Root Causes Addressed

CI Impact

Testing

Uh oh!

FabianMeiswinkel Feb 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kushagraThapar Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

FabianMeiswinkel Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

FabianMeiswinkel left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kushagraThapar commented Feb 23, 2026

Uh oh!

azure-pipelines bot commented Feb 23, 2026

Uh oh!

kushagraThapar commented Feb 23, 2026

Uh oh!

azure-pipelines bot commented Feb 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kushagraThapar commented Feb 21, 2026 •

edited

Loading

FabianMeiswinkel Feb 22, 2026 •

edited

Loading