fix flaky compaction test by cecemei · Pull Request #19157 · apache/druid

cecemei · 2026-03-13T23:53:39Z

Fix flaky test:

- Switch from Kafka ingestion to an index task.
- Wait for the new segment to show up in BrokerServerView before querying for total rows.
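The second change amounts to a poll-until-visible wait before the row-count query. A minimal sketch of that pattern follows, assuming a hypothetical `WaitUntil` helper; it is not part of Druid's embedded test framework, and the real test presumably uses whatever wait utility the framework already provides.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not a Druid class. */
final class WaitUntil {
    private WaitUntil() {}

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Overflow-safe comparison against the deadline.
            if (System.nanoTime() - deadlineNanos >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In the test, the condition would be a check that the newly published segment is visible to the Broker, and the total-rows query would run only after `waitUntil` returns true.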

gianm · 2026-03-15T20:01:06Z

A flake on this same test happened in the checks for this PR:

2026-03-14T00:22:35.8238939Z [ERROR] org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.test_minorCompactionWithMSQ(PartitionsSpec)[1] -- Time elapsed: 14.15 s <<< FAILURE!
2026-03-14T00:22:35.8240292Z org.opentest4j.AssertionFailedError: expected: <2000> but was: <2500>
2026-03-14T00:22:35.8241262Z 	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
2026-03-14T00:22:35.8242119Z 	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
2026-03-14T00:22:35.8242956Z 	at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
2026-03-14T00:22:35.8243603Z 	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
2026-03-14T00:22:35.8244240Z 	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
2026-03-14T00:22:35.8245464Z 	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
2026-03-14T00:22:35.8247116Z 	at org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.waitUntilPublishedRecordsAreIngested(CompactionSupervisorTest.java:337)
2026-03-14T00:22:35.8249294Z 	at org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.test_minorCompactionWithMSQ(CompactionSupervisorTest.java:255)

cecemei · 2026-03-16T00:12:13Z

Yes, this is due to intermediatePersistPeriod being set to PT10M. Sometimes only 500 events are persisted to one segment, and the next 500 events plus the following 1000 events are persisted to another segment, but in that case the processed events metric would be 2500.

kfaraz · 2026-03-16T04:17:12Z

If the test is slow enough to hit the intermediatePersistPeriod of PT10M, maybe we should reconsider the approach.

Should we just use batch append to simplify the test and make it more deterministic (and faster)?
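As a quick sanity check on the PT10M question, the sketch below compares the period with the elapsed time reported in the failure log (14.15 s). `PersistPeriodCheck` is a hypothetical name used only for illustration, and the comparison covers only the time-based trigger; it assumes nothing about other reasons a persist might happen.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not a Druid class. */
final class PersistPeriodCheck {
    private PersistPeriodCheck() {}

    /** Returns true if a run of the given length lasts at least one full period. */
    static boolean couldReachPeriod(Duration elapsed, Duration period) {
        return elapsed.compareTo(period) >= 0;
    }
}
```

With `Duration.parse("PT10M")` (600 seconds) as the period and `Duration.ofMillis(14_150)` as the elapsed time, `couldReachPeriod` returns false, so a 14-second run would not last long enough for a 10-minute timer to fire.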

cecemei · 2026-03-16T07:01:53Z

Actually, it's not due to intermediatePersistPeriod. I'm not sure why, but I think the supervisor is shutting down tasks constantly, probably due to "No task in pending completion taskGroup[0] succeeded before completion timeout elapsed", and the completion timeout is set to 5s in tests.

This waitUntilPublishedRecordsAreIngested is used in multiple tests, e.g. FaultyClusterTest. I wonder whether they are also flaky, or maybe it's because I updated the schema to inflate the segment size, which somehow made the test flaky.

kfaraz · 2026-03-16T07:16:06Z

If that is the case, you could try increasing the completionTimeout and the taskDuration (I think the test currently uses 500ms). But that probably still doesn't guarantee that you would end up with the correct number of segments.

You could either just use batch append instead of a Kafka supervisor.
OR
Relax the assertions on the segment count and just verify that a minor compaction has actually occurred.
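One way the relaxed assertion could look is sketched below. The criterion used here (fewer segments after compaction, same total row count) is an assumption made for illustration, and `MinorCompactionCheck` is a hypothetical name; the thread does not specify how the test should detect that a minor compaction occurred.

```java
/** Hypothetical helper for illustration; not a Druid class. */
final class MinorCompactionCheck {
    private MinorCompactionCheck() {}

    /**
     * Relaxed check that avoids asserting an exact segment count:
     * the number of segments must have dropped and no rows may be lost or added.
     */
    static boolean compactionOccurred(int segmentsBefore, int segmentsAfter, long rowsBefore, long rowsAfter) {
        return segmentsAfter < segmentsBefore && rowsAfter == rowsBefore;
    }
}
```

A check of this shape stays valid whether streaming ingestion happened to produce two, three, or more segments before compaction, which is the nondeterminism the exact-count assertion trips over.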

FYI, #19151 updates the KafkaClusterMetricsTest to run Kafka supervisor with minor compaction. So, I think we may skip trying to use Kafka supervisor in the CompactionSupervisorTest for the time being.

cecemei · 2026-03-17T05:44:46Z

Updated to use an index task instead of Kafka, PTAL!

test-flaky · 6a83692

cecemei marked this pull request as ready for review March 13, 2026 23:53

processed · e8d7a0e

kfaraz mentioned this pull request Mar 16, 2026: Improve extensibility of MSQ Dart engine via extensions #19127

cecemei added 3 commits March 16, 2026 16:06

batch-ingest · cec4cc2

flaky · 1f1ca3d

flaky · 789de9d

cecemei wants to merge 5 commits into apache:master from cecemei:flaky
