Skip to content

fix flaky compaction test#19157

Open
cecemei wants to merge 5 commits intoapache:masterfrom
cecemei:flaky
Open

fix flaky compaction test#19157
cecemei wants to merge 5 commits intoapache:masterfrom
cecemei:flaky

Conversation

@cecemei
Copy link
Contributor

@cecemei cecemei commented Mar 13, 2026

fix flaky test:

  • switch to index task from kafka ingestion
  • wait for new segment to show up in BrokerServerView before querying for total rows

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cecemei cecemei marked this pull request as ready for review March 13, 2026 23:53
@gianm
Copy link
Contributor

gianm commented Mar 15, 2026

A flake on this same test happened in the checks for this PR:

2026-03-14T00:22:35.8238939Z [ERROR] org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.test_minorCompactionWithMSQ(PartitionsSpec)[1] -- Time elapsed: 14.15 s <<< FAILURE!
2026-03-14T00:22:35.8240292Z org.opentest4j.AssertionFailedError: expected: <2000> but was: <2500>
2026-03-14T00:22:35.8241262Z 	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
2026-03-14T00:22:35.8242119Z 	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
2026-03-14T00:22:35.8242956Z 	at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
2026-03-14T00:22:35.8243603Z 	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
2026-03-14T00:22:35.8244240Z 	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
2026-03-14T00:22:35.8245464Z 	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
2026-03-14T00:22:35.8247116Z 	at org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.waitUntilPublishedRecordsAreIngested(CompactionSupervisorTest.java:337)
2026-03-14T00:22:35.8249294Z 	at org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.test_minorCompactionWithMSQ(CompactionSupervisorTest.java:255)

@cecemei
Copy link
Contributor Author

cecemei commented Mar 16, 2026

A flake on this same test happened in the checks for this PR:

2026-03-14T00:22:35.8238939Z [ERROR] org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.test_minorCompactionWithMSQ(PartitionsSpec)[1] -- Time elapsed: 14.15 s <<< FAILURE!
2026-03-14T00:22:35.8240292Z org.opentest4j.AssertionFailedError: expected: <2000> but was: <2500>
2026-03-14T00:22:35.8241262Z 	at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
2026-03-14T00:22:35.8242119Z 	at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
2026-03-14T00:22:35.8242956Z 	at org.junit.jupiter.api.AssertEquals.failNotEqual(AssertEquals.java:197)
2026-03-14T00:22:35.8243603Z 	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:150)
2026-03-14T00:22:35.8244240Z 	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:145)
2026-03-14T00:22:35.8245464Z 	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:531)
2026-03-14T00:22:35.8247116Z 	at org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.waitUntilPublishedRecordsAreIngested(CompactionSupervisorTest.java:337)
2026-03-14T00:22:35.8249294Z 	at org.apache.druid.testing.embedded.compact.CompactionSupervisorTest.test_minorCompactionWithMSQ(CompactionSupervisorTest.java:255)

yes this is due to intermediatePersistPeriod set to PT10M, sometimes only 500 events are persisted to the segment and the next 500 events + next 1000 events are persisted to another segment but in this case processed events metric would be 2500.

@kfaraz
Copy link
Contributor

kfaraz commented Mar 16, 2026

If the test is slow enough to hit the intermediatePersistPeriod of PT10M, may be we should reconsider the approach.

Should we just use batch append to simplify the test and make it more deterministic (and faster)?

@cecemei
Copy link
Contributor Author

cecemei commented Mar 16, 2026

actually it's not due to intermediatePersistPeriod, i'm not sure why but i think supervisor is shutting down tasks constantly, probably due to No task in pending completion taskGroup[0] succeeded before completion timeout elapsed, and completion timeout is set to 5s in tests.

this waitUntilPublishedRecordsAreIngested is used in multiple tests, e.x. FaultyClusterTest. i wonder they are also flaky or maybe it's because i updated the schema to inflate the segment size which made the test flaky somehow.

@kfaraz
Copy link
Contributor

kfaraz commented Mar 16, 2026

actually it's not due to intermediatePersistPeriod, i'm not sure why but i think supervisor is shutting down tasks constantly, probably due to No task in pending completion taskGroup[0] succeeded before completion timeout elapsed, and completion timeout is set to 5s in tests.

If that is the case, you could try increasing the completionTimeout and the taskDuration (I think the test currently uses 500ms). But that probably still doesn't guarantee that you would end up with the correct number of segments.

You could either just use batch append instead of a Kafka supervisor.
OR
Relax the assertions on the segment count and just verify that a minor compaction has actually occurred.

FYI, #19151 updates the KafkaClusterMetricsTest to run Kafka supervisor with minor compaction. So, I think we may skip trying to use Kafka supervisor in the CompactionSupervisorTest for the time being.

@cecemei
Copy link
Contributor Author

cecemei commented Mar 17, 2026

actually it's not due to intermediatePersistPeriod, i'm not sure why but i think supervisor is shutting down tasks constantly, probably due to No task in pending completion taskGroup[0] succeeded before completion timeout elapsed, and completion timeout is set to 5s in tests.

If that is the case, you could try increasing the completionTimeout and the taskDuration (I think the test currently uses 500ms). But that probably still doesn't guarantee that you would end up with the correct number of segments.

You could either just use batch append instead of a Kafka supervisor. OR Relax the assertions on the segment count and just verify that a minor compaction has actually occurred.

FYI, #19151 updates the KafkaClusterMetricsTest to run Kafka supervisor with minor compaction. So, I think we may skip trying to use Kafka supervisor in the CompactionSupervisorTest for the time being.

updated to use an index task instead of kafka, PTAL!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants