Skip to content

Speed up BATS tests#4478

Merged
janhoy merged 20 commits into
apache:mainfrom
janhoy:feature/slim-bats-tests
Jun 4, 2026
Merged

Speed up BATS tests#4478
janhoy merged 20 commits into
apache:mainfrom
janhoy:feature/slim-bats-tests

Conversation

@janhoy
Copy link
Copy Markdown
Contributor

@janhoy janhoy commented May 30, 2026

The BATS test suite was taking longer and longer, now about 40 mins on GH CI. This PR makes targeted, low-risk reductions without removing meaningful coverage.

Total estimated CI time saving: ~10 minutes

Slowest tests are these
image

Changes

DISCLAIMER All improvements made by Claude Code LLM, be aware of rough edges

BATS nightly test filtering

Uses BATS 1.8.2's native tag support to gate slow/network-dependent tests instead of
manual if/skip blocks inside test code.

Tests tagged with # bats test_tags=nightly are excluded by default. To include them:

./gradlew integrationTests -Psolr.bats.nightly=true

Gate OpenNLP test behind the new test_tags=nightly tag

Since this test downloads external data and is slow / experimental

test_status.bats — one shared Solr startup (~200 s saved)

Four of six tests independently started and stopped their own Solr instance. Restructured with setup_file/teardown_file so all tests share a single startup (~3 Solr startups saved).

CI timing showed "status with --short format" at 197 s — almost certainly caused by solr stop in teardown hanging when each test managed its own Solr lifecycle. The new structure uses a single teardown_file stop which eliminates this.

test_healthcheck.bats — cheaper collection setup (~15 s saved)

Both tests used solr start -e films, which loads and indexes the full films dataset. Replaced with solr start + solr create -c healthcheck_test -d _default for the cloud test, and plain solr start --user-managed for the standalone test (which fails before any collection is needed).

test_packages.batsPackageToolTest.java (~25 s saved, delete BATS file)

PackageToolTest already exercises the full package lifecycle (list-available, add-repo, install, deploy, undeploy). Added testDeployValidationMessages() to cover the two specific BATS assertions not previously tested in JUnit: collection exists but package does not → correct error message; undeploy of never-deployed package → correct error message.

test_zk.bats — remove/replace unnecessary sleep calls (~7 s saved)

Removed six sleep 1 calls that were redundant: solr start already waits until Solr is ready before returning, and solr zk cp is synchronous. Replaced one sleep 1 after solr zk upconfig with wait_for that polls the configsets REST endpoint instead of relying on a fixed delay.

test_start_solr.bats — cap solr stop wait to 30 s (~150 s saved)

CI timing showed test 105 ("deprecated system properties converted to modern properties") taking 196 s. The culprit was solr stop --all in teardown() hanging until the default ~180 s timeout expired. Added SOLR_STOP_WAIT=30 so teardown gives up after 30 s instead of 180 s, saving ~150 s per occurrence.

test_example_noprompt.bats merged into test_example.bats (~78 s saved, delete BATS file)

test_example_noprompt.bats contained a single test (start -e cloud works with --no-prompt) that did a full two-node cloud startup, identical command to the existing test_example.bats test. Merged both into test_example.bats using setup_file/teardown_file so the two-node cloud is started once and both tests reuse it. Deletes test_example_noprompt.bats.

test_ssl.bats — replace fixed sleep 6 with wait_for polling in keystore reload test (~8 s saved)

"test keystore reload" (test 100, 75 s on CI) slept 6 s twice after each keystore file replacement to give Jetty time to detect the change via its file-scan interval. Replaced both with wait_for 30 1 solr healthcheck --solr-url https://localhost:${SOLR_PORT} so the test advances as soon as the reload completes rather than always paying 12 s. The 30 s cap keeps CI safe on slow boxes.

Testing

# JUnit
./gradlew :solr:core:test --tests "*.PackageToolTest"

janhoy added 6 commits May 30, 2026 20:31
The test downloads ONNX models from the network and its own comment
describes it as exploratory rather than a regular integration test.
Skip it by default; set OPENNLP_TESTS=true to opt in.
Previously each of the four tests that needed a running Solr started and
stopped their own instance, costing four separate Solr startups.  Switch
to a setup_file/teardown_file pair so all tests share a single startup.
The "No Solr nodes running" assertion is removed from the lifecycle test
since it is already verified by the suite-wide test_zz_cleanup.bats.
Both healthcheck tests previously used 'solr start -e films' which loads
and indexes the full films example dataset.  Replace with a plain
'solr start' plus 'solr create -c healthcheck_test -d _default' for the
cloud test, and a plain 'solr start --user-managed' for the standalone
test (which fails before any collection is needed).
PackageToolTest already tests the full lifecycle (list-available, add-repo,
install, deploy, undeploy).  Add testDeployValidationMessages() to cover
the two remaining BATS assertions:
- collection exists but package not found → "Package instance doesn't exist"
- undeploy of never-deployed package → "Package … not deployed on collection"

Delete test_packages.bats.
Add GzipCompressionTest (SolrCloudTestCase) to solr/core which verifies:
- Requests without Accept-Encoding get no Content-Encoding header
- Requests with Accept-Encoding: gzip get Content-Encoding: gzip

The minGzipSize Jetty property is lowered to 1 byte for the test cluster
so that any non-empty response body is eligible for compression.  Uses
the Jetty HttpClient already available via JettySolrRunner, consistent
with CacheHeaderTest and SecurityHeadersTest.

Delete test_compression.bats.
solr start already waits until Solr is ready before returning, so the
sleep 1 calls at the start of "listing out files" and "connecting to
solr via various solr urls" were unnecessary.

solr zk cp is synchronous; the three sleep 1 calls that followed it
before listing the copied file were also unnecessary.

The sleep 1 before the ZK_HOST env-var test had no purpose.

Replace the sleep 1 after solr zk upconfig with wait_for so we poll
until the configsets REST endpoint actually reflects the new config
instead of relying on a fixed delay.
janhoy added 2 commits May 30, 2026 20:49
Default SOLR_STOP_WAIT is ~180 s; on CI test 105 ("deprecated system
properties") was taking 196 s entirely because teardown waited for the
180 s timeout before giving up.  Set SOLR_STOP_WAIT=30 in teardown so
worst-case the stop waits 30 s, saving ~150 s per affected test.
…startup

Both tests ran 'solr start -e cloud --no-prompt'.  Merge them into a
single file with setup_file/teardown_file so the two-node cloud is
started once and both tests reuse it.  Deletes test_example_noprompt.bats.

Saves ~78 s on CI (one full two-node cloud startup removed).
@janhoy janhoy requested review from dsmiley and epugh May 30, 2026 18:55
janhoy added 2 commits May 30, 2026 21:00
…test

'test keystore reload' (test 100, 75 s on CI) slept 6 s twice to give
Jetty time to pick up the replaced keystore file.  Replace both with
wait_for so the test proceeds as soon as the reload completes rather
than always paying 12 s.  Timeout is 30 s to handle slow CI boxes.
More descriptive name that makes it clear the variable is specific to
the Solr BATS test suite.
@janhoy janhoy force-pushed the feature/slim-bats-tests branch from 9b91336 to 9aadafd Compare May 30, 2026 19:01
@janhoy
Copy link
Copy Markdown
Contributor Author

janhoy commented May 30, 2026

Estimated speedup was 10min, i.e. from 40 to 30 minutes. Reality for last run was

BUILD SUCCESSFUL in 24m 49s

So that's almost 50% improvement!

Now 10 slowest are:
image

Anyone see more bats tests that could be improved?

Copy link
Copy Markdown
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!!

SOLR_BATS_OPENNLP_TESTS=true

Sounds like we should have a general concept of "nightly" BATS. The Jenkins job for integration tests can also run these since these slower tests aren't yet too much for. The GitHub PR workflow for them shouldn't run them, however. A one-off obscure boolean effectively means this test is dead-weight forever (tests that never run are nothing but a maintenance burden).

test_compression.bats → GzipCompressionTest.java

I wrote this recently... and I'm flabbergasted that this is testable in our Java test infrastructure because our simpler JettySolrRunner doesn't configure production matters like Gzip nor the configuration in jetty XML files. This is deliberate.

Comment thread solr/core/src/test/org/apache/solr/servlet/GzipCompressionTest.java Outdated
Comment thread solr/core/src/test/org/apache/solr/servlet/GzipCompressionTest.java Outdated
@epugh
Copy link
Copy Markdown
Contributor

epugh commented May 31, 2026

An interesting things about the AI generated code comments is that I really like them to help me understand "this is the migraiton, this is what was before, this is how we do it now", but that the life span of that comment is that of the PR, once the PR is merged, I wish the comment disappeared. I think that some of the "self review" comments people put into PR's accomplish that same goal.

I also expect at some point that in a git diff you will someday see files that are added/changed/removed, and then a set of files that are marked to be deleted WHEN the PR is merged. I know I often have a notes.md file that accidentally gets committed that is my notes to myself when working on the code. It would be nice if we could label files as only lasting as long as the PR is open!

Comment thread solr/core/src/test/org/apache/solr/servlet/GzipCompressionTest.java Outdated
@janhoy janhoy marked this pull request as draft June 4, 2026 17:54

To include nightly tests, pass the Gradle property `solr.bats.nightly`:

./gradlew integrationTests -Psolr.bats.nightly=true
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsmiley wrote: Sounds like we should have a general concept of "nightly" BATS.

Did some research and wired in BATS concept of test tagging. So now we can tag slow tests as done for test_opennlp.bats and they will only run when gradle is invoked with -Psolr.bats.nightly=true.

Thus we can configure Jenkins with this prop...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have configured all integration test jobs in Jenkins with this Gradle property

@janhoy janhoy marked this pull request as ready for review June 4, 2026 18:47
solr start -e films
run solr healthcheck -c films --solr-connection http://localhost:${SOLR_PORT}/solr
# Remote
run solr healthcheck -c healthcheck_test --solr-connection http://localhost:${SOLR_PORT}/solr
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@epugh I merged the local and remote test in the same solr invocation

@janhoy
Copy link
Copy Markdown
Contributor Author

janhoy commented Jun 4, 2026

@epugh I think this is ready for merge now. Then we can do new improvement PRs later.

Copy link
Copy Markdown
Contributor

@epugh epugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.. I see the precommit failure, and checked it, and then saw it was brought by a different PR. SHIP IT!

@janhoy janhoy merged commit eaee0a2 into apache:main Jun 4, 2026
6 of 7 checks passed
@janhoy janhoy deleted the feature/slim-bats-tests branch June 4, 2026 22:51
@dsmiley
Copy link
Copy Markdown
Contributor

dsmiley commented Jun 5, 2026

Glad to see solr.bats.nightly=true but why didn't this PR actually tag our slowest integration tests (according to the bar chart)?

@janhoy
Copy link
Copy Markdown
Contributor Author

janhoy commented Jun 5, 2026

The focus for this PR was to rewrite tests to speed up as much as possible.

Then came the wish to remove or disable the opennlp test fully, which morphed into the tagging infra for vats tests.

As I answered epugh , we should do both. Make tests as fast as possible as well as moving certain kinds of tests to nightly.

The criteria for nightly could be discussed, and this PR was not aiming to do it all but to fix the worst and kickstart what will hopefully be more PRs?

@dsmiley
Copy link
Copy Markdown
Contributor

dsmiley commented Jun 5, 2026

ok... then strange to introduce solr.bats.nightly here as then it's off-topic to your topic of only making tests faster. Any way, no problem.

@dsmiley
Copy link
Copy Markdown
Contributor

dsmiley commented Jun 5, 2026

./gradlew :solr:core:test --tests "org.apache.solr.cli.PackageToolTest.testDeployValidationMessages" "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC -XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=1857D9F7F40D1569 -Ptests.useSecurityManager=true -Ptests.file.encoding=ISO-8859-1

Claude Opus's analysis:

The new testDeployValidationMessages test is broken. It fails with AssertionError: Should report missing package.

Root cause: PackageManager.deploy() outputs the error message via printRed()SolrCLI.print()CLIO.out(), which writes directly to stdout. But the test captures output through CLITestHelper.TestingRuntime, which only captures calls to its own print()/println() methods. Since printRed() bypasses the TestingRuntime entirely, captureRuntime.getOutput() is empty and the assertion at line 405 fails.

Additionally, after printRed(), runtime.exit(1) throws RuntimeException("exit() not allowed in tests"), which ToolBase.runTool() catches and prints to CLIO.err() — also not captured by the runtime.

The fix would be to route error output in PackageManager through the ToolRuntime instead of the static CLIO.out().

@janhoy
Copy link
Copy Markdown
Contributor Author

janhoy commented Jun 6, 2026

Good catch. Refactored PackageManager to print thorugh runtime in #4502

@KhushJain
Copy link
Copy Markdown
Contributor

Noticed PackageToolTest.testDeployValidationMessages fails. @dsmiley can we get that PR merged.

Thanks for fixing it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants