Speed up BATS tests #4478
The test downloads ONNX models from the network and its own comment describes it as exploratory rather than a regular integration test. Skip it by default; set OPENNLP_TESTS=true to opt in.
Previously each of the four tests that needed a running Solr started and stopped their own instance, costing four separate Solr startups. Switch to a setup_file/teardown_file pair so all tests share a single startup. The "No Solr nodes running" assertion is removed from the lifecycle test since it is already verified by the suite-wide test_zz_cleanup.bats.
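A minimal sketch of the shared-startup pattern, written as plain bash so it can be traced without BATS or a Solr install. Everything here is illustrative: `solr` is a stub, the `test_*` functions stand in for `@test` blocks, and `run_file` is a hypothetical stand-in for the BATS runner, which calls `setup_file` once before the first test in a file and `teardown_file` once after the last. None of the bodies are taken from test_status.bats.

```bash
#!/usr/bin/env bash
# Stub standing in for bin/solr; it only counts how often "start" is called.
SOLR_STARTS=0
solr() {
  case "$1" in
    start)  SOLR_STARTS=$((SOLR_STARTS + 1)) ;;
    stop)   : ;;
    status) echo "running" ;;
  esac
}

# Runs once per file: one Solr startup shared by every test below.
setup_file() {
  solr start
}

# Runs once per file, after the last test.
teardown_file() {
  solr stop --all
}

# In a real .bats file these would be `@test "..." { ... }` blocks.
test_status_default() { solr status | grep -q running; }
test_status_short()   { solr status --short | grep -q running; }

# Hypothetical stand-in for the BATS runner's per-file lifecycle.
run_file() {
  setup_file
  test_status_default && test_status_short
  local rc=$?
  teardown_file
  return "$rc"
}
```

With per-test `setup`/`teardown` the start counter would grow by one for every test; with the file-level hooks it stays at one regardless of how many tests the file contains.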
Both healthcheck tests previously used 'solr start -e films' which loads and indexes the full films example dataset. Replace with a plain 'solr start' plus 'solr create -c healthcheck_test -d _default' for the cloud test, and a plain 'solr start --user-managed' for the standalone test (which fails before any collection is needed).
PackageToolTest already tests the full lifecycle (list-available, add-repo, install, deploy, undeploy). Add testDeployValidationMessages() to cover the two remaining BATS assertions: - collection exists but package not found → "Package instance doesn't exist" - undeploy of never-deployed package → "Package … not deployed on collection" Delete test_packages.bats.
Add GzipCompressionTest (SolrCloudTestCase) to solr/core which verifies: - Requests without Accept-Encoding get no Content-Encoding header - Requests with Accept-Encoding: gzip get Content-Encoding: gzip The minGzipSize Jetty property is lowered to 1 byte for the test cluster so that any non-empty response body is eligible for compression. Uses the Jetty HttpClient already available via JettySolrRunner, consistent with CacheHeaderTest and SecurityHeadersTest. Delete test_compression.bats.
solr start already waits until Solr is ready before returning, so the sleep 1 calls at the start of "listing out files" and "connecting to solr via various solr urls" were unnecessary. solr zk cp is synchronous; the three sleep 1 calls that followed it before listing the copied file were also unnecessary. The sleep 1 before the ZK_HOST env-var test had no purpose. Replace the sleep 1 after solr zk upconfig with wait_for so we poll until the configsets REST endpoint actually reflects the new config instead of relying on a fixed delay.
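The idea of polling instead of sleeping can be sketched as below. The call shape `wait_for <timeout> <interval> <command...>` is inferred from the `wait_for 30 1 solr healthcheck ...` invocation used elsewhere in this PR; the body is an assumption and not the helper that ships with the Solr BATS suite.

```bash
#!/usr/bin/env bash
# Sketch of a polling helper: retry a command until it succeeds or a
# timeout (in seconds) is reached. Not the actual Solr BATS helper.
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$((SECONDS + timeout))
  until "$@" >/dev/null 2>&1; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1   # condition never became true within the timeout
    fi
    sleep "$interval"
  done
  return 0       # condition became true; the caller proceeds immediately
}
```

A fixed `sleep 1` always costs the full second and still fails if the system needs longer, whereas a poll like this returns as soon as the condition holds and only pays the full timeout in the failure case.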
Default SOLR_STOP_WAIT is ~180 s; on CI test 105 ("deprecated system
properties") was taking 196 s entirely because teardown waited for the
180 s timeout before giving up. Set SOLR_STOP_WAIT=30 in teardown so
worst-case the stop waits 30 s, saving ~150 s per affected test.
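A sketch of what capping the stop wait in a teardown hook might look like. The exact form used in test_start_solr.bats is not shown here, so the prefix-assignment style is an assumption, and `solr` is a stub that only echoes the value it would see.

```bash
#!/usr/bin/env bash
# Stub standing in for bin/solr so the sketch runs without a Solr install.
# It reports the stop wait it would use, falling back to 180 when unset.
solr() {
  echo "solr $* (SOLR_STOP_WAIT=${SOLR_STOP_WAIT:-180})"
}

teardown() {
  # Pass a 30 s cap to the stop command instead of the ~180 s default.
  SOLR_STOP_WAIT=30 solr stop --all
}
```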
…startup Both tests ran 'solr start -e cloud --no-prompt'. Merge them into a single file with setup_file/teardown_file so the two-node cloud is started once and both tests reuse it. Deletes test_example_noprompt.bats. Saves ~78 s on CI (one full two-node cloud startup removed).
…test 'test keystore reload' (test 100, 75 s on CI) slept 6 s twice to give Jetty time to pick up the replaced keystore file. Replace both with wait_for so the test proceeds as soon as the reload completes rather than always paying 12 s. Timeout is 30 s to handle slow CI boxes.
More descriptive name that makes it clear the variable is specific to the Solr BATS test suite.
dsmiley
left a comment
Thanks for doing this!!
SOLR_BATS_OPENNLP_TESTS=true
Sounds like we should have a general concept of "nightly" BATS. The Jenkins job for integration tests can also run these, since these slower tests aren't yet too much for it. The GitHub PR workflow shouldn't run them, however. A one-off obscure boolean effectively means this test is dead weight forever (tests that never run are nothing but a maintenance burden).
test_compression.bats → GzipCompressionTest.java
I wrote this recently... and I'm flabbergasted that this is testable in our Java test infrastructure because our simpler JettySolrRunner doesn't configure production matters like Gzip nor the configuration in jetty XML files. This is deliberate.
An interesting thing about the AI-generated code comments is that I really like how they help me understand "this is the migration, this is what was before, this is how we do it now", but the life span of such a comment is that of the PR: once the PR is merged, I wish the comment disappeared. I think some of the "self review" comments people put into PRs accomplish the same goal. I also expect at some point that in a
To include nightly tests, pass the Gradle property `solr.bats.nightly`:
./gradlew integrationTests -Psolr.bats.nightly=true
@dsmiley wrote: Sounds like we should have a general concept of "nightly" BATS.
Did some research and wired in BATS concept of test tagging. So now we can tag slow tests as done for test_opennlp.bats and they will only run when gradle is invoked with -Psolr.bats.nightly=true.
Thus we can configure Jenkins with this prop...
I have configured all integration test jobs in Jenkins with this Gradle property
solr start -e films
run solr healthcheck -c films --solr-connection http://localhost:${SOLR_PORT}/solr
# Remote
run solr healthcheck -c healthcheck_test --solr-connection http://localhost:${SOLR_PORT}/solr
@epugh I merged the local and remote test in the same solr invocation
@epugh I think this is ready for merge now. Then we can do new improvement PRs later. |
epugh
left a comment
LGTM.. I see the precommit failure, and checked it, and then saw it was brought by a different PR. SHIP IT!
Glad to see |
The focus of this PR was to rewrite tests to speed them up as much as possible. Then came the wish to remove or disable the OpenNLP test fully, which morphed into the tagging infra for BATS tests. As I answered epugh, we should do both: make tests as fast as possible and move certain kinds of tests to nightly. The criteria for nightly could be discussed; this PR was not aiming to do it all, but to fix the worst and kickstart what will hopefully be more PRs.
OK... then it's strange to introduce solr.bats.nightly here, as it's off-topic to your goal of only making tests faster. Anyway, no problem.
Claude Opus's analysis:The new Root cause: Additionally, after The fix would be to route error output in |
Good catch. Refactored PackageManager to print through
Noticed Thanks for fixing it! |

The BATS test suite was taking longer and longer, now about 40 mins on GH CI. This PR makes targeted, low-risk reductions without removing meaningful coverage.
Total estimated CI time saving: ~10 minutes
Slowest tests are these

Changes
DISCLAIMER All improvements made by Claude Code LLM, be aware of rough edges
BATS nightly test filtering
Uses BATS 1.8.2's native tag support to gate slow/network-dependent tests instead of manual `if`/`skip` blocks inside test code.
Tests tagged with `# bats test_tags=nightly` are excluded by default. To include them:
`./gradlew integrationTests -Psolr.bats.nightly=true`

Gate OpenNLP test behind the new `test_tags=nightly` tag
Since this test downloads external data and is slow / experimental
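How the Gradle property is translated into a BATS invocation is not shown in this description. One plausible wiring, sketched here under that assumption, is to pass BATS's `--filter-tags` option with a negated tag unless nightly runs are requested. The helper name `bats_tag_args` is hypothetical.

```bash
#!/usr/bin/env bash
# In a .bats file, the tag is a comment directly above the test:
#
#   # bats test_tags=nightly
#   @test "..." { ... }

# Hypothetical helper: prints the extra bats arguments for a run.
bats_tag_args() {
  local nightly="${1:-false}"
  if [ "$nightly" = "true" ]; then
    # No filter: tagged and untagged tests all run.
    echo ""
  else
    # A leading '!' negates a tag, so tests tagged nightly are excluded.
    echo "--filter-tags !nightly"
  fi
}
```

Compared with an environment-variable check inside the test body, a tag keeps the gating decision out of the test code and lets one switch cover every slow test.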

`test_status.bats`: one shared Solr startup (~200 s saved)
Four of six tests independently started and stopped their own Solr instance. Restructured with `setup_file`/`teardown_file` so all tests share a single startup (~3 Solr startups saved).
CI timing showed "status with --short format" at 197 s, almost certainly caused by `solr stop` in teardown hanging when each test managed its own Solr lifecycle. The new structure uses a single `teardown_file` stop, which eliminates this.

`test_healthcheck.bats`: cheaper collection setup (~15 s saved)
Both tests used `solr start -e films`, which loads and indexes the full films dataset. Replaced with `solr start` + `solr create -c healthcheck_test -d _default` for the cloud test, and plain `solr start --user-managed` for the standalone test (which fails before any collection is needed).

`test_packages.bats` → `PackageToolTest.java` (~25 s saved, delete BATS file)
`PackageToolTest` already exercises the full package lifecycle (list-available, add-repo, install, deploy, undeploy). Added `testDeployValidationMessages()` to cover the two specific BATS assertions not previously tested in JUnit: collection exists but package does not → correct error message; undeploy of never-deployed package → correct error message.

`test_zk.bats`: remove/replace unnecessary `sleep` calls (~7 s saved)
Removed six `sleep 1` calls that were redundant: `solr start` already waits until Solr is ready before returning, and `solr zk cp` is synchronous. Replaced one `sleep 1` after `solr zk upconfig` with `wait_for` that polls the configsets REST endpoint instead of relying on a fixed delay.

`test_start_solr.bats`: cap `solr stop` wait to 30 s (~150 s saved)
CI timing showed test 105 ("deprecated system properties converted to modern properties") taking 196 s. The culprit was `solr stop --all` in `teardown()` hanging until the default ~180 s timeout expired. Added `SOLR_STOP_WAIT=30` so teardown gives up after 30 s instead of 180 s, saving ~150 s per occurrence.

`test_example_noprompt.bats` merged into `test_example.bats` (~78 s saved, delete BATS file)
`test_example_noprompt.bats` contained a single test (start -e cloud works with --no-prompt) that did a full two-node cloud startup, identical command to the existing `test_example.bats` test. Merged both into `test_example.bats` using `setup_file`/`teardown_file` so the two-node cloud is started once and both tests reuse it. Deletes `test_example_noprompt.bats`.

`test_ssl.bats`: replace fixed `sleep 6` with `wait_for` polling in keystore reload test (~8 s saved)
"test keystore reload" (test 100, 75 s on CI) slept 6 s twice after each keystore file replacement to give Jetty time to detect the change via its file-scan interval. Replaced both with `wait_for 30 1 solr healthcheck --solr-url https://localhost:${SOLR_PORT}` so the test advances as soon as the reload completes rather than always paying 12 s. The 30 s cap keeps CI safe on slow boxes.

Testing