Click refresh when minion is down #9865

maximenoel8 · 2025-02-27T22:01:17Z

What does this PR change?

Fixes an issue where the Salt service is sometimes down on the minion, causing failures when checking the event. In most cases, refreshing the event resolves the problem.

This PR introduces a default behavior to prevent flaky tests. If the "Minion down" message appears during the event check, the process will automatically click the refresh button and restart the event check. Since these actions are performed within repeat_until_timeout, there is no risk of an infinite loop (verified through testing).
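The bounded-retry behavior described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the test suite's code: the `repeat_until_timeout` defined here is a hypothetical stand-in for the real helper (whose signature may differ), and `FakeEventPage` and `wait_for_event` are invented for illustration so the loop can run without a browser.

```ruby
# Hypothetical stand-in for the test suite's repeat_until_timeout helper.
# Assumption: the real helper also raises once the deadline has passed.
def repeat_until_timeout(timeout: 5, message: 'timed out')
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    raise message if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline

    yield
  end
end

# Invented fake page: reports "minion down" until it has been
# rescheduled `down_checks` times, then reports the event as completed.
class FakeEventPage
  attr_reader :reschedules

  def initialize(down_checks)
    @down_checks = down_checks
    @reschedules = 0
  end

  def minion_down?
    @down_checks.positive?
  end

  def reschedule
    @down_checks -= 1
    @reschedules += 1
  end

  def completed?
    !minion_down?
  end
end

# Retries the event check, rescheduling whenever the minion is reported down.
# The surrounding deadline bounds the number of retries.
def wait_for_event(page, timeout:)
  repeat_until_timeout(timeout: timeout, message: 'Event did not complete') do
    break if page.completed?

    page.reschedule if page.minion_down?
    sleep 0.01
  end
  page.reschedules
end

page = FakeEventPage.new(2)
puts wait_for_event(page, timeout: 1)
```

In this sketch, a page that never recovers makes `wait_for_event` raise once the deadline passes instead of looping forever, which is the property the description relies on.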

GUI diff

No difference.

DONE

Documentation

No documentation needed: only internal and user invisible changes
DONE

Test coverage

ℹ️ If a major new functionality is added, it is strongly recommended that tests for the new functionality are added to the Cucumber test suite

No tests: already covered
DONE

Links

Port(s): # add downstream PR(s), if any

DONE

Changelogs

Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:

Re-run test "changelog_test"
Re-run test "backend_unittests_pgsql"
Re-run test "java_pgsql_tests"
Re-run test "schema_migration_test_pgsql"
Re-run test "susemanager_unittests"
Re-run test "javascript_lint"
Re-run test "spacecmd_unittests"

Before you merge

Check How to branch and merge properly!

github-actions · 2025-02-27T22:01:28Z

👋 Hello! Thanks for contributing to our project.
Acceptance tests will take some time (approx. 1h), please be patient ☕

You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/9865/checks
Once tests finish, if they fail, you can check 👀 the cucumber report. See the link at the output of the action.
You can also check the artifacts section, which contains the logs at https://github.com/uyuni-project/uyuni/pull/9865/checks.

If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code.

Reference tests:

KNOWN ISSUES

Sometimes the build can fail when pulling new jar files from download.opensuse.org. This is a known limitation. Since this happens rarely, all you need to do when it does is rerun the test. Sorry for the inconvenience.

For more tips on troubleshooting, see the troubleshooting guide.

Happy hacking!
⚠️ You should not merge if acceptance tests fail to pass. ⚠️

srbarrios

I don't like this approach because it could hide issues.
That's why we try not to use retries anywhere, so that we enforce reliability and deterministic behavior on the product side.

srbarrios · 2025-02-28T08:30:56Z

testsuite/features/step_definitions/navigation_steps.rb

+
+    if has_content?('Minion is down or could not be contacted.', wait: 3)
+      find(:xpath, "//input[@value='Reschedule']").click
+      step %(I wait 30 seconds until the event is picked up and #{timeout} seconds until the event "#{event}" is completed)
+    end


Suggested change

-    if has_content?('Minion is down or could not be contacted.', wait: 3)
-      find(:xpath, "//input[@value='Reschedule']").click
-      step %(I wait 30 seconds until the event is picked up and #{timeout} seconds until the event "#{event}" is completed)
-    end
+    minion_down_str = 'Minion is down or could not be contacted.'
+    raise SystemCallError, minion_down_str if has_content?(minion_down_str, wait: 0)
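To make the difference from the retry approach concrete, here is a minimal sketch of the fail-fast check from the suggestion, runnable without a browser. `StubPage` and `check_event!` are hypothetical names introduced only for this illustration; in the test suite the check would call Capybara's `has_content?` on the real page.

```ruby
MINION_DOWN_STR = 'Minion is down or could not be contacted.'

# Hypothetical stub standing in for a Capybara page.
class StubPage
  def initialize(text)
    @text = text
  end

  # Mirrors the shape of Capybara's has_content?; the stub ignores `wait`.
  def has_content?(content, wait: 0)
    @text.include?(content)
  end
end

# Fails immediately when the "minion down" message is shown,
# instead of rescheduling the action and checking again.
def check_event!(page)
  raise SystemCallError, MINION_DOWN_STR if page.has_content?(MINION_DOWN_STR, wait: 0)

  :ok
end

puts check_event!(StubPage.new('Event completed'))

begin
  check_event!(StubPage.new("Event failed: #{MINION_DOWN_STR}"))
rescue SystemCallError => e
  puts "Failed fast: #{e.message}"
end
```

With `wait: 0` the check does not poll for the message, so a healthy run is not slowed down, while a down minion surfaces as a test failure rather than being retried away.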

Bischoff

This "Minion is down" is most probably a product bug that should be fixed.

The solution is not to hide the problem in the test suite.

meaksh

I'm also not a big fan of this approach, as a minion being down and the action not being scheduled should not be treated as expected. Even if this change improves CI stability, it might hide potential issues.

We should probably understand why in some cases the minions are down in our CI environment.

In the past, this helped us to detect some actual problems in the product.

For instance, we once detected that minion startup could be stuck for several seconds while calculating the FQDNs grains in environments with a wrong DNS setup, making the minion unresponsive during that time. If an action is scheduled within this window, it fails with the "Minion is down or could not be contacted" message.

We should probably debug why we get this now in some cases. It could be that minion startup is taking longer than expected, or that CI is scheduling actions too soon after a minion is restarted by some previous action (a timing issue).

I would try to go in this direction.

srbarrios · 2025-02-28T12:21:43Z

If we don't have a card for this issue, we might want one, and a bug report if it differs from these others:

Bug 1182851 is supposed to be fixed by the move to the Salt bundle.
Bug 1192510 is about systems that reboot and don't process Salt events during that time. We could be hit by a reincarnation of that one.
Bug 1159492 was about "Valid metadata not found at specified URL". I don't think we see that message anymore.
Bug 1057870 was again about reboot – this time with pending jobs lost due to the reboot.
Bug 1207869 was due to high load caused by a customer script using our API.
Bug 1196081 is still open; it's our 4.3 formulas making the proxy non-operational from a Salt point of view for 10 minutes.
Bug 1172282 was one of our scripts overloading the Salt queue with both system upgrades and patch applications. In addition, there was again a problem due to reboot.
Bug 1159092 was a scalability problem fixed in the meantime.
Bug 1163965 was due to bad configuration on the customer side.
Bug 1045381 was for CaaS.

maximenoel8 requested a review from a team as a code owner February 27, 2025 22:01

github-actions bot added testing ruby_rubocop test-framework labels Feb 27, 2025

Click refresh when minion is down

5c98807

maximenoel8 force-pushed the fix_minion_down_issue branch from 8c5287e to 5c98807 Compare February 27, 2025 22:12

ktsamis approved these changes Feb 27, 2025

srbarrios requested changes Feb 28, 2025

srbarrios reviewed Feb 28, 2025

Bischoff self-requested a review February 28, 2025 09:45

Bischoff requested changes Feb 28, 2025

meaksh reviewed Feb 28, 2025

maximenoel8 closed this Mar 4, 2025
