Click refresh when minion is down #9865

Closed

Contributor

@maximenoel8 maximenoel8 commented Feb 27, 2025

What does this PR change?

Fixes an issue where the Salt service is sometimes down on the minion, causing failures when checking the event. In most cases, refreshing the event resolves the problem.

This PR introduces a default behavior to prevent flaky tests. If the "Minion down" message appears during the event check, the process will automatically click the refresh button and restart the event check. Since these actions are performed within repeat_until_timeout, there is no risk of an infinite loop (verified through testing).
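The bounded retry described above can be sketched roughly as follows. This is a simplified, self-contained illustration, not the test suite's code: `bounded_repeat` is a hypothetical stand-in for `repeat_until_timeout`, and `FakeEventPage` is an invented stub that replaces the browser session.

```ruby
# Hypothetical stand-in for the suite's repeat_until_timeout helper:
# runs the block until it returns a truthy value or the deadline passes.
def bounded_repeat(timeout:, interval: 0.01)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "Timed out after #{timeout} seconds"
    end
    sleep interval
  end
end

# Invented stub of the event page: reports "minion down" for the first
# `failures` checks and counts how often the event was rescheduled.
class FakeEventPage
  attr_reader :reschedules

  def initialize(failures:)
    @failures = failures
    @reschedules = 0
  end

  def minion_down?
    @failures > 0
  end

  def reschedule
    @failures -= 1
    @reschedules += 1
  end

  def completed?
    !minion_down?
  end
end

# Reschedule whenever the minion is reported down; the deadline bounds the loop.
def wait_for_event(page, timeout:)
  bounded_repeat(timeout: timeout) do
    page.reschedule if page.minion_down?
    page.completed?
  end
end
```

With `FakeEventPage.new(failures: 2)` the wait succeeds after two reschedules, while a page that never recovers hits the deadline and raises instead of looping forever.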

GUI diff

No difference.

  • DONE

Documentation

  • No documentation needed: only internal and user invisible changes

  • DONE

Test coverage

ℹ️ If a major new functionality is added, it is strongly recommended that tests for the new functionality are added to the Cucumber test suite

  • No tests: already covered

  • DONE

Links

Port(s): # add downstream PR(s), if any

  • DONE

Changelogs

Make sure the changelog entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

  • No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox; it will be unchecked automatically once the test has re-run:

  • Re-run test "changelog_test"
  • Re-run test "backend_unittests_pgsql"
  • Re-run test "java_pgsql_tests"
  • Re-run test "schema_migration_test_pgsql"
  • Re-run test "susemanager_unittests"
  • Re-run test "javascript_lint"
  • Re-run test "spacecmd_unittests"

Before you merge

Check How to branch and merge properly!

@maximenoel8 maximenoel8 requested a review from a team as a code owner February 27, 2025 22:01
Contributor

👋 Hello! Thanks for contributing to our project.
Acceptance tests will take some time (approx. 1h), please be patient ☕

You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/9865/checks
Once tests finish, if they fail, you can check 👀 the cucumber report. See the link at the output of the action.
You can also check the artifacts section, which contains the logs at https://github.com/uyuni-project/uyuni/pull/9865/checks.

If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code.

Reference tests:

KNOWN ISSUES

Sometimes the build can fail when pulling new jar files from download.opensuse.org. This is a known limitation. Since this happens rarely, all you need to do when it does is rerun the test. Sorry for the inconvenience.

For more tips on troubleshooting, see the troubleshooting guide.

Happy hacking!
⚠️ You should not merge if acceptance tests fail to pass. ⚠️

@maximenoel8 maximenoel8 force-pushed the fix_minion_down_issue branch from 8c5287e to 5c98807 Compare February 27, 2025 22:12
Member

@srbarrios srbarrios left a comment

I don't like this approach because this could hide issues.
That's why we try not to use retries anywhere, so that we enforce reliability and deterministic behavior on the product side.

Comment on lines +90 to +94

if has_content?('Minion is down or could not be contacted.', wait: 3)
  find(:xpath, "//input[@value='Reschedule']").click
  step %(I wait 30 seconds until the event is picked up and #{timeout} seconds until the event "#{event}" is completed)
end
Member

@srbarrios srbarrios Feb 28, 2025

Suggested change
- if has_content?('Minion is down or could not be contacted.', wait: 3)
-   find(:xpath, "//input[@value='Reschedule']").click
-   step %(I wait 30 seconds until the event is picked up and #{timeout} seconds until the event "#{event}" is completed)
- end
+ minion_down_str = 'Minion is down or could not be contacted.'
+ raise SystemCallError, minion_down_str if has_content?(minion_down_str, wait: 0)
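
For comparison, the fail-fast check in the suggestion can be exercised outside a browser. In this sketch `FakePage` is an invented stub whose `has_content?` only mimics the call shape of the Capybara matcher, and `check_minion_up` is a hypothetical wrapper name; only the two-line check itself comes from the suggestion.

```ruby
# Invented stub: mimics the call shape of Capybara's has_content? matcher.
class FakePage
  def initialize(text)
    @text = text
  end

  # The wait option is accepted but ignored, since nothing here is asynchronous.
  def has_content?(content, wait: 0)
    @text.include?(content)
  end
end

# Fail immediately instead of rescheduling, so the failure stays visible.
def check_minion_up(page)
  minion_down_str = 'Minion is down or could not be contacted.'
  raise SystemCallError, minion_down_str if page.has_content?(minion_down_str, wait: 0)
end
```

Ruby's `SystemCallError` adds a prefix to the message when no errno is given, so code rescuing it should match on the message containing the string rather than equalling it.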

@Bischoff Bischoff self-requested a review February 28, 2025 09:45
Contributor

@Bischoff Bischoff left a comment

This "Minion is down" is most probably a product bug that should be fixed.

The solution is not to hide the problem in the test suite.

Member

@meaksh meaksh left a comment

I'm also not much of a fan of this approach, as it should not be expected that the minion is down and the action is not scheduled. Even if this change improves CI stability, it might hide potential issues.

We should probably understand why in some cases the minions are down in our CI environment.

In the past, this helped us to detect some actual problems in the product.

For instance, we detected that minion startup can be stuck for several seconds while calculating the FQDNS grains in environments with a wrong DNS setup, making the minion unresponsive during that time. If an action is scheduled within this time, it fails with the "Minion is down or could not be contacted" message.

We should probably debug why we get this now in some cases. It could be that the minion startup is taking longer than expected, or that CI is scheduling actions too soon after a minion is restarted by some previous action (a timing issue).

I would try to go in this direction.

@srbarrios
Member

If we don't have a card, we might want one for this issue, and a bug report if it differs from these others:

  • Bug 1182851 is supposed to be fixed by the move to the Salt bundle.
  • Bug 1192510 is about systems that reboot and don't process Salt events during that time. We could be hit by a reincarnation of that one.
  • Bug 1159492 was about "Valid metadata not found at specified URL". I don't think we see that message anymore.
  • Bug 1057870 was again about reboot – this time with pending jobs lost due to the reboot.
  • Bug 1207869 was due to high load caused by a customer script using our API.
  • Bug 1196081 is still open; it's our 4.3 formulas making the proxy non-operational from a Salt point of view for 10 minutes.
  • Bug 1172282 was one of our scripts overloading the Salt queue with both system upgrades and patch applications. In addition, there was again a problem due to reboot.
  • Bug 1159092 was a scalability problem fixed in the meantime.
  • Bug 1163965 was due to bad configuration on the customer side.
  • Bug 1045381 was for CaaS.

@maximenoel8 maximenoel8 closed this Mar 4, 2025