[health-check] fix performance issue and add extra enhancements #9871

Merged
meaksh merged 15 commits into health-check-skeleton from health-check-skeleton-extra-fixes on Mar 3, 2025

Conversation

meaksh
Member

@meaksh meaksh commented Feb 28, 2025

What does this PR change?

This PR fixes the performance issue found on supportconfigs, caused by an unexpectedly large number of Salt jobs. It also includes a couple of other fixes; see each commit individually.

NOTE: This PR targets the health-check-skeleton feature branch.

Changelogs

Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

  • No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:

  • Re-run test "changelog_test"
  • Re-run test "backend_unittests_pgsql"
  • Re-run test "java_pgsql_tests"
  • Re-run test "schema_migration_test_pgsql"
  • Re-run test "susemanager_unittests"
  • Re-run test "javascript_lint"
  • Re-run test "spacecmd_unittests"

Before you merge

Check How to branch and merge properly!

@meaksh meaksh requested a review from a team as a code owner February 28, 2025 17:25
@meaksh meaksh requested review from agraul and removed request for a team February 28, 2025 17:25
@meaksh meaksh force-pushed the health-check-skeleton-extra-fixes branch from cacbdeb to 6c3e875 Compare February 28, 2025 17:28
Contributor

@m-czernek m-czernek left a comment

Additionally, can you add just a brief description of what you're modifying in the Grafana board? Not sure if you're just moving/renaming stuff, or if you're removing/modifying the view.

Comment on lines 1 to -2
FROM opensuse/leap:latest
#FROM registry.suse.com/bci/python:3.11
Contributor

@ycedres did we settle in on the actual base images in the end?

Contributor

(I think Yeray is on PTO, we can probably return to this later)

@meaksh
Member Author

meaksh commented Mar 3, 2025

@m-czernek I just addressed your comments from the PR review:

About the changes made on the dashboards:

  • Refactor Salt Keys panels
  • Add Java configuration metrics
  • Add "misc" and "hw" metrics
  • Adjust panels to fit entire windows width
  • Rename sections/rows titles

Here are some screenshots:

Screenshot from 2025-03-03 10-11-30
Screenshot from 2025-03-03 10-11-43

(NOTE: the Uyuni Server and Salt row/title visible in the screenshots has been renamed to Server configuration and Salt)

Contributor

@m-czernek m-czernek left a comment

This LGTM now, adding +1. Before merging, though, please also check whether the README needs a change, since we're changing the run/clean method names. Sorry, I didn't notice this before.

Additionally, I'm not sure we want to tackle this within this PR, but I have a few suggestions regarding the dashboard.

WDYT about:

  • Removing:
    • worker threads, socket pool size, timeout, and gather job timeout
    • Java settings

Reason for this is that I'm not sure this is very helpful to viewers. What is helpful is the rules we have based on those values.

  • We might want to break out num_of_channels from miscellaneous as a separate counter (IMO this is interesting/important info to see, similar to num of CPUs and RAM).

  • I wonder if we might make some transformation on master/proxy/client, where 1 is essentially true and 0 is false. Note that we cannot modify these actual values since we use the 1 and 0 in math expressions in some rules (e.g. checking minimal requirements).

    But, we could either do a frontend transformation on the table, if possible, or do something else, e.g. create a new metric for front-end users. This way, it looks like the supportconfig has 1 server, 0 proxies and 0 clients, which might be a bit confusing.
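A minimal sketch of the "new metric for front-end users" idea: the numeric 1/0 values stay untouched for rule arithmetic, and a separate display-only mapping is derived from them. The metric names, the "yes"/"no" labels, and the CPU thresholds are assumptions for illustration, not the project's actual rules.

```python
# Hypothetical role metrics for one supportconfig: 1 means "has this role".
ROLE_METRICS = {"master": 1, "proxy": 0, "client": 0}


def display_roles(metrics):
    """Return a display-only copy with 1/0 rendered as "yes"/"no".

    The input dict is not modified, so rules can keep using the numbers.
    """
    return {name: ("yes" if value == 1 else "no") for name, value in metrics.items()}


def meets_minimum_cpus(metrics, cpus, server_min=4, other_min=2):
    """Example rule that relies on the numeric values (thresholds are made up)."""
    is_master = metrics["master"]
    required = is_master * server_min + (1 - is_master) * other_min
    return cpus >= required


if __name__ == "__main__":
    print(display_roles(ROLE_METRICS))
    print(meets_minimum_cpus(ROLE_METRICS, cpus=4))
```

Keeping the display mapping separate means the table no longer reads as "1 server, 0 proxies and 0 clients", while the math expressions in existing rules are unaffected.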

@@ -60,7 +62,7 @@ def cli(ctx: click.Context, supportconfig_path: str, verbose: bool):
     callback=utils.validate_date,
 )
 @click.pass_context
-def run(ctx: click.Context, from_datetime: str, to_datetime: str, since: int):
+def start(ctx: click.Context, from_datetime: str, to_datetime: str, since: int):
Contributor

Does this imply change in the docs? I.e. right now, we say to run health-check run ... - do we have to change it to health-check start ..., or is it just health-check ...? Either way, we'll probably need to modify the readme, right?

Member Author

That's true, we need to change the README and wiki documentation.

I'll include the documentation changes and some of your suggestions in a follow-up PR.
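One possible way to soften the rename while the README and wiki catch up is to accept the old subcommand name as an alias. The sketch below uses the standard library's argparse instead of the project's click-based CLI so that it is self-contained; the `--since` flag name and the decision to keep `run` as an alias are assumptions for illustration, not something this PR does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny CLI where "start" is the new name and "run" still works."""
    parser = argparse.ArgumentParser(prog="health-check")
    subcommands = parser.add_subparsers(dest="command", required=True)

    # "run" is registered as an alias of "start" (assumption, see lead-in).
    start = subcommands.add_parser(
        "start", aliases=["run"], help="start the health check"
    )
    # Hypothetical flag mirroring the `since: int` parameter in the diff.
    start.add_argument("--since", type=int, default=None)
    # Both spellings resolve to the same handler name.
    start.set_defaults(handler="start")
    return parser


if __name__ == "__main__":
    for argv in (["start", "--since", "7"], ["run", "--since", "7"]):
        args = build_parser().parse_args(argv)
        print(argv[0], "->", args.handler, args.since)
```

With this shape, both `health-check start ...` and `health-check run ...` reach the same code path, so the old documentation stays valid until it is updated.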

@meaksh meaksh merged commit 34bd9f1 into health-check-skeleton Mar 3, 2025
10 of 17 checks passed
@meaksh meaksh deleted the health-check-skeleton-extra-fixes branch March 3, 2025 14:02
@meaksh
Member Author

meaksh commented Mar 3, 2025

WDYT about:

  • Removing:

    • worker threads, socket pool size, timeout, and gather job timeout
    • Java settings

Reason for this is that I'm not sure this is very helpful to viewers. [...]

I think we should not remove the worker threads, socket pool size, timeout, and gather job timeout metrics, nor the Java configuration values. IMO it is worth having this complete, aggregated view of the Java configuration values in the dashboard.

Of course, if we already have alerts implemented to flag known issues, that is great, and we should have as many as possible. But I think having the actual values displayed here helps when debugging unknown issues, for which we don't yet have alerts.

We can definitely consider using different panels to display those parameters, though.

@m-czernek
Contributor

I think the problem I have is that with some metrics, you can't even guess what the proper fully-qualified name is. With some metrics, like worker threads, this is not an issue. But with something like "timeout", this is useless - what kind of timeout?

I added the Salt config in the beginning, when I didn't have a clear idea of what I was doing. The Java configs are similarly confusing.
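To illustrate the naming concern, here is a minimal sketch of deriving fully-qualified metric names from a nested configuration, so that a bare "timeout" is exported with its origin attached. The nesting, the underscore-joined naming scheme, and the values are assumptions for illustration, not the exporter's actual layout.

```python
def qualify(config, prefix=""):
    """Flatten a nested config dict into {fully_qualified_name: value}."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so each leaf carries the full path of its parents.
            flat.update(qualify(value, name))
        else:
            flat[name] = value
    return flat


if __name__ == "__main__":
    # Hypothetical values; only the key names echo the discussion above.
    config = {
        "salt": {
            "master": {
                "worker_threads": 8,
                "timeout": 120,
                "gather_job_timeout": 10,
            }
        }
    }
    for name, value in qualify(config).items():
        print(name, value)
```

A panel labelled with a qualified name such as `salt_master_timeout` answers the "what kind of timeout?" question without the viewer having to guess.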
