[health-check] fix performance issue and add extra enhancements #9871
Additionally, can you add just a brief description of what you're modifying in the Grafana board? Not sure if you're just moving/renaming stuff, or if you're removing/modifying the view.
FROM opensuse/leap:latest
#FROM registry.suse.com/bci/python:3.11
@ycedres did we settle in on the actual base images in the end?
(I think Yeray is on PTO, we can probably return to this later)
Remove references to Uyuni on CLI executions
@m-czernek I just addressed your comments from the PR review: About the changes made on the dashboards:
Here are some screenshots: (NOTE: the
This LGTM now, adding +1, but before merging please also check whether the README needs a change, since we're changing the run/clean method names. Sorry, didn't notice this before.
Additionally, I'm not sure we want to tackle this within this PR, but I have a few suggestions regarding the dashboard.
WDYT about:
- Removing:
- worker threads, socket pool size, timeout, and gather job timeout
- Java settings
Reason for this is that I'm not sure this is very helpful to viewers. What is helpful is the rules we have based on those values.
- We might want to break out `num_of_channels` from miscellaneous as a separate counter (IMO this is interesting/important info to see, similar to num of CPUs and RAM).
- I wonder if we might make some transformation on `master/proxy/client`, where `1` is essentially `true` and `0` is `false`. Note that we cannot modify these actual values since we use the `1` and `0` in math expressions in some rules (e.g. checking minimal requirements). But we could either do a frontend transformation on the table, if possible, or do something else, e.g. create a new metric for front-end users. As it stands, it looks like the supportconfig has 1 server, 0 proxies and 0 clients, which might be a bit confusing.
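One way to read the front-end transformation suggested above, as a minimal sketch: leave the numeric `1`/`0` values untouched for the rules and derive a separate, display-only label for the table. The function name `role_flags_for_display`, the dict layout, and the sample values are hypothetical and not taken from the health-check code.

```python
from typing import Dict


def role_flags_for_display(flags: Dict[str, int]) -> Dict[str, str]:
    """Return a display-only copy where 1 becomes "true" and anything else "false".

    The input dict is not modified, so rules can keep using the numeric values.
    """
    return {name: "true" if value == 1 else "false" for name, value in flags.items()}


# Hypothetical sample values, for illustration only.
raw_flags = {"master": 1, "proxy": 0, "client": 0}
display_flags = role_flags_for_display(raw_flags)

# The numeric values remain usable in arithmetic (e.g. a requirements rule).
total_roles = sum(raw_flags.values())
```

Whether this belongs in a Grafana-side mapping or in a separate metric is left open here, as in the comment above.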
@@ -60,7 +62,7 @@ def cli(ctx: click.Context, supportconfig_path: str, verbose: bool):
     callback=utils.validate_date,
 )
 @click.pass_context
-def run(ctx: click.Context, from_datetime: str, to_datetime: str, since: int):
+def start(ctx: click.Context, from_datetime: str, to_datetime: str, since: int):
Does this imply a change in the docs? I.e. right now, we say to run `health-check run ...`; do we have to change it to `health-check start ...`, or is it just `health-check ...`? Either way, we'll probably need to modify the readme, right?
That's true, we need to change the README and wiki documentation.
I'll include the documentation changes and some of your suggestions in a follow-up PR.
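A minimal sketch of why the rename affects the documented invocation, assuming the subcommand is registered on a click group without an explicit name (click then derives the command name from the function name). The group and command bodies below are simplified stand-ins, not the real health-check code.

```python
import click
from click.testing import CliRunner


@click.group()
def cli():
    """Stand-in for the health-check entry point (hypothetical, simplified)."""


@cli.command()
def start():
    """Placeholder body; the real command also takes date options."""
    click.echo("starting")


runner = CliRunner()
ok = runner.invoke(cli, ["start"])  # the new subcommand name resolves
old = runner.invoke(cli, ["run"])   # the old name is no longer registered
```

Under that assumption, the documented call would become `health-check start ...`, and the old `run` form would fail with a usage error.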
I think we should not remove the worker threads, socket pool size, timeout, and gather job timeout metrics, nor the Java configuration values. IMO it is worth having this complete and aggregated view of the Java configuration values in the dashboard. Of course, if we already have alerts implemented to indicate some possible known issue, that is great, and we should have as many as possible, but I think having the actual values displayed here helps when debugging to identify unknown issues for which we don't have alerts yet. We can definitely consider using different panels to display those parameters though.
I think the problem I have is that with some metrics, you can't even guess what the proper fully-qualified name is. With some metrics, like worker threads, this is not an issue. But with something like "timeout", this is useless: what kind of timeout? I added the Salt config in the beginning when I didn't have a clear idea of what I was doing. The Java configs are similarly confusing.
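To illustrate the naming concern, here is a small sketch of one possible convention: prefix each metric with the config source it came from, so a bare "timeout" becomes unambiguous. The helper `qualified_metric_name` and the source labels are hypothetical; the PR does not define such a scheme.

```python
def qualified_metric_name(source: str, key: str) -> str:
    """Join a config source and a key into one lowercase, underscore-separated name."""
    parts = [source, key]
    return "_".join(
        part.strip().lower().replace(" ", "_").replace("-", "_") for part in parts
    )


# Hypothetical source labels, used only to show the disambiguation.
salt_timeout = qualified_metric_name("salt master", "timeout")
java_timeout = qualified_metric_name("java", "timeout")
```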
What does this PR change?
This PR fixes the performance issue found on supportconfigs caused by an unexpectedly large number of Salt jobs. Additionally, it includes a couple of other fixes. See each commit individually.
NOTE: This PR is targeting the `health-check-skeleton` feature branch.
Changelogs
Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository
If you don't need a changelog check, please mark this checkbox:
If you uncheck the checkbox after the PR is created, you will need to re-run `changelog_test` (see below).
Re-run a test
If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:
Before you merge
Check How to branch and merge properly!