OCPBUGS-57456: podman-etcd should keep the container for crash debugging #2062
Conversation
Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/1/input
Definitely seems like a step forward for debugging purposes. I am concerned about ending up with an infinite list of old containers, though. Is it possible to just keep the previous and current ones?
The existing container is deleted right before starting a new one. I could improve it to keep only the last one.
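The single-archive idea discussed here could be sketched roughly as below. This is a minimal illustration, not the agent's actual code: `archive_container` and the `-previous` suffix are made-up names.

```shell
#!/bin/sh
# Hypothetical sketch: keep exactly one archived copy of the stopped container.
archive_container() {
    name="$1"
    # Drop any older archive first so at most one copy ever accumulates.
    podman rm -f "${name}-previous" >/dev/null 2>&1 || true
    # Renaming preserves the stopped container (and its log record) for debugging.
    podman rename "$name" "${name}-previous"
}
```

Renaming rather than copying means `podman logs <name>-previous` keeps working on the archived container.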
As this PR also introduces configuration files for podman (env.yaml) and etcd (config.yaml), I also need to back up those files together with the podman container; otherwise we will be unable to check what data etcd is started with.

/hold
I really like the new file-based configuration options. I think it makes it easier to follow where the configuration options are being sourced and propagated. Excited to also have these logs preserved.
heartbeat/podman-etcd
Outdated
    FORCE_NEW_CLUSTER=false
fi

cat > "$OCF_RESKEY_podman_env_file" << EOF
For the collection of commented-out options, are these things we're leaving here because we expect to enable them down the line?
Those were leftovers I forgot to remove. However, I noticed that some env variables (e.g. etcd_data) were not being used correctly, and, thinking about it more, the env file for podman is not necessary anyway: the only thing that can really change is the etcd command line. All this to say that I need to push a version without the podman env file :)
Let's keep it on hold again. Note this line from etcd, which should be investigated further.

/hold
This change modifies the agent to keep stopped containers for log inspection and debugging, with supporting changes to enable this behavior.

* Conditionally reuse existing containers when the configuration is unchanged
* Move etcd's inline configuration flags to an external file to allow restarts without container recreation (mainly for the force-new-cluster flag)
* Archive the previous container and its configuration as *-previous before replacement; only one copy is maintained to limit disk usage

Signed-off-by: Carlo Lobrano <[email protected]>
{
    echo "cipher-suites:"
    IFS=',' read -ra cipher_array <<< "$ETCD_CIPHER_SUITES"
`<<<` is bash-specific, so you should avoid it if you can. If not, we'll have to rename the agent to .in, update the hashbang to #!@BASH_SHELL@, and add it to "configure.ac" to auto-update.
oh, I'll try to find an alternative
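One portable option is plain IFS word splitting on an unquoted expansion, which works in any POSIX shell. A minimal sketch, with a made-up function name and sample cipher suites:

```shell
#!/bin/sh
# POSIX-compatible alternative to the bash-only here-string:
#   IFS=',' read -ra cipher_array <<< "$ETCD_CIPHER_SUITES"
emit_cipher_suites() {
    old_ifs=$IFS
    IFS=','
    for suite in $1; do          # unquoted on purpose: split on commas
        printf '%s\n' "- $suite" # one YAML list item per suite
    done
    IFS=$old_ifs
}

emit_cipher_suites "TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384"
```

Saving and restoring IFS (or running the loop in a subshell) keeps the custom separator from leaking into the rest of the script.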
# Archive corresponding etcd configuration file
etcd_configuration_file_previous=${OCF_RESKEY_etcd_configuration_file//".yaml"/"-previous.yaml"}
if ocf_run cp "$OCF_RESKEY_etcd_configuration_file" "$etcd_configuration_file_previous"; then
    return "$OCF_SUCCESS"
I'm confused about this return - aren't we also planning to archive the old logs? Is it sufficient to just have the old yaml?
Also, don't we need to remove the container still?
Which logs? Aren't the logs inside the container?
Wouldn't it be a lot easier to mount a log volume?
I mean the output of podman logs {container}. I'm pretty sure they are stored in /var/log/container by default, but they are deleted when the container is deleted.
Wait, I lost the context here. You're saying that this line: https://github.com/ClusterLabs/resource-agents/pull/2062/files#diff-2600c60d13bcbda50a7ac6542e535b71029aabd73ddd86e188ea997508e1e6e7R345
renames the container, which in turn ensures that podman saves its log record at the previous location. That seems right; ignore the comment above :)
Somehow I saw the cp command but missed the rename command above it.
I saw etcd logs also in pacemaker's journalctl. I thought we wanted to keep the whole container for later inspection.
if is_force_new_cluster; then
    # The embedded newline is required for correct YAML formatting.
    FORCE_NEW_CLUSTER_CONFIG="force-new-cluster: true
force-new-cluster-bump-amount: 1000000000"
what is this for?
This is something we left out initially. It's part of the restore-quorum script in CEO. That script is more advanced and calculates the bump amount automatically; I had a simpler change for testing, which I forgot to complete.
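As a toy illustration of that conditional: the keys are emitted only during quorum recovery, with a fixed bump amount standing in for CEO's computed one. `is_force_new_cluster` and the env-variable check are stand-ins for the agent's real logic.

```shell
#!/bin/sh
# Sketch: append the force-new-cluster keys to the etcd config only when
# recovery is requested, so a normal restart leaves the config untouched.
is_force_new_cluster() { [ "${FORCE_NEW_CLUSTER:-false}" = "true" ]; }

force_new_cluster_config() {
    if is_force_new_cluster; then
        # Two YAML keys on separate lines; the bump amount here is a
        # fixed placeholder, not the value CEO would compute.
        printf 'force-new-cluster: true\nforce-new-cluster-bump-amount: 1000000000\n'
    fi
}
```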
# NOTE: the pod manifest contains some values tied to its REVISION that we want to ignore
jq_filter='del(.metadata.labels.revision) | .spec.containers[] |= ( .env |= map(select( .name != "ETCD_STATIC_POD_VERSION" ))) | .spec.volumes |= map( select( .name != "resource-dir" ))'

ocf_run diff -s <(jq "$jq_filter" "$OCF_RESKEY_pod_manifest") <(jq "$jq_filter" "$OCF_RESKEY_pod_manifest_copy")
does this print out the diff so we can see it in the runtime log?
yes
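A self-contained sketch of the same comparison, using temp files instead of bash-only process substitution so it stays POSIX-sh friendly. The jq filter is a shortened form of the one in the diff, and the two sample manifests are made up for the demo:

```shell
#!/bin/sh
# Sketch: compare two pod manifests while ignoring revision-specific fields.
command -v jq >/dev/null 2>&1 || exit 0   # skip gracefully if jq is absent
jq_filter='del(.metadata.labels.revision)'

m1=$(mktemp); m2=$(mktemp); f1=$(mktemp); f2=$(mktemp)
printf '%s\n' '{"metadata":{"labels":{"revision":"7"}},"image":"etcd"}' > "$m1"
printf '%s\n' '{"metadata":{"labels":{"revision":"8"}},"image":"etcd"}' > "$m2"

# Normalize both manifests, then diff the normalized copies.
jq "$jq_filter" "$m1" > "$f1"
jq "$jq_filter" "$m2" > "$f2"
# diff writes its result (or "identical" with -s) to stdout, so it lands
# in the runtime log when wrapped by ocf_run.
diff -s "$f1" "$f2"
```

Only the revision label differs between the samples, so after filtering the two files compare as identical.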
@@ -1206,6 +1393,7 @@ podman_start()
        fi
        ;;
    2)
        # TODO: can we start "normally", regardless of the revisions, if the container-id is the same on both nodes?
If the revision we're referring to is the revision of the static pod yaml generated by CEO, I'm pretty sure the answer is yes.
I agree, but since this was a different story, I decided it was best to at least leave a note for the future :)
/lgtm