Skip to content

Conversation

clobrano
Copy link
Collaborator

@clobrano clobrano commented Jul 23, 2025

This change modifies the agent to keep stopped containers for log inspection and debugging, with supporting changes to enable this behavior.

  • Conditionally reuse existing containers when configuration unchanged
  • Move etcd inline configuration flags to external file to allow restarts without container recreation (mainly for the force-new-cluster flag)
  • Archive previous container, and configuration, as *-previous before replacement. Only one copy is maintained to limit disk usage.

Copy link

knet-jenkins bot commented Jul 23, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/1/input

Copy link

@jaypoulz jaypoulz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definetely seems like a step forward for debugging purposes. I am concerned about ending up with an infinite list of old containers though. Is is possible to just keep the previous and current ones?

@clobrano
Copy link
Collaborator Author

The existing container is deleted right before starting a new one. I could improve it to keep the last one only

Copy link

knet-jenkins bot commented Jul 29, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/2/input

@clobrano clobrano changed the title [WIP] OCPBUGS-57456: podman-etcd should keep the container for crash debugging OCPBUGS-57456: podman-etcd should keep the container for crash debugging Jul 30, 2025
@clobrano clobrano marked this pull request as ready for review July 30, 2025 07:33
@clobrano
Copy link
Collaborator Author

clobrano commented Aug 1, 2025

As this PR also introduces configuration files for podman (env.yaml) and etcd (config.yaml), I also need to backup those files together with the podman container, otherwise we will be unable to check what data etcd is started with.

/hold

@clobrano clobrano marked this pull request as draft August 1, 2025 09:15
Copy link

@jaypoulz jaypoulz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the new file-based configuration options. It think it makes it easier to follow along to where the configuration options are being sourced and propagated. Excited to also have these logs preserved.

FORCE_NEW_CLUSTER=false
fi

cat > "$OCF_RESKEY_podman_env_file" << EOF
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the collection of commented-out options, are these things we're leaving here because we expect to enable them down the line?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those were leftovers I forgot to remove. However I noticed that some env variables (e.g. etcd_data) were not correctly used, and, thinking more about it, the env file for podman is not necessary anyway, as the only thing that can really change is the etcd command line. This to say that I need to push a version without podman env file :)

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from 48e121c to d5da5f6 Compare August 12, 2025 09:52
Copy link

knet-jenkins bot commented Aug 12, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/3/input

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from d5da5f6 to e6f68d0 Compare August 12, 2025 10:06
Copy link

knet-jenkins bot commented Aug 12, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/4/input

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from e6f68d0 to 4c021cc Compare August 12, 2025 14:25
Copy link

knet-jenkins bot commented Aug 12, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/5/input

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from 4c021cc to 1353f02 Compare August 12, 2025 14:26
@clobrano clobrano marked this pull request as ready for review August 12, 2025 14:27
Copy link

knet-jenkins bot commented Aug 12, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/6/input

@clobrano
Copy link
Collaborator Author

Let's keep it on hold again. Notice this line from etcd that should be investigated more

{"level":"info","ts":"2025-08-14T12:14:04.430103Z","caller":"etcdmain/config.go:370","msg":"loaded server configuration, other configuration command line flags and environment variables will be ignored if provided","path":"/var/lib/etcd/config.yaml"}

@clobrano
Copy link
Collaborator Author

/hold

Copy link

knet-jenkins bot commented Aug 27, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/7/input

This change modifies the agent to keep stopped containers for log
inspection and debugging, with supporting changes to enable this
behavior.

* Conditionally reuse existing containers when configuration unchanged
* Move etcd inline configuration flags to external file to allow
  restarts without container recreation (mainly for the
  force-new-cluster flag)
* Archive previous container, and configuration, as *-previous before
  replacement. Only one copy is maintained to limit disk usage.

Signed-off-by: Carlo Lobrano <[email protected]>

{
echo "cipher-suites:"
IFS=',' read -ra cipher_array <<< "$ETCD_CIPHER_SUITES"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<<< is bash-specific, so you should avoid it if you can. If not we'll have to rename it to .in, update the hashbang to #!@BASH_SHELL@ and add it to "configure.ac" to auto-update.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I'll try to find an alternative

@clobrano clobrano force-pushed the ocpbugs-57456-tnf-podman-etcd-keep-container-for-crash-debugging branch from d886489 to c353e06 Compare August 27, 2025 14:37
Copy link

knet-jenkins bot commented Aug 27, 2025

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/8/input

# Archive corresponding etcd configuration file
etcd_configuration_file_previous=${OCF_RESKEY_etcd_configuration_file//".yaml"/"-previous.yaml"}
if ocf_run cp "$OCF_RESKEY_etcd_configuration_file" "$etcd_configuration_file_previous"; then
return "$OCF_SUCCESS"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about this return - aren't we also planning to archive the old logs? Is it sufficient to just have the old yaml?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, don't we need to remove the container still?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which logs? Aren't the logs inside the container?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldnt it be a lot easier to mount a log volume?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the output of podman logs {container}
I'm pretty sure they are stored in /var/log/container by default, but they are deleted when then container is deleted.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait I lost the context here. You're saying that this line: https://github.com/ClusterLabs/resource-agents/pull/2062/files#diff-2600c60d13bcbda50a7ac6542e535b71029aabd73ddd86e188ea997508e1e6e7R345

Renames the container, which in turn will ensure that podman saves its log record at the previous location. That seems right, ignore the comment above :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow I saw the cp command but missing the rename command above it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw etcd logs also in pacemaker's journactl. I thought we wanted to keep the whole container for later inspection.

if is_force_new_cluster; then
# The embedded newline is required for correct YAML formatting.
FORCE_NEW_CLUSTER_CONFIG="force-new-cluster: true
force-new-cluster-bump-amount: 1000000000"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something we left out initially. It's part of the restore-quorum script in CEO. The script is more advanced, calculating the bump automatically. I had a simpler change for testing, which I forgot to complete.

openshift/cluster-etcd-operator@7a5c891

# NOTE: the pod manifest contains some values referred to its REVISION that we want to ignore
jq_filter='del(.metadata.labels.revision) | .spec.containers[] |= ( .env |= map(select( .name != "ETCD_STATIC_POD_VERSION" ))) | .spec.volumes |= map( select( .name != "resource-dir" ))'

ocf_run diff -s <(jq "$jq_filter" "$OCF_RESKEY_pod_manifest") <(jq "$jq_filter" "$OCF_RESKEY_pod_manifest_copy")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this print out the diff so we can see it in the runtime log?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -1206,6 +1393,7 @@ podman_start()
fi
;;
2)
# TODO: can we start "normally", regardless the revisions, if the container-id is the same on both nodes?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the revision we're referring to is the revision of the static pod yaml generated by CEO, I'm pretty sure the answer is yes.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but since this was a different story, I decided it was best to at least leave a message for the future :)

Copy link

@jaypoulz jaypoulz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@clobrano clobrano marked this pull request as draft August 28, 2025 05:57
@clobrano clobrano marked this pull request as ready for review August 28, 2025 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants