OCPBUGS-57456: podman-etcd should keep the container for crash debugging #2062
Conversation
Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2062/1/input
Definitely seems like a step forward for debugging purposes. I am concerned about ending up with an infinite list of old containers, though. Is it possible to just keep the previous and current ones?
The existing container is deleted right before starting a new one. I could improve it to keep only the last one.
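The single-archive idea discussed here could be sketched roughly as below. This is a minimal illustration, not the agent's actual code: `archive_container` and the `-previous` suffix are made-up names.

```shell
#!/bin/sh
# Hypothetical sketch: keep exactly one archived copy of the stopped container.
archive_container() {
    name="$1"
    # Drop any older archive first so at most one copy ever accumulates.
    podman rm -f "${name}-previous" >/dev/null 2>&1 || true
    # Renaming preserves the stopped container (and its log record) for debugging.
    podman rename "$name" "${name}-previous"
}
```

Renaming rather than copying means `podman logs <name>-previous` keeps working on the archived container.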
As this PR also introduces configuration files for podman (env.yaml) and etcd (config.yaml), I also need to back up those files together with the podman container; otherwise we will be unable to check what data etcd is started with.

/hold
I really like the new file-based configuration options. I think it makes it easier to follow where the configuration options are being sourced and propagated. Excited to also have these logs preserved.
heartbeat/podman-etcd
Outdated
    FORCE_NEW_CLUSTER=false
fi

cat > "$OCF_RESKEY_podman_env_file" << EOF
For the collection of commented-out options, are these things we're leaving here because we expect to enable them down the line?
Those were leftovers I forgot to remove. However, I noticed that some env variables (e.g. etcd_data) were not being used correctly, and, thinking about it more, the env file for podman is not necessary anyway: the only thing that can really change is the etcd command line. All this to say that I need to push a version without the podman env file :)
Let's keep it on hold again. Note this line from etcd, which should be investigated further.

/hold
This change modifies the agent to keep stopped containers for log inspection and debugging, with supporting changes to enable this behavior.

* Conditionally reuse existing containers when the configuration is unchanged
* Move etcd's inline configuration flags to an external file to allow restarts without container recreation (mainly for the force-new-cluster flag)
* Archive the previous container and its configuration as *-previous before replacement; only one copy is maintained to limit disk usage

Signed-off-by: Carlo Lobrano <[email protected]>
{
    echo "cipher-suites:"
    IFS=',' read -ra cipher_array <<< "$ETCD_CIPHER_SUITES"
`<<<` is bash-specific, so you should avoid it if you can. If not, we'll have to rename the agent to .in, update the hashbang to #!@BASH_SHELL@, and add it to "configure.ac" to auto-update.
oh, I'll try to find an alternative
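One portable option is plain IFS word splitting on an unquoted expansion, which works in any POSIX shell. A minimal sketch, with a made-up function name and sample cipher suites:

```shell
#!/bin/sh
# POSIX-compatible alternative to the bash-only here-string:
#   IFS=',' read -ra cipher_array <<< "$ETCD_CIPHER_SUITES"
emit_cipher_suites() {
    old_ifs=$IFS
    IFS=','
    for suite in $1; do          # unquoted on purpose: split on commas
        printf '%s\n' "- $suite" # one YAML list item per suite
    done
    IFS=$old_ifs
}

emit_cipher_suites "TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384"
```

Saving and restoring IFS (or running the loop in a subshell) keeps the custom separator from leaking into the rest of the script.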
# Archive corresponding etcd configuration file
etcd_configuration_file_previous=${OCF_RESKEY_etcd_configuration_file//".yaml"/"-previous.yaml"}
if ocf_run cp "$OCF_RESKEY_etcd_configuration_file" "$etcd_configuration_file_previous"; then
    return "$OCF_SUCCESS"
I'm confused about this return - aren't we also planning to archive the old logs? Is it sufficient to just have the old yaml?
Also, don't we need to remove the container still?
Which logs? Aren't the logs inside the container?
Wouldn't it be a lot easier to mount a log volume?
I mean the output of podman logs {container}. I'm pretty sure they are stored in /var/log/container by default, but they are deleted when the container is deleted.
Wait, I lost the context here. You're saying that this line: https://github.com/ClusterLabs/resource-agents/pull/2062/files#diff-2600c60d13bcbda50a7ac6542e535b71029aabd73ddd86e188ea997508e1e6e7R345
renames the container, which in turn ensures that podman saves its log record at the previous location. That seems right; ignore the comment above :)
Somehow I saw the cp command but missed the rename command above it.
I saw etcd logs also in pacemaker's journalctl. I thought we wanted to keep the whole container for later inspection.
if is_force_new_cluster; then
    # The embedded newline is required for correct YAML formatting.
    FORCE_NEW_CLUSTER_CONFIG="force-new-cluster: true
force-new-cluster-bump-amount: 1000000000"
what is this for?
This is something we left out initially. It's part of the restore-quorum script in CEO. That script is more advanced and calculates the bump amount automatically; I had a simpler change for testing, which I forgot to complete.
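As a toy illustration of that conditional: the keys are emitted only during quorum recovery, with a fixed bump amount standing in for CEO's computed one. `is_force_new_cluster` and the env-variable check are stand-ins for the agent's real logic.

```shell
#!/bin/sh
# Sketch: append the force-new-cluster keys to the etcd config only when
# recovery is requested, so a normal restart leaves the config untouched.
is_force_new_cluster() { [ "${FORCE_NEW_CLUSTER:-false}" = "true" ]; }

force_new_cluster_config() {
    if is_force_new_cluster; then
        # Two YAML keys on separate lines; the bump amount here is a
        # fixed placeholder, not the value CEO would compute.
        printf 'force-new-cluster: true\nforce-new-cluster-bump-amount: 1000000000\n'
    fi
}
```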
# NOTE: the pod manifest contains some values tied to its REVISION that we want to ignore
jq_filter='del(.metadata.labels.revision) | .spec.containers[] |= ( .env |= map(select( .name != "ETCD_STATIC_POD_VERSION" ))) | .spec.volumes |= map( select( .name != "resource-dir" ))'

ocf_run diff -s <(jq "$jq_filter" "$OCF_RESKEY_pod_manifest") <(jq "$jq_filter" "$OCF_RESKEY_pod_manifest_copy")
does this print out the diff so we can see it in the runtime log?
yes
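A self-contained sketch of the same comparison, using temp files instead of bash-only process substitution so it stays POSIX-sh friendly. The jq filter is a shortened form of the one in the diff, and the two sample manifests are made up for the demo:

```shell
#!/bin/sh
# Sketch: compare two pod manifests while ignoring revision-specific fields.
command -v jq >/dev/null 2>&1 || exit 0   # skip gracefully if jq is absent
jq_filter='del(.metadata.labels.revision)'

m1=$(mktemp); m2=$(mktemp); f1=$(mktemp); f2=$(mktemp)
printf '%s\n' '{"metadata":{"labels":{"revision":"7"}},"image":"etcd"}' > "$m1"
printf '%s\n' '{"metadata":{"labels":{"revision":"8"}},"image":"etcd"}' > "$m2"

# Normalize both manifests, then diff the normalized copies.
jq "$jq_filter" "$m1" > "$f1"
jq "$jq_filter" "$m2" > "$f2"
# diff writes its result (or "identical" with -s) to stdout, so it lands
# in the runtime log when wrapped by ocf_run.
diff -s "$f1" "$f2"
```

Only the revision label differs between the samples, so after filtering the two files compare as identical.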
@@ -1206,6 +1393,7 @@ podman_start()
        fi
        ;;
    2)
        # TODO: can we start "normally", regardless of the revisions, if the container-id is the same on both nodes?
If the revision we're referring to is the revision of the static pod yaml generated by CEO, I'm pretty sure the answer is yes.
I agree, but since this was a different story, I decided it was best to at least leave a note for the future :)
/lgtm