
Cluster broken with upgrading 1.11.0 -> 1.14.0 #2852

Open
baznikin opened this issue Jan 23, 2025 · 2 comments
@baznikin

baznikin commented Jan 23, 2025

First of all, sorry for the long logs and the unstructured message. To write a clean issue you need at least some understanding of what is happening, and I have no idea yet. I read the release notes for 1.12, 1.13 and 1.14 and decided I could upgrade straight to 1.14.0. But...

After upgrading postgres-operator from 1.11.0 to 1.14.0 my clusters won't start up:

$ kubectl get postgresqls.acid.zalan.do -A
NAMESPACE            NAME                  TEAM               VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-staging   brandadmin-pg         develop            16        1      100Gi    1             500Mi            429d   SyncFailed
ga                   games-aggregator-pg   games-aggregator   16        2      125Gi    1000m         512Mi            157d   SyncFailed
payments             payments-pg           develop            16        1      20Gi     1             500Mi            457d   Running
sprint-reports       asana-automate-db     sprint             16        1      25Gi     1             500Mi            358d   Running
staging              develop-postgresql    develop            17        2      250Gi    1             2Gi              435d   UpdateFailed

Three clusters started successfully with the updated Spilo image (payments-pg, asana-automate-db and develop-postgresql) and two did not (brandadmin-pg and games-aggregator-pg). Before I noticed that not all clusters were updated, I initiated the 16 -> 17 upgrade on the develop-postgresql cluster, and it got stuck with the same symptoms (at first I thought that was the cause, but now I don't think so, see below):

2025-01-23 15:59:28,706 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

and no more logs after that.
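If I read the traceback right, write_log_environment() passes the LOG_S3_TAGS environment variable straight into json.loads(), and in my pods that variable is simply not set. A minimal sketch of what seems to be happening (the variable name is taken from the traceback, not from any docs):

import json, os

# LOG_S3_TAGS is not set in the container, so os.getenv() returns None
tags = json.loads(os.getenv('LOG_S3_TAGS'))
# TypeError: the JSON object must be str, bytes or bytearray, not NoneType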

Some clusters managed to start even though the same error is there:

$ kubectl -n sprint-reports logs asana-automate-db-0
2025-01-23 15:38:54,983 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:38:55,040 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:38:55,043 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:38:55,191 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:38:55,192 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:38:55,775 - bootstrapping - INFO - Configuring pgqd
2025-01-23 15:38:55,776 - bootstrapping - INFO - Configuring wal-e
2025-01-23 15:38:55,777 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 15:38:55,777 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 15:38:55,778 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 15:38:55,780 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 15:38:55,780 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 15:38:55,782 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 15:38:55,782 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 15:38:55,793 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 15:38:55,793 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:38:55,794 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 15:38:55,808 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring patroni
2025-01-23 15:38:55,826 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 15:38:55,827 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-23 15:38:57,916 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-23 15:38:57,974 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-23 15:38:57,995 WARNING: Postgresql is not running.
2025-01-23 15:38:57,995 INFO: Lock owner: ; I am asana-automate-db-0
2025-01-23 15:38:58,000 INFO: pg_controldata:

After I deleted this pod, it got stuck too!

Processes inside one of the failed clusters:

root@develop-postgresql-0:/home/postgres# ps ax
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.sh
      7 ?        S      0:00 /bin/sh /launch.sh
     20 ?        S      0:00 /usr/bin/runsvdir -P /etc/service
     21 ?        Ss     0:00 runsv pgqd
     22 ?        S      0:00 /bin/bash /scripts/patroni_wait.sh --role primary -- /usr/bin/pgqd /home/postgres/pgq_ticker.ini
     83 ?        S      0:00 sleep 60
     84 pts/0    Ss     0:00 bash
     97 pts/0    R+     0:00 ps ax

After one more deletion it managed to start.

I noticed one thing in the logs: sometimes the container starts with the WAL-E variables configured, sometimes not. The operator shows its status as OK, but it isn't:

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 15:38:43,529 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:38:43,587 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:38:43,588 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:38:43,726 - bootstrapping - INFO - Configuring wal-e
2025-01-23 15:38:43,727 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 15:38:43,728 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 15:38:43,728 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 15:38:43,729 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 15:38:43,729 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 15:38:43,731 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 15:38:43,731 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 15:38:43,733 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 15:38:43,733 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 15:38:43,736 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 15:38:43,736 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:38:43,736 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:38:43,910 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:38:43,910 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:38:43,931 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n brandadmin-staging get postgresqls.acid.zalan.do -A
NAMESPACE            NAME                  TEAM               VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-staging   brandadmin-pg         develop            16        1      100Gi    1             500Mi            429d   SyncFailed
ga                   games-aggregator-pg   games-aggregator   16        2      125Gi    1000m         512Mi            157d   SyncFailed
payments             payments-pg           develop            16        1      20Gi     1             500Mi            457d   Running
sprint-reports       asana-automate-db     sprint             16        1      25Gi     1             500Mi            358d   Running
staging              develop-postgresql    develop            17        2      250Gi    1             2Gi              435d   UpdateFailed

$ kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted

$ kubectl -n brandadmin-staging get pod
NAME                                                         READY   STATUS             RESTARTS         AGE
brand-admin-backend-api-7b7856c75-d2ktr                      1/1     Running            0                22h
brand-admin-backend-api-7b7856c75-vczsg                      1/1     Running            0                22h
brand-admin-backend-async-tasks-69c5876799-nm4nh             1/1     Running            0                22h
brandadmin-pg-0                                              1/2     Running            0                82s

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 15:59:27,840 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:59:27,896 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:59:27,897 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:59:28,051 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:59:28,053 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:59:28,070 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:59:28,070 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:59:28,706 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n brandadmin-staging get pod brandadmin-pg-0
NAME              READY   STATUS    RESTARTS   AGE
brandadmin-pg-0   1/2     Running   0          81m

$ kubectl -n brandadmin-staging get postgresqls.acid.zalan.do brandadmin-pg
NAME            TEAM      VERSION   PODS   VOLUME   CPU-REQUEST   MEMORY-REQUEST   AGE    STATUS
brandadmin-pg   develop   16        1      100Gi    1             500Mi            429d   Running

While I was writing this issue about an hour passed; in despair I restarted the failed pod one more time and it STARTED (the postgres container became Ready), but it is still not working:

kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted

$ kubectl  -n brandadmin-staging describe pod brandadmin-pg-0
Name:             brandadmin-pg-0
Namespace:        brandadmin-staging
Priority:         0
Service Account:  postgres-pod
Node:             pri-staging-wx2ci/10.106.0.35
Start Time:       Thu, 23 Jan 2025 18:26:41 +0100
Labels:           application=spilo
                  apps.kubernetes.io/pod-index=0
                  cluster-name=brandadmin-pg
                  controller-revision-hash=brandadmin-pg-5f65fc8dbd
                  spilo-role=master
                  statefulset.kubernetes.io/pod-name=brandadmin-pg-0
                  team=develop
Annotations:      prometheus.io/path: /metrics
                  prometheus.io/port: 9187
                  prometheus.io/scrape: true
                  status:
                    {"conn_url":"postgres://10.244.2.104:5432/postgres","api_url":"http://10.244.2.104:8008/patroni","state":"running","role":"primary","versi...
Status:           Running
IP:               10.244.2.104
IPs:
  IP:           10.244.2.104
Controlled By:  StatefulSet/brandadmin-pg
Containers:
  postgres:
    Container ID:   containerd://d67d695d8bce177e07b0ec3c23efbe59cc5349cb81e95abea6ba6e913fe7d836
    Image:          ghcr.io/zalando/spilo-17:4.0-p2
    Image ID:       ghcr.io/zalando/spilo-17@sha256:23861da069941ff5345e6a97455e60a63fc2f16c97857da8f85560370726cbe7
    Ports:          8008/TCP, 5432/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 18:26:46 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     10
      memory:  6Gi
    Requests:
      cpu:      1
      memory:   500Mi
    Readiness:  http-get http://:8008/readiness delay=6s timeout=5s period=10s #success=1 #failure=3
    Environment:
      SCOPE:                        brandadmin-pg
      PGROOT:                       /home/postgres/pgdata/pgroot
      POD_IP:                        (v1:status.podIP)
      POD_NAMESPACE:                brandadmin-staging (v1:metadata.namespace)
      PGUSER_SUPERUSER:             postgres
      KUBERNETES_SCOPE_LABEL:       cluster-name
      KUBERNETES_ROLE_LABEL:        spilo-role
      PGPASSWORD_SUPERUSER:         <set to the key 'password' in secret 'postgres.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      PGUSER_STANDBY:               standby
      PGPASSWORD_STANDBY:           <set to the key 'password' in secret 'standby.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      PAM_OAUTH2:                   https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
      HUMAN_ROLE:                   zalandos
      PGVERSION:                    16
      KUBERNETES_LABELS:            {"application":"spilo"}
      SPILO_CONFIGURATION:          {"postgresql":{"parameters":{"shared_buffers":"1536MB"}},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"dcs":{"postgresql":{"parameters":{"checkpoint_completion_target":"0.9","default_statistics_target":"100","effective_cache_size":"4608MB","effective_io_concurrency":"200","hot_standby_feedback":"on","huge_pages":"off","jit":"false","maintenance_work_mem":"384MB","max_connections":"100","max_standby_archive_delay":"900s","max_standby_streaming_delay":"900s","max_wal_size":"4GB","min_wal_size":"1GB","random_page_cost":"1.1","wal_buffers":"16MB","work_mem":"7864kB"}},"failsafe_mode":true}}}
      DCS_ENABLE_KUBERNETES_API:    true
      ALLOW_NOSSL:                  true
      AWS_ACCESS_KEY_ID:            xxxx
      AWS_ENDPOINT:                 https://fra1.digitaloceanspaces.com
      AWS_SECRET_ACCESS_KEY:        xxxx
      CLONE_AWS_ACCESS_KEY_ID:      xxx
      CLONE_AWS_ENDPOINT:           https://fra1.digitaloceanspaces.com
      CLONE_AWS_SECRET_ACCESS_KEY:  xxxx
      LOG_S3_ENDPOINT:              https://fra1.digitaloceanspaces.com
      WAL_S3_BUCKET:                xxx-staging-db-wal
      WAL_BUCKET_SCOPE_SUFFIX:      /79c4fff8-6efb-477a-83bc-a43d34e8160a
      WAL_BUCKET_SCOPE_PREFIX:      
      LOG_S3_BUCKET:                xxx-staging-db-backups-all
      LOG_BUCKET_SCOPE_SUFFIX:      /79c4fff8-6efb-477a-83bc-a43d34e8160a
      LOG_BUCKET_SCOPE_PREFIX:      
    Mounts:
      /dev/shm from dshm (rw)
      /home/postgres/pgdata from pgdata (rw)
      /var/run/postgresql from postgresql-run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9mghg (ro)
  exporter:
    Container ID:   containerd://48c54ad6591eaf9e60aa92b3235cb4878900fb46e94aacfeedcb70465d005619
    Image:          quay.io/prometheuscommunity/postgres-exporter:latest
    Image ID:       quay.io/prometheuscommunity/postgres-exporter@sha256:6999a7657e2f2fb0ca6ebf417213eebf6dc7d21b30708c622f6fcb11183a2bb0
    Port:           9187/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 18:26:47 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  256Mi
    Requests:
      cpu:     100m
      memory:  200Mi
    Environment:
      POD_NAME:                             brandadmin-pg-0 (v1:metadata.name)
      POD_NAMESPACE:                        brandadmin-staging (v1:metadata.namespace)
      POSTGRES_USER:                        postgres
      POSTGRES_PASSWORD:                    <set to the key 'password' in secret 'postgres.brandadmin-pg.credentials.postgresql.acid.zalan.do'>  Optional: false
      DATA_SOURCE_URI:                      127.0.0.1:5432
      DATA_SOURCE_USER:                     $(POSTGRES_USER)
      DATA_SOURCE_PASS:                     $(POSTGRES_PASSWORD)
      PG_EXPORTER_AUTO_DISCOVER_DATABASES:  true
    Mounts:
      /home/postgres/pgdata from pgdata (rw)
      /var/run/postgresql from postgresql-run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9mghg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  pgdata:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pgdata-brandadmin-pg-0
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  postgresql-run:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  kube-api-access-9mghg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             workloadKind=postgres:NoSchedule
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  22s   default-scheduler  Successfully assigned brandadmin-staging/brandadmin-pg-0 to pri-staging-wx2ci
  Normal  Pulled     18s   kubelet            Container image "ghcr.io/zalando/spilo-17:4.0-p2" already present on machine
  Normal  Created    18s   kubelet            Created container postgres
  Normal  Started    18s   kubelet            Started container postgres
  Normal  Pulling    18s   kubelet            Pulling image "quay.io/prometheuscommunity/postgres-exporter:latest"
  Normal  Pulled     17s   kubelet            Successfully pulled image "quay.io/prometheuscommunity/postgres-exporter:latest" in 455ms (455ms including waiting). Image size: 11070758 bytes.
  Normal  Created    17s   kubelet            Created container exporter
  Normal  Started    17s   kubelet            Started container exporter

$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 17:26:47,349 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 17:26:47,407 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 17:26:47,408 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 17:26:47,460 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 17:26:47,462 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 17:26:47,462 - bootstrapping - INFO - Configuring certificate
2025-01-23 17:26:47,463 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 17:26:47,768 - bootstrapping - INFO - Configuring patroni
2025-01-23 17:26:47,792 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 17:26:47,793 - bootstrapping - INFO - Configuring wal-e
2025-01-23 17:26:47,794 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 17:26:47,794 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 17:26:47,798 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 17:26:47,798 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 17:26:47,801 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 17:26:47,802 - bootstrapping - INFO - Configuring crontab
2025-01-23 17:26:47,803 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 17:26:47,816 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-23 17:26:47,817 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-23 17:26:47,817 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 17:26:47,817 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 17:26:47,818 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-23 17:26:49,683 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-23 17:26:49,754 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-23 17:26:49,774 WARNING: Postgresql is not running.
2025-01-23 17:26:49,775 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:26:49,781 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down
  pg_control last modified: Thu Jan 23 17:32:16 2025
  Latest checkpoint location: 5A/82000028
  Latest checkpoint's REDO location: 5A/82000028
  Latest checkpoint's REDO WAL file: 0000001B0000005A00000082
  Latest checkpoint's TimeLineID: 27
  Latest checkpoint's PrevTimeLineID: 27
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:929334
  Latest checkpoint's NextOID: 873526
  Latest checkpoint's NextMultiXactId: 19
  Latest checkpoint's NextMultiOffset: 37
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 5
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 5
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Jan 23 17:32:16 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29

2025-01-23 17:32:36,148 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:36,326 INFO: starting as a secondary
2025-01-23 17:32:36 UTC [51]: [1-1] 67927d34.33 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-23 17:32:36 UTC [51]: [2-1] 67927d34.33 0     LOG:  pg_stat_kcache.linux_hz is set to 125000
2025-01-23 17:32:36 UTC [51]: [3-1] 67927d34.33 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-23 17:32:36 UTC [51]: [4-1] 67927d34.33 0     LOG:  database system is shut down
2025-01-23 17:32:36,971 INFO: postmaster pid=51
/var/run/postgresql:5432 - no response
2025-01-23 17:32:46,146 WARNING: Postgresql is not running.
2025-01-23 17:32:46,146 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:46,149 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down
  pg_control last modified: Thu Jan 23 17:32:16 2025
  Latest checkpoint location: 5A/82000028
  Latest checkpoint's REDO location: 5A/82000028
  Latest checkpoint's REDO WAL file: 0000001B0000005A00000082
  Latest checkpoint's TimeLineID: 27
  Latest checkpoint's PrevTimeLineID: 27
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:929334
  Latest checkpoint's NextOID: 873526
  Latest checkpoint's NextMultiXactId: 19
  Latest checkpoint's NextMultiOffset: 37
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 5
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 5
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Thu Jan 23 17:32:16 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29

2025-01-23 17:32:46,162 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:46,190 INFO: starting as a secondary
2025-01-23 17:32:46 UTC [62]: [1-1] 67927d3e.3e 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-23 17:32:46 UTC [62]: [2-1] 67927d3e.3e 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2025-01-23 17:32:46 UTC [62]: [3-1] 67927d3e.3e 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-23 17:32:46 UTC [62]: [4-1] 67927d3e.3e 0     LOG:  database system is shut down
2025-01-23 17:32:46,821 INFO: postmaster pid=62
/var/run/postgresql:5432 - no response
2025-01-23 17:32:56,143 WARNING: Postgresql is not running.
2025-01-23 17:32:56,144 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:56,146 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071

All of my two-node clusters fail to start the replica node; the problem is probably with the WAL variables:

$ kubectl -n staging exec -it develop-postgresql-0 -- patronictl topology
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: develop-postgresql (7369262358642845868) --------+----+-----------+
| Member                 | Host         | Role    | State   | TL | Lag in MB |
+------------------------+--------------+---------+---------+----+-----------+
| develop-postgresql-0   | 10.244.0.253 | Leader  | running | 39 |           |
| + develop-postgresql-1 |              | Replica |         |    |   unknown |
+------------------------+--------------+---------+---------+----+-----------+
$ kubectl -n staging logs develop-postgresql-0 | head -20
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 16:20:51,723 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 16:20:51,766 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 16:20:51,767 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 16:20:51,823 - bootstrapping - INFO - Configuring patroni
2025-01-23 16:20:51,846 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 16:20:51,847 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 16:20:51,847 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-23 16:20:51,848 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring crontab
2025-01-23 16:20:51,848 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 16:20:51,868 - bootstrapping - INFO - Configuring certificate
2025-01-23 16:20:51,868 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 16:20:53,422 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 16:20:53,423 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main

$ kubectl -n staging exec -it develop-postgresql-0 -- tail /home/postgres/pgdata/pgroot/pg_log/postgresql-4.log
Defaulted container "postgres" out of: postgres, exporter
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
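So on this pod the WAL-E env directory was never created at all. A quick way to compare pods is just listing the directory mentioned in the logs above; on a healthy pod it should contain the WALE_*/WALG_* files written by configure_spilo.py:

$ kubectl -n staging exec develop-postgresql-0 -- ls /run/etc/wal-e.d/env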

$ kubectl -n staging logs develop-postgresql-1 | head -20
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 16:38:15,383 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 16:38:15,424 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 16:38:15,424 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 16:38:15,473 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 16:38:15,473 - bootstrapping - INFO - Configuring crontab
2025-01-23 16:38:15,473 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 16:38:15,482 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 16:38:15,482 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType

$ kubectl -n staging exec -it develop-postgresql-1 -- tail /home/postgres/pgdata/pgroot/pg_log/postgresql-4.log
Defaulted container "postgres" out of: postgres, exporter
        STRUCTURED: time=2025-01-23T16:30:08.235670-00 pid=8215 action=push-wal key=s3://xxx-staging-db-wal/spilo/develop-postgresql/939ea78b-0caf-458f-a088-989352a97300/wal/16/wal_005/00000026000006700000009C.lzo prefix=spilo/develop-postgresql/939ea78b-0caf-458f-a088-989352a97300/wal/16/ rate=18353.3 seg=00000026000006700000009C state=complete
2025-01-23 16:30:12 UTC [8234]: [5-1] 67926e94.202a 0     LOG:  ending log output to stderr
2025-01-23 16:30:12 UTC [8234]: [6-1] 67926e94.202a 0     HINT:  Future log output will go to log destination "csvlog".
ERROR: 2025/01/23 16:30:12.698764 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:13.204088 Archive '00000026000006700000009D' does not exist.
ERROR: 2025/01/23 16:30:13.573033 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:13.845528 Archive '00000028.history' does not exist.
ERROR: 2025/01/23 16:30:14.117082 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:14.478060 Archive '00000027000006700000009D' does not exist.
ERROR: 2025/01/23 16:30:14.807988 Archive '00000026000006700000009D' does not exist.

$ kubectl -n staging describe pod develop-postgresql-0
Name:             develop-postgresql-0
Namespace:        staging
Priority:         0
Service Account:  postgres-pod
Node:             pri-staging-wx2cv/10.106.0.46
Start Time:       Thu, 23 Jan 2025 17:20:44 +0100
Labels:           application=spilo
                  apps.kubernetes.io/pod-index=0
                  cluster-name=develop-postgresql
                  controller-revision-hash=develop-postgresql-5f869975bf
                  spilo-role=master
                  statefulset.kubernetes.io/pod-name=develop-postgresql-0
                  team=develop
Annotations:      prometheus.io/path: /metrics
                  prometheus.io/port: 9187
                  prometheus.io/scrape: true
                  status:
                    {"conn_url":"postgres://10.244.0.253:5432/postgres","api_url":"http://10.244.0.253:8008/patroni","state":"running","role":"primary","versi...
Status:           Running
IP:               10.244.0.253
IPs:
  IP:           10.244.0.253
Controlled By:  StatefulSet/develop-postgresql
Containers:
  postgres:
    Container ID:   containerd://5004728ea5d71484a313b6124f2534a839da5ef0527427cec1942f135aa33e93
    Image:          ghcr.io/zalando/spilo-17:4.0-p2
    Image ID:       ghcr.io/zalando/spilo-17@sha256:23861da069941ff5345e6a97455e60a63fc2f16c97857da8f85560370726cbe7
    Ports:          8008/TCP, 5432/TCP, 8080/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP
    State:          Running
      Started:      Thu, 23 Jan 2025 17:20:50 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     10
      memory:  13500Mi
    Requests:
      cpu:      1
      memory:   2Gi
    Readiness:  http-get http://:8008/readiness delay=6s timeout=5s period=10s #success=1 #failure=3
    Environment:
      SCOPE:                        develop-postgresql
      PGROOT:                       /home/postgres/pgdata/pgroot
      POD_IP:                        (v1:status.podIP)
      POD_NAMESPACE:                staging (v1:metadata.namespace)
      PGUSER_SUPERUSER:             postgres
      KUBERNETES_SCOPE_LABEL:       cluster-name
      KUBERNETES_ROLE_LABEL:        spilo-role
      PGPASSWORD_SUPERUSER:         <set to the key 'password' in secret 'postgres.develop-postgresql.credentials.postgresql.acid.zalan.do'>  Optional: false
      PGUSER_STANDBY:               standby
      PGPASSWORD_STANDBY:           <set to the key 'password' in secret 'standby.develop-postgresql.credentials.postgresql.acid.zalan.do'>  Optional: false
      PAM_OAUTH2:                   https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
      HUMAN_ROLE:                   zalandos
      PGVERSION:                    17
      KUBERNETES_LABELS:            {"application":"spilo"}
      SPILO_CONFIGURATION:          {"postgresql":{"parameters":{"shared_buffers":"3GB","shared_preload_libraries":"bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache,decoderbufs"}},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"dcs":{"postgresql":{"parameters":{"checkpoint_completion_target":"0.9","default_statistics_target":"100","effective_cache_size":"9GB","effective_io_concurrency":"200","hot_standby_feedback":"on","huge_pages":"off","jit":"false","maintenance_work_mem":"768MB","max_connections":"200","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_standby_archive_delay":"900s","max_standby_streaming_delay":"900s","max_wal_size":"4GB","max_worker_processes":"8","min_wal_size":"1GB","random_page_cost":"1.1","wal_buffers":"16MB","wal_level":"logical","work_mem":"4MB"}},"failsafe_mode":true}}}
      DCS_ENABLE_KUBERNETES_API:    true
      ALLOW_NOSSL:                  true
      AWS_ACCESS_KEY_ID:            xxx
      AWS_ENDPOINT:                 https://fra1.digitaloceanspaces.com
      AWS_SECRET_ACCESS_KEY:        xxx
      CLONE_AWS_ACCESS_KEY_ID:      xxx
      CLONE_AWS_ENDPOINT:           https://fra1.digitaloceanspaces.com
      CLONE_AWS_SECRET_ACCESS_KEY:  xxx
      LOG_S3_ENDPOINT:              https://fra1.digitaloceanspaces.com
      WAL_S3_BUCKET:                xxx-staging-db-wal
      WAL_BUCKET_SCOPE_SUFFIX:      /939ea78b-0caf-458f-a088-989352a97300
      WAL_BUCKET_SCOPE_PREFIX:      
      LOG_S3_BUCKET:                xxx-staging-db-backups-all
      LOG_BUCKET_SCOPE_SUFFIX:      /939ea78b-0caf-458f-a088-989352a97300
      LOG_BUCKET_SCOPE_PREFIX:      
    Mounts:

It's a complete mess!

The operator is installed with Helm via Terraform and configured with a ConfigMap:

resource "kubectl_manifest" "postgres-pod-config" {
  yaml_body = <<-EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: postgres-pod-config
      namespace: ${var.namespace}
    data:
      ALLOW_NOSSL: "true"
      # WAL archiving and physical basebackups for PITR
      AWS_ENDPOINT: ${local.s3_endpoint}
      AWS_SECRET_ACCESS_KEY: ${local.s3_secret_key}
      AWS_ACCESS_KEY_ID: ${local.s3_access_id}
      # default values for cloning a cluster (same as above)
      CLONE_AWS_ENDPOINT: ${local.clone_s3_endpoint}
      CLONE_AWS_SECRET_ACCESS_KEY: ${local.clone_s3_secret_key}
      CLONE_AWS_ACCESS_KEY_ID: ${local.clone_s3_access_id}
      # send pg_logs to s3 (work in progress)
      LOG_S3_ENDPOINT: ${local.s3_endpoint}
    EOF
}
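Judging from the traceback, configure_spilo.py calls json.loads() on LOG_S3_TAGS unconditionally once the LOG_S3_* settings are present, but nothing in my setup defines that variable. As an experiment (untested, purely my assumption; I have not verified whether this is a supported knob) it could be fed valid JSON through the same ConfigMap:

    data:
      # ...existing keys unchanged...
      LOG_S3_ENDPOINT: ${local.s3_endpoint}
      # hypothetical workaround: give configure_spilo.py valid JSON so that
      # json.loads(os.getenv('LOG_S3_TAGS')) no longer receives None
      LOG_S3_TAGS: '{}'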

resource "helm_release" "postgres-operator" {
  name       = "postgres-operator"
  namespace  = var.namespace
  chart      = "postgres-operator"
  repository = "https://opensource.zalando.com/postgres-operator/charts/postgres-operator"
  version    = "1.14.0"

  depends_on = [kubectl_manifest.postgres-pod-config]

  dynamic "set" {
    for_each = var.wal_backup ? ["yes"] : []
    content {
      name  = "configAwsOrGcp.wal_s3_bucket"
      value = local.bucket_name_wal
    }
  }

  dynamic "set" {
    for_each = var.log_backup ? ["yes"] : []
    content {
      name  = "configAwsOrGcp.log_s3_bucket"
      value = "${var.name}-db-backups-all" # bucket with logical backups; 15 days ttl
    }
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_access_key_id"
    value = local.s3_access_id
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_bucket"
    value = local.bucket_name_backups
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_region"
    value = var.bucket_region
  }

  set {
    name  = "configLogicalBackup.logical_backup_s3_endpoint"
    value = local.s3_endpoint
  }

  set {
    name  = "configKubernetes.pod_environment_configmap"
    value = "${var.namespace}/postgres-pod-config"
  }
  set {
    name  = "configLogicalBackup.logical_backup_s3_secret_access_key"
    value = local.s3_secret_key
  }

  values = [<<-YAML
    configConnectionPooler:
      connection_pooler_image: "registry.xxx.com/devops/postgres-zalando-pgbouncer:master-32"

    configLogicalBackup:
      logical_backup_docker_image: "registry.xxx.com/devops/postgres-logical-backup:0.6"
      logical_backup_schedule: "32 8 * * *"
      logical_backup_s3_retention_time: "2 week"

    configKubernetes:
      enable_pod_antiaffinity: true
      # it doesn't influence pulling of images from public repos (like operator image) if there is no such secret
      # but will help to fetch postgres-logical-backup image
      pod_service_account_definition: |
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: postgres-pod
        imagePullSecrets:
          - name: gitlab-registry-token
      # became disabled by default since 1.9.0 https://github.com/zalando/postgres-operator/releases/tag/v1.9.0
      # Quote: "We recommend enable_readiness_probe: true with pod_management_policy: parallel"
      enable_readiness_probe: true
      # Quote: "We recommend enable_readiness_probe: true with pod_management_policy: parallel"
      pod_management_policy: "parallel"
      enable_sidecars: true
      share_pgsocket_with_sidecars: true
      custom_pod_annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "9187"

    configPatroni:
      # https://patroni.readthedocs.io/en/master/dcs_failsafe_mode.html
      enable_patroni_failsafe_mode: true

    configGeneral:
      sidecars:
        - name: exporter
          image: quay.io/prometheuscommunity/postgres-exporter:latest
          ports:
            - name: exporter
              containerPort: 9187
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 200Mi
          env:
            - name: DATA_SOURCE_URI
              value: "127.0.0.1:5432"
            - name: DATA_SOURCE_USER
              value: "$(POSTGRES_USER)"
            - name: DATA_SOURCE_PASS
              value: "$(POSTGRES_PASSWORD)"
            - name: PG_EXPORTER_AUTO_DISCOVER_DATABASES
              value: "true"
  YAML
  ]
}
@baznikin changed the title from "Issues with upgrading 1.11.0 -> 1.14.0" to "Cluster broken with upgrading 1.11.0 -> 1.14.0" on Jan 23, 2025
@baznikin
Author

Downgrading to 1.11.0 resolved my issues.
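(For reference, the downgrade is nothing more than pinning the chart version back in the Terraform above; same helm_release resource, only the version changed:)

resource "helm_release" "postgres-operator" {
  # ...everything else as in the original resource...
  version = "1.11.0"   # rolled back from 1.14.0
}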

@baznikin
Author

Update.
I tried a sequential upgrade 1.11.0 -> 1.12.2 -> 1.13.0 -> 1.14.0 on the same k8s cluster. Up to 1.13.0, no issues.
With 1.14.1:

  1. At first, master replicas started successfully:
    a. clusters consisting of 1 replica were rebooted with ghcr.io/zalando/spilo-17:4.0-p2 and started up successfully;
    b. in clusters consisting of 2 replicas, the 1st replica booted up OK, the second got stuck with this error:
2025-01-24 16:59:12,840 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 16:59:12,853 INFO: Local timeline=43 lsn=679/8E0616D0
2025-01-24 16:59:12,888 INFO: primary_timeline=43
2025-01-24 16:59:12,888 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 16:59:12,889 INFO: starting as a secondary
2025-01-24 16:59:13 UTC [166]: [1-1] 6793c6e1.a6 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-24 16:59:13 UTC [166]: [2-1] 6793c6e1.a6 0     LOG:  pg_stat_kcache.linux_hz is set to 500000
2025-01-24 16:59:13 UTC [166]: [3-1] 6793c6e1.a6 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-24 16:59:13 UTC [166]: [4-1] 6793c6e1.a6 0     LOG:  database system is shut down
2025-01-24 16:59:13,616 INFO: postmaster pid=166
+ Cluster: develop-postgresql (7369262358642845868) -------------+----+-----------+
| Member                 | Host         | Role    | State        | TL | Lag in MB |
+------------------------+--------------+---------+--------------+----+-----------+
| develop-postgresql-1   | 10.244.0.119 | Leader  | running      | 43 |           |
| + develop-postgresql-0 | 10.244.0.207 | Replica | start failed |    |   unknown |
+------------------------+--------------+---------+--------------+----+-----------+

After killing the failed pod it booted up OK:

$ kubectl -n staging delete pod develop-postgresql-0
pod "develop-postgresql-0" deleted
$ kubectl --context fxg-betsofa-staging -n staging exec -it develop-postgresql-1 -- patronictl topology
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: develop-postgresql (7369262358642845868) ----------+----+-----------+
| Member                 | Host         | Role    | State     | TL | Lag in MB |
+------------------------+--------------+---------+-----------+----+-----------+
| develop-postgresql-1   | 10.244.0.119 | Leader  | running   | 43 |           |
| + develop-postgresql-0 | 10.244.0.199 | Replica | streaming | 43 |         0 |
+------------------------+--------------+---------+-----------+----+-----------+

logs:

  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 51d6c13aa188b64e24e9e0f8a7bd6c8ce18ec69719a92728f12fe51d80b6b22b

2025-01-24 17:00:21,562 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 17:00:21,575 INFO: Local timeline=43 lsn=679/8E0616D0
2025-01-24 17:00:21,613 INFO: primary_timeline=43
2025-01-24 17:00:21,614 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 17:00:21,791 INFO: starting as a secondary
2025-01-24 17:00:22 UTC [58]: [1-1] 6793c726.3a 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-24 17:00:22 UTC [58]: [2-1] 6793c726.3a 0     LOG:  pg_stat_kcache.linux_hz is set to 500000
2025-01-24 17:00:22,598 INFO: postmaster pid=58
/var/run/postgresql:5432 - no response
2025-01-24 17:00:22 UTC [58]: [3-1] 6793c726.3a 0     LOG:  redirecting log output to logging collector process
2025-01-24 17:00:22 UTC [58]: [4-1] 6793c726.3a 0     HINT:  Future log output will appear in directory "../pg_log".
2025-01-24 17:00:22,826 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 17:00:22,872 INFO: restarting after failure in progress
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - rejecting connections
/var/run/postgresql:5432 - accepting connections
2025-01-24 17:00:25,747 INFO: Lock owner: develop-postgresql-1; I am develop-postgresql-0
2025-01-24 17:00:25,748 INFO: establishing a new patroni heartbeat connection to postgres
2025-01-24 17:00:25,918 INFO: Dropped unknown replication slot 'debezium'
2025-01-24 17:00:25,918 WARNING: Dropping physical replication slot develop_postgresql_1 because of its xmin value 83467746
2025-01-24 17:00:26,020 INFO: no action. I am (develop-postgresql-0), a secondary, and following a leader (develop-postgresql-1)
2025-01-24 17:00:32,875 INFO: no action. I am (develop-postgresql-0), a secondary, and following a leader (develop-postgresql-1)
  2. If I restart a pod of a cluster running 1 replica (i.e. one that started successfully right after the operator upgrade), it fails to start:
kubectl --context fxg-betsofa-staging -n brandadmin-staging exec brandadmin-pg-0 -- patronictl topology
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: brandadmin-pg (7369539194529993100) ----+----+-----------+
| Member          | Host        | Role   | State   | TL | Lag in MB |
+-----------------+-------------+--------+---------+----+-----------+
| brandadmin-pg-0 | 10.244.2.68 | Leader | running | 31 |           |
+-----------------+-------------+--------+---------+----+-----------+
$ kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted
2025-01-24 17:04:21,354 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-24 17:04:21,403 - bootstrapping - INFO - No meta-data available for this provider
2025-01-24 17:04:21,404 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-24 17:04:21,441 - bootstrapping - INFO - Configuring patroni
2025-01-24 17:04:21,456 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-24 17:04:21,458 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-24 17:04:21,459 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-24 17:04:21,459 - bootstrapping - INFO - Configuring bootstrap
2025-01-24 17:04:21,459 - bootstrapping - INFO - Configuring crontab
2025-01-24 17:04:21,459 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-24 17:04:21,468 - bootstrapping - INFO - Configuring pgqd
2025-01-24 17:04:21,468 - bootstrapping - INFO - Configuring standby-cluster
2025-01-24 17:04:21,468 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-24 17:04:23,238 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-24 17:04:23,272 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-24 17:04:23,284 WARNING: Postgresql is not running.
2025-01-24 17:04:23,285 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-24 17:04:23,287 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down
  pg_control last modified: Fri Jan 24 17:04:07 2025
  Latest checkpoint location: 5A/B9000028
  Latest checkpoint's REDO location: 5A/B9000028
  Latest checkpoint's REDO WAL file: 0000001F0000005A000000B9
  Latest checkpoint's TimeLineID: 31
  Latest checkpoint's PrevTimeLineID: 31
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:931095
  Latest checkpoint's NextOID: 873929
  Latest checkpoint's NextMultiXactId: 19
  Latest checkpoint's NextMultiOffset: 37
  Latest checkpoint's oldestXID: 717
  Latest checkpoint's oldestXID's DB: 5
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 5
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Fri Jan 24 17:04:07 2025
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending loc's timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 8192
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29

2025-01-24 17:04:23,299 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-24 17:04:23,464 INFO: starting as a secondary
2025-01-24 17:04:23 UTC [56]: [1-1] 6793c817.38 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-24 17:04:23 UTC [56]: [2-1] 6793c817.38 0     LOG:  pg_stat_kcache.linux_hz is set to 1000000
2025-01-24 17:04:23 UTC [56]: [3-1] 6793c817.38 0     FATAL:  could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-24 17:04:23 UTC [56]: [4-1] 6793c817.38 0     LOG:  database system is shut down
2025-01-24 17:04:24,002 INFO: postmaster pid=56
/var/run/postgresql:5432 - no response
2025-01-24 17:04:33,296 WARNING: Postgresql is not running.
2025-01-24 17:04:33,296 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-24 17:04:33,300 INFO: pg_controldata:
  pg_control version number: 1300
  Catalog version number: 202307071
  Database system identifier: 7369539194529993100
  Database cluster state: shut down

After one more restart it hangs early:

Defaulted container "postgres" out of: postgres, exporter
2025-01-24 17:18:07,151 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-24 17:18:07,213 - bootstrapping - INFO - No meta-data available for this provider
2025-01-24 17:18:07,214 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-24 17:18:07,291 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-24 17:18:07,292 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-24 17:18:07,292 - bootstrapping - INFO - Configuring standby-cluster
2025-01-24 17:18:07,294 - bootstrapping - INFO - Configuring crontab
2025-01-24 17:18:07,295 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-24 17:18:07,311 - bootstrapping - INFO - Configuring bootstrap
2025-01-24 17:18:07,311 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
  File "/scripts/configure_spilo.py", line 1197, in <module>
    main()
  File "/scripts/configure_spilo.py", line 1159, in main
    write_log_environment(placeholders)
  File "/scripts/configure_spilo.py", line 794, in write_log_environment
    tags = json.loads(os.getenv('LOG_S3_TAGS'))
  File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
  3. If I delete the master replica in a working cluster consisting of 2 nodes, the ex-master fails to start with the same symptoms.

So eventually, after a series of pod restarts, every cluster will be dead. I reverted to 1.13.0.
On version 1.13.0 I can't manage to break the clusters.
