| 1 | +# Changelog |
| 2 | + |
| 3 | +All notable changes to this project will be documented in this file. |
| 4 | + |
| 5 | +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), |
| 6 | +and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). |
| 7 | + |
| 8 | +## [Unreleased] |
| 9 | + |
| 10 | +### Critical Fixes |
| 11 | + |
| 12 | +- **CRITICAL: PostStart Hook Fail-Safe Implementation** |
| 13 | + - **Issue**: A PostStart hook that exits with a non-zero code causes Kubernetes to kill the container, which is then restarted according to the pod's restart policy, potentially indefinitely |
| 14 | + - **Root Cause**: Multiple error conditions in `handleResetRouting()` were returning errors instead of allowing pod startup |
| 15 | + - **Impact**: During cluster formation or when Kubernetes API is unavailable, pods would enter CrashLoopBackOff |
| 16 | + - **Solution**: Made the postStart hook fully fail-safe: every error condition is now logged and the hook returns `nil` (success) |
| 17 | + - **Fail-Safe Points**: |
| 18 | +   - Kubernetes config creation failures |
| 19 | +   - Kubeconfig loading failures |
| 20 | +   - Kubernetes client creation failures |
| 21 | +   - Namespace retrieval failures |
| 22 | +   - Invalid hostname format detection |
| 23 | +   - StatefulSet retrieval failures |
| 24 | +   - Cluster readiness check failures |
| 25 | +   - SQL execution failures (already safe before this change) |
| 26 | + - **Behavior**: All failures log with "allowing pod startup" message and never prevent container from starting |
| 27 | + - **Benefits**: Eliminates circular dependency where pods can't start because postStart hook waits for cluster formation, but cluster can't form because pods won't start |
| 28 | + |
| 29 | +### Added |
| 30 | + |
| 31 | +- **Persistent Logging**: Dual logging to both STDOUT and persistent file with automatic rotation |
| 32 | + - New `--log-file` CLI flag (default: `/resource/heapdump/dc_util.log`) |
| 33 | + - Automatic file rotation when approaching 1 MB to prevent disk space issues |
| 34 | + - Fail-safe design: STDOUT logging continues even if file logging fails |
| 35 | + - Essential for debugging Kubernetes lifecycle hooks where container logs may not be accessible |
| 36 | + - Creates directory structure if it doesn't exist |
| 37 | + |
| 38 | +- **PostStart Hook Detection**: Intelligent detection of StatefulSet PostStart hooks |
| 39 | + - Automatically scans StatefulSet containers for PostStart hooks with `dc_util --reset-routing` |
| 40 | + - Prevents routing allocation changes when no PostStart hook exists to reset them |
| 41 | + - Solves historical issue where `NEW_PRIMARIES` routing allocation could not be reliably reset |
| 42 | + - Supports both single dash (`-reset-routing`) and double dash (`--reset-routing`) flag formats |
| 43 | + - Precise word boundary matching prevents false positives from similar flag names |
| 44 | + - Logs clear messages when PostStart hooks are found or missing |
| 45 | + |
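The word-boundary matching described above could be approximated as in the following sketch. It works on a plain command slice rather than a StatefulSet object so that it stays self-contained; `commandResetsRouting` and the regular expression are assumptions for illustration, not the project's `hasPostStartHookWithResetRouting()`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// resetRoutingFlag matches -reset-routing or --reset-routing only when the
// flag is delimited by whitespace or the start of the text on the left, and
// by whitespace, "=", or the end of the text on the right.
var resetRoutingFlag = regexp.MustCompile(`(^|\s)--?reset-routing(\s|=|$)`)

// commandResetsRouting reports whether a hook command invokes dc_util with
// the reset-routing flag. Hypothetical helper for illustration only.
func commandResetsRouting(command []string) bool {
	joined := strings.Join(command, " ")
	return strings.Contains(joined, "dc_util") && resetRoutingFlag.MatchString(joined)
}

func main() {
	fmt.Println(commandResetsRouting([]string{"/bin/sh", "-c", "./dc_util --reset-routing"}))       // true
	fmt.Println(commandResetsRouting([]string{"/bin/sh", "-c", "./dc_util -reset-routing"}))        // true
	fmt.Println(commandResetsRouting([]string{"/bin/sh", "-c", "./dc_util --reset-routing-table"})) // false
}
```

This sketch does not treat shell punctuation such as `;` as a right-hand boundary, so a real implementation would likely need a broader delimiter set.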
| 46 | +- **Single Node Cluster Detection**: Automatic detection and handling of single node clusters |
| 47 | + - Detects when StatefulSet has exactly 1 replica and skips decommission |
| 48 | + - Prevents unnecessary overhead and potential failures in single node deployments |
| 49 | + - Clear logging explains why decommission was skipped |
| 50 | + - Maintains existing behavior for multi-node clusters (≥2 replicas) |
| 51 | + |
| 52 | +- **Configurable Lock File Path**: New `--lock-file` CLI flag |
| 53 | + - Default: `/resource/heapdump/dc_util.lock` |
| 54 | + - Allows customization for different deployment scenarios |
| 55 | + - All lock file operations now use configurable path |
| 56 | + |
| 57 | +- **Enhanced Flag Support**: Improved command-line flag handling |
| 58 | + - Both `-reset-routing` and `--reset-routing` formats now supported |
| 59 | + - Maintains backward compatibility with existing deployments |
| 60 | + - Better error handling and validation |
| 61 | + |
| 62 | +- **Multi-Architecture Support**: Automatic CPU architecture detection in hook configurations |
| 63 | + - Hook examples now include automatic detection of x86_64/amd64 and aarch64/arm64 architectures |
| 64 | + - Downloads appropriate binary based on detected architecture (`dc_util-linux-amd64` or `dc_util-linux-arm64`) |
| 65 | + - Eliminates need for separate configuration files for different node architectures |
| 66 | + - Graceful error handling for unsupported architectures |
| 67 | + |
| 68 | +### Changed |
| 69 | + |
| 70 | +- **Routing Allocation Logic**: Enhanced PreStop process with PostStart hook detection |
| 71 | + - Routing allocation changes now only occur when corresponding PostStart hook exists |
| 72 | + - Prevents permanent cluster misconfiguration in deployments without PostStart hooks |
| 73 | + - More intelligent decision making based on actual StatefulSet configuration |
| 74 | + |
| 75 | +- **Replica Count Handling**: Early replica count check moved to beginning of decommission process |
| 76 | + - Zero replicas (scaled down): Immediately skips all operations including SQL and lock file creation |
| 77 | + - Single replica: Immediately skips all operations to prevent failures and unnecessary overhead |
| 78 | + - Multiple replicas: Proceeds with normal decommission process |
| 79 | + - Prevents routing allocation SQL, lock file creation, and timeout calculations for single-node scenarios |
| 80 | + - Better log messages explaining the decision for each scenario |
| 81 | + |
| 82 | +- **PostStart Hook Optimization**: Early replica count check in `handleResetRouting()` |
| 83 | + - Replica count check moved to very beginning of function (after lock file check) |
| 84 | + - Zero replicas: "cluster is scaled down/suspended" - removes lock file and exits |
| 85 | + - Single replica: "single node cluster" - removes lock file and exits |
| 86 | + - 2+ replicas: Proceeds with routing allocation reset operations |
| 87 | + - Prevents unnecessary cluster readiness checks and SQL operations for single-node deployments |
| 88 | + - Cleaner logic flow and more predictable behavior |
| 89 | + |
| 90 | +- **Main Decommission Optimization**: Early replica count check in `run()` function |
| 91 | + - Replica count check moved immediately after StatefulSet retrieval |
| 92 | + - Zero replicas: Exits before any SQL operations, lock file creation, or timeout calculations |
| 93 | + - Single replica: Exits before any SQL operations, lock file creation, or timeout calculations |
| 94 | + - 2+ replicas: Proceeds with full decommission process |
| 95 | + - Eliminates unnecessary routing allocation changes and lock file operations for single-node scenarios |
| 96 | + - Significantly improves efficiency and reduces log noise for scaled-down or single-node clusters |
| 97 | + |
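The three replica-count branches listed in the sections above reduce to a small decision function. The sketch below is illustrative only: `decide` is a hypothetical name, and it takes a plain `int32` rather than reading the count from a StatefulSet.

```go
package main

import "fmt"

// decide maps a StatefulSet replica count to the early-exit behavior:
// zero replicas and a single replica both skip all work, two or more proceed.
// Hypothetical helper; the reason strings echo the changelog wording.
func decide(replicas int32) (proceed bool, reason string) {
	switch {
	case replicas <= 0:
		return false, "cluster is scaled down/suspended"
	case replicas == 1:
		return false, "single node cluster"
	default:
		return true, "multi-node cluster"
	}
}

func main() {
	for _, n := range []int32{0, 1, 3} {
		proceed, reason := decide(n)
		fmt.Printf("replicas=%d proceed=%t reason=%q\n", n, proceed, reason)
	}
}
```

Running the check before any SQL, lock file, or timeout work is what lets the zero-replica and single-replica cases exit without side effects.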
| 98 | +- **Function Signatures**: Updated internal functions to support configurable paths |
| 99 | + - `createLockFile()` now accepts lock file path parameter |
| 100 | + - `removeLockFile()` now accepts lock file path parameter |
| 101 | + - `lockFileExists()` now accepts lock file path parameter |
| 102 | + - `handleResetRouting()` now accepts lock file path parameter |
| 103 | + |
| 104 | +### Improved |
| 105 | + |
| 106 | +- **Logging Experience**: Comprehensive logging improvements |
| 107 | + - All log messages now appear in both STDOUT and persistent file |
| 108 | + - Better visibility into hook execution for debugging |
| 109 | + - Historical logs available even after pod restarts |
| 110 | + - Easier troubleshooting and operations monitoring |
| 111 | + |
| 112 | +- **Documentation**: Extensively updated README.md |
| 113 | + - Added "Recent Updates" section highlighting new features |
| 114 | + - New "Replica Count Logic" section with examples |
| 115 | + - Updated CLI parameter table with new flags |
| 116 | + - Enhanced "PostStart Hook Detection" documentation |
| 117 | + - Added complete "Persistent Logging" section with usage examples |
| 118 | + - Updated sample logs sections to reflect new capabilities |
| 119 | + - All hook configuration examples now include automatic architecture detection |
| 120 | + - Clear separation between basic (preStop only) and complete (both hooks) configurations |
| 121 | + |
| 122 | +- **Testing**: Comprehensive test coverage for all new features |
| 123 | + - `TestHasPostStartHookWithResetRouting`: PostStart hook detection with various scenarios |
| 124 | + - `TestPostStopRoutingAllocationIntegration`: Integration tests for routing allocation logic |
| 125 | + - `TestLoggingIntegration`: Dual logging functionality verification |
| 126 | + - `TestLogRotation`: File rotation behavior validation |
| 127 | + - `TestSingleNodeClusterBehavior`: Single node cluster detection tests |
| 128 | + - `TestReplicaCountBehavior`: Comprehensive replica count handling tests |
| 129 | + - All existing tests updated to work with new function signatures |
| 130 | + |
| 131 | +### Technical Details |
| 132 | + |
| 133 | +- **New CLI Flags**: |
| 134 | + - `--log-file string`: Path to persistent log file (default: `/resource/heapdump/dc_util.log`) |
| 135 | + - `--lock-file string`: Path to lock file (default: `/resource/heapdump/dc_util.lock`) |
| 136 | + |
| 137 | +- **New Functions**: |
| 138 | + - `setupLogging(logFile string)`: Configures dual logging with rotation |
| 139 | + - `hasPostStartHookWithResetRouting(statefulSet *appsv1.StatefulSet)`: PostStart hook detection |
| 140 | + - Enhanced replica count logic in main decommission flow |
| 141 | + |
| 142 | +- **Dependencies**: Added `path/filepath` import for directory handling |
| 143 | + |
| 144 | +### Log Message Examples |
| 145 | + |
| 146 | +```bash |
| 147 | +# PostStart hook detection |
| 148 | +Decommissioner: No postStart hook with dc_util --reset-routing or -reset-routing found, skipping pre-stop routing allocation change |
| 149 | + |
| 150 | +# Single node cluster detection |
| 151 | +Decommissioner: Single node cluster detected (replicas=1) -- Skipping decommission |
| 152 | + |
| 153 | +# Architecture detection in hooks |
| 154 | +ARCH=$(uname -m) |
| 155 | +case "$ARCH" in |
| 156 | +  x86_64) BINARY_ARCH="amd64" ;; |
| 157 | +  aarch64) BINARY_ARCH="arm64" ;; |
| 158 | +  *) echo "Unsupported architecture: $ARCH"; exit 1 ;; |
| 159 | +esac |
| 160 | +curl -sLO "https://example.com/dc_util-linux-${BINARY_ARCH}" |
| 161 | + |
| 162 | +# Persistent logging |
| 163 | +Decommissioner: 2025/10/17 15:02:38 Using kubeconfig from /Users/walter/.kube/config |
| 164 | +# (Same message appears in both STDOUT and /resource/heapdump/dc_util.log) |
| 165 | +``` |
| 166 | + |
| 167 | +### Backward Compatibility |
| 168 | + |
| 169 | +- All existing CLI flags and behavior remain unchanged |
| 170 | +- Existing StatefulSet configurations continue to work without modification |
| 171 | +- New features are opt-in via CLI flags or automatic detection |
| 172 | +- No breaking changes to existing functionality |
| 173 | + |
| 174 | +### Benefits |
| 175 | + |
| 176 | +- **For Operations**: Persistent logs make debugging Kubernetes hooks significantly easier |
| 177 | +- **For Development**: Enhanced testing capabilities with better dry-run logging |
| 178 | +- **For Reliability**: Prevents cluster misconfigurations and single node failures |
| 179 | +- **For Maintenance**: Clear logging and automatic file rotation reduce operational overhead |
| 180 | +- **For Multi-Architecture**: Automatic architecture detection ensures compatibility across heterogeneous Kubernetes clusters |