Commit bf7bfad

Fix hostname parsing and add tests
1 parent 08d1fca commit bf7bfad

14 files changed: +4449 -110 lines changed

.github/workflows/build-and-release-dc_util.yaml

Lines changed: 13 additions & 15 deletions
@@ -13,10 +13,8 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        # os: [linux, windows, darwin]
-        os: [linux]
-        # goarch: [amd64, arm64]
-        goarch: [amd64]
+        goos: [linux]
+        goarch: [amd64, arm64]

     steps:
       - name: Checkout Repository
@@ -31,22 +29,22 @@ jobs:
         run: |
           mkdir -p build
           cd utils/dc_util # Navigate to the directory containing dc_util.go
-          GOOS=${{ matrix.os }} GOARCH=${{ matrix.goarch }} go build -o ../../build/dc_util-${{ matrix.os }}-${{ matrix.goarch }}
+          GOOS=${{ matrix.goos }} GOARCH=${{ matrix.goarch }} go build -o ../../build/dc_util-${{ matrix.goos }}-${{ matrix.goarch }}

       - name: Generate SHA256 Checksum
         run: |
           cd build
-          if [[ "${{ matrix.os }}" == "windows" ]]; then
-            sha256sum dc_util-${{ matrix.os }}-${{ matrix.goarch }}.exe > dc_util-${{ matrix.os }}-${{ matrix.goarch }}.exe.sha256
+          if [[ "${{ matrix.goos }}" == "windows" ]]; then
+            sha256sum dc_util-${{ matrix.goos }}-${{ matrix.goarch }}.exe > dc_util-${{ matrix.goos }}-${{ matrix.goarch }}.exe.sha256
           else
-            sha256sum dc_util-${{ matrix.os }}-${{ matrix.goarch }} > dc_util-${{ matrix.os }}-${{ matrix.goarch }}.sha256
+            sha256sum dc_util-${{ matrix.goos }}-${{ matrix.goarch }} > dc_util-${{ matrix.goos }}-${{ matrix.goarch }}.sha256
           fi

       - name: Upload Binary and Checksum Artifacts
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
         with:
-          name: dc_util-${{ matrix.os }}-${{ matrix.goarch }}
-          path: build/dc_util-${{ matrix.os }}-${{ matrix.goarch }}*
+          name: dc_util-${{ matrix.goos }}-${{ matrix.goarch }}
+          path: build/dc_util-${{ matrix.goos }}-${{ matrix.goarch }}*

       - name: Clean Up Build Directory
         run: |
@@ -57,14 +55,14 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        os: [linux, windows, darwin]
+        goos: [linux]
         goarch: [amd64, arm64]

     steps:
       - name: Download Binary Artifact
-        uses: actions/download-artifact@v3
+        uses: actions/download-artifact@v4
         with:
-          name: dc_util-${{ matrix.os }}-${{ matrix.goarch }}
+          name: dc_util-${{ matrix.goos }}-${{ matrix.goarch }}
           path: ./release

       - name: Create GitHub Release
@@ -85,7 +83,7 @@ jobs:
         with:
           upload_url: ${{ steps.create_release.outputs.upload_url }}
           asset_path: ./release
-          asset_name: dc_util-${{ matrix.os }}-${{ matrix.goarch }}-${{ github.event.inputs.release_version }}
+          asset_name: dc_util-${{ matrix.goos }}-${{ matrix.goarch }}-${{ github.event.inputs.release_version }}
           asset_content_type: application/octet-stream

       - name: Clean Up Release Directory

CHANGES.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,8 @@ Changelog
 Unreleased
 ----------

+* Fix hostname parsing and add tests in dc_util.
+
 * Remove grand-central tables when restoring a full snapshot or grand-central tables.

 2.53.0 (2025-09-25)

utils/dc_util/CHANGES.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Critical Fixes

- **CRITICAL: PostStart Hook Fail-Safe Implementation**
  - **Issue**: PostStart hooks that return non-zero exit codes cause Kubernetes to kill and restart pods indefinitely
  - **Root Cause**: Multiple error conditions in `handleResetRouting()` were returning errors instead of allowing pod startup
  - **Impact**: During cluster formation or when Kubernetes API is unavailable, pods would enter CrashLoopBackOff
  - **Solution**: Made postStart hook completely fail-safe - all error conditions now log and return `nil` (success)
  - **Fail-Safe Points**:
    - Kubernetes config creation failures
    - Kubeconfig loading failures
    - Kubernetes client creation failures
    - Namespace retrieval failures
    - Invalid hostname format detection
    - StatefulSet retrieval failures
    - Cluster readiness check failures
    - SQL execution failures (these were already safe)
  - **Behavior**: All failures log with "allowing pod startup" message and never prevent container from starting
  - **Benefits**: Eliminates circular dependency where pods can't start because postStart hook waits for cluster formation, but cluster can't form because pods won't start
29+
### Added
30+
31+
- **Persistent Logging**: Dual logging to both STDOUT and persistent file with automatic rotation
32+
- New `--log-file` CLI flag (default: `/resource/heapdump/dc_util.log`)
33+
- Automatic file rotation when approaching 1MB to prevent disk space issues
34+
- Failsafe design - continues STDOUT logging even if file logging fails
35+
- Essential for debugging Kubernetes lifecycle hooks where container logs may not be accessible
36+
- Creates directory structure if it doesn't exist
37+
38+
- **PostStart Hook Detection**: Intelligent detection of StatefulSet PostStart hooks
39+
- Automatically scans StatefulSet containers for PostStart hooks with `dc_util --reset-routing`
40+
- Prevents routing allocation changes when no PostStart hook exists to reset them
41+
- Solves historical issue where `NEW_PRIMARIES` routing allocation could not be reliably reset
42+
- Supports both single dash (`-reset-routing`) and double dash (`--reset-routing`) flag formats
43+
- Precise word boundary matching prevents false positives from similar flag names
44+
- Logs clear messages when PostStart hooks are found or missing
45+
46+
- **Single Node Cluster Detection**: Automatic detection and handling of single node clusters
47+
- Detects when StatefulSet has exactly 1 replica and skips decommission
48+
- Prevents unnecessary overhead and potential failures in single node deployments
49+
- Clear logging explains why decommission was skipped
50+
- Maintains existing behavior for multi-node clusters (≥2 replicas)
51+
52+
- **Configurable Lock File Path**: New `--lock-file` CLI flag
53+
- Default: `/resource/heapdump/dc_util.lock`
54+
- Allows customization for different deployment scenarios
55+
- All lock file operations now use configurable path
56+
57+
- **Enhanced Flag Support**: Improved command-line flag handling
58+
- Both `-reset-routing` and `--reset-routing` formats now supported
59+
- Maintains backward compatibility with existing deployments
60+
- Better error handling and validation
61+
62+
- **Multi-Architecture Support**: Automatic CPU architecture detection in hook configurations
63+
- Hook examples now include automatic detection of x86_64/amd64 and aarch64/arm64 architectures
64+
- Downloads appropriate binary based on detected architecture (`dc_util-linux-amd64` or `dc_util-linux-arm64`)
65+
- Eliminates need for separate configuration files for different node architectures
66+
- Graceful error handling for unsupported architectures
67+
68+
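The word-boundary matching mentioned under PostStart Hook Detection might look something like the sketch below. It is an assumption about the approach, not the real `hasPostStartHookWithResetRouting` implementation: `commandResetsRouting` and the regular expression are hypothetical, and the function works on a plain command slice rather than a `*appsv1.StatefulSet` so that it stays self-contained.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// resetRoutingFlag matches -reset-routing or --reset-routing only when the
// flag is a whole word: preceded by start-of-text or whitespace and followed
// by whitespace or end-of-text.
var resetRoutingFlag = regexp.MustCompile(`(^|\s)--?reset-routing(\s|$)`)

// commandResetsRouting reports whether a hook command (as it would appear in
// a postStart exec handler) invokes dc_util with the reset-routing flag.
func commandResetsRouting(command []string) bool {
	joined := strings.Join(command, " ")
	return strings.Contains(joined, "dc_util") && resetRoutingFlag.MatchString(joined)
}

func main() {
	commands := [][]string{
		{"/bin/sh", "-c", "./dc_util --reset-routing"},
		{"/bin/sh", "-c", "./dc_util -reset-routing"},
		{"/bin/sh", "-c", "./dc_util --reset-routing-timeout 5"}, // similar flag name
		{"/bin/sh", "-c", "echo --reset-routing"},                // no dc_util
	}
	for _, c := range commands {
		fmt.Printf("%q -> %v\n", c[len(c)-1], commandResetsRouting(c))
	}
}
```

In this sketch only the first two commands are reported as matches; the similarly named flag and the command without `dc_util` are rejected.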
### Changed
69+
70+
- **Routing Allocation Logic**: Enhanced PreStop process with PostStart hook detection
71+
- Routing allocation changes now only occur when corresponding PostStart hook exists
72+
- Prevents permanent cluster misconfiguration in deployments without PostStart hooks
73+
- More intelligent decision making based on actual StatefulSet configuration
74+
75+
- **Replica Count Handling**: Early replica count check moved to beginning of decommission process
76+
- Zero replicas (scaled down): Immediately skips all operations including SQL and lock file creation
77+
- Single replica: Immediately skips all operations to prevent failures and unnecessary overhead
78+
- Multiple replicas: Proceeds with normal decommission process
79+
- Prevents routing allocation SQL, lock file creation, and timeout calculations for single-node scenarios
80+
- Better log messages explaining the decision for each scenario
81+
82+
- **PostStart Hook Optimization**: Early replica count check in `handleResetRouting()`
83+
- Replica count check moved to very beginning of function (after lock file check)
84+
- Zero replicas: "cluster is scaled down/suspended" - removes lock file and exits
85+
- Single replica: "single node cluster" - removes lock file and exits
86+
- 2+ replicas: Proceeds with routing allocation reset operations
87+
- Prevents unnecessary cluster readiness checks and SQL operations for single-node deployments
88+
- Cleaner logic flow and more predictable behavior
89+
90+
- **Main Decommission Optimization**: Early replica count check in `run()` function
91+
- Replica count check moved immediately after StatefulSet retrieval
92+
- Zero replicas: Exits before any SQL operations, lock file creation, or timeout calculations
93+
- Single replica: Exits before any SQL operations, lock file creation, or timeout calculations
94+
- 2+ replicas: Proceeds with full decommission process
95+
- Eliminates unnecessary routing allocation changes and lock file operations for single-node scenarios
96+
- Significantly improves efficiency and reduces log noise for scaled-down or single-node clusters
97+
98+
- **Function Signatures**: Updated internal functions to support configurable paths
99+
- `createLockFile()` now accepts lock file path parameter
100+
- `removeLockFile()` now accepts lock file path parameter
101+
- `lockFileExists()` now accepts lock file path parameter
102+
- `handleResetRouting()` now accepts lock file path parameter
103+
104+
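The three-way replica count decision described above can be illustrated with a small sketch. The `action` type and `decide` function are hypothetical names, not taken from `dc_util`. The pointer argument mirrors the Kubernetes API, where a StatefulSet's `spec.replicas` is a `*int32` that defaults to 1 when unset.

```go
package main

import "fmt"

// action is a hypothetical result type for the early replica count check.
type action int

const (
	skipScaledDown action = iota // zero replicas: nothing to decommission
	skipSingleNode               // one replica: no other node to move shards to
	proceed                      // two or more replicas: run the normal flow
)

// decide maps a replica count to an action plus a short reason for the log.
// A nil pointer is treated as 1, matching the Kubernetes default.
func decide(replicas *int32) (action, string) {
	n := int32(1)
	if replicas != nil {
		n = *replicas
	}
	switch {
	case n <= 0:
		return skipScaledDown, "cluster is scaled down/suspended"
	case n == 1:
		return skipSingleNode, "single node cluster"
	default:
		return proceed, "multi-node cluster"
	}
}

func main() {
	for _, n := range []int32{0, 1, 3} {
		count := n
		act, reason := decide(&count)
		fmt.Printf("replicas=%d action=%d reason=%q\n", n, act, reason)
	}
}
```

Running this check before any SQL or lock file work is what lets the zero and single replica cases exit without side effects.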
### Improved
105+
106+
- **Logging Experience**: Comprehensive logging improvements
107+
- All log messages now appear in both STDOUT and persistent file
108+
- Better visibility into hook execution for debugging
109+
- Historical logs available even after pod restarts
110+
- Easier troubleshooting and operations monitoring
111+
112+
- **Documentation**: Extensively updated README.md
113+
- Added "Recent Updates" section highlighting new features
114+
- New "Replica Count Logic" section with examples
115+
- Updated CLI parameter table with new flags
116+
- Enhanced "PostStart Hook Detection" documentation
117+
- Added complete "Persistent Logging" section with usage examples
118+
- Updated sample logs sections to reflect new capabilities
119+
- All hook configuration examples now include automatic architecture detection
120+
- Clear separation between basic (preStop only) and complete (both hooks) configurations
121+
122+
- **Testing**: Comprehensive test coverage for all new features
123+
- `TestHasPostStartHookWithResetRouting`: PostStart hook detection with various scenarios
124+
- `TestPostStopRoutingAllocationIntegration`: Integration tests for routing allocation logic
125+
- `TestLoggingIntegration`: Dual logging functionality verification
126+
- `TestLogRotation`: File rotation behavior validation
127+
- `TestSingleNodeClusterBehavior`: Single node cluster detection tests
128+
- `TestReplicaCountBehavior`: Comprehensive replica count handling tests
129+
- All existing tests updated to work with new function signatures
130+
131+
### Technical Details
132+
133+
- **New CLI Flags**:
134+
- `--log-file string`: Path to persistent log file (default: `/resource/heapdump/dc_util.log`)
135+
- `--lock-file string`: Path to lock file (default: `/resource/heapdump/dc_util.lock`)
136+
137+
- **New Functions**:
138+
- `setupLogging(logFile string)`: Configures dual logging with rotation
139+
- `hasPostStartHookWithResetRouting(statefulSet *appsv1.StatefulSet)`: PostStart hook detection
140+
- Enhanced replica count logic in main decommission flow
141+
142+
- **Dependencies**: Added `path/filepath` import for directory handling
143+
144+
### Log Message Examples
145+
146+
```bash
147+
# PostStart hook detection
148+
Decommissioner: No postStart hook with dc_util --reset-routing or -reset-routing found, skipping pre-stop routing allocation change
149+
150+
# Single node cluster detection
151+
Decommissioner: Single node cluster detected (replicas=1) -- Skipping decommission
152+
153+
# Architecture detection in hooks
154+
ARCH=$(uname -m)
155+
case $ARCH in
156+
x86_64) BINARY_ARCH="amd64" ;;
157+
aarch64) BINARY_ARCH="arm64" ;;
158+
*) echo "Unsupported architecture: $ARCH"; exit 1 ;;
159+
esac
160+
curl -sLO https://example.com/dc_util-linux-${BINARY_ARCH}
161+
162+
# Persistent logging
163+
Decommissioner: 2025/10/17 15:02:38 Using kubeconfig from /Users/walter/.kube/config
164+
# (Same message appears in both STDOUT and /resource/heapdump/dc_util.log)
165+
```
166+
167+
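The architecture mapping in the hook snippet lines up with the `dc_util-linux-<goarch>` binaries named earlier in this changelog. Purely as an illustration of that naming convention, the same mapping can be written as a small Go function. `assetName` is a hypothetical helper, not part of `dc_util`.

```go
package main

import "fmt"

// assetName maps a `uname -m` machine name to the release binary name used
// in the hook examples. Unknown machines are reported as an error instead
// of guessing a binary.
func assetName(machine string) (string, error) {
	var goarch string
	switch machine {
	case "x86_64":
		goarch = "amd64"
	case "aarch64":
		goarch = "arm64"
	default:
		return "", fmt.Errorf("unsupported architecture: %s", machine)
	}
	return "dc_util-linux-" + goarch, nil
}

func main() {
	for _, m := range []string{"x86_64", "aarch64", "riscv64"} {
		name, err := assetName(m)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(m, "->", name)
	}
}
```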
### Backward Compatibility
168+
169+
- All existing CLI flags and behavior remain unchanged
170+
- Existing StatefulSet configurations continue to work without modification
171+
- New features are opt-in via CLI flags or automatic detection
172+
- No breaking changes to existing functionality
173+
174+
### Benefits
175+
176+
- **For Operations**: Persistent logs make debugging Kubernetes hooks significantly easier
177+
- **For Development**: Enhanced testing capabilities with better dry-run logging
178+
- **For Reliability**: Prevents cluster misconfigurations and single node failures
179+
- **For Maintenance**: Clear logging and automatic file rotation reduce operational overhead
180+
- **For Multi-Architecture**: Automatic architecture detection ensures compatibility across heterogeneous Kubernetes clusters
