Merged
28 changes: 13 additions & 15 deletions .github/workflows/build-and-release-dc_util.yaml
@@ -13,10 +13,8 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        # os: [linux, windows, darwin]
-        os: [linux]
-        # goarch: [amd64, arm64]
-        goarch: [amd64]
+        goos: [linux]
+        goarch: [amd64, arm64]

     steps:
       - name: Checkout Repository
@@ -31,22 +29,22 @@ jobs:
         run: |
           mkdir -p build
           cd utils/dc_util  # Navigate to the directory containing dc_util.go
-          GOOS=${{ matrix.os }} GOARCH=${{ matrix.goarch }} go build -o ../../build/dc_util-${{ matrix.os }}-${{ matrix.goarch }}
+          GOOS=${{ matrix.goos }} GOARCH=${{ matrix.goarch }} go build -o ../../build/dc_util-${{ matrix.goos }}-${{ matrix.goarch }}

       - name: Generate SHA256 Checksum
         run: |
           cd build
-          if [[ "${{ matrix.os }}" == "windows" ]]; then
-            sha256sum dc_util-${{ matrix.os }}-${{ matrix.goarch }}.exe > dc_util-${{ matrix.os }}-${{ matrix.goarch }}.exe.sha256
+          if [[ "${{ matrix.goos }}" == "windows" ]]; then
+            sha256sum dc_util-${{ matrix.goos }}-${{ matrix.goarch }}.exe > dc_util-${{ matrix.goos }}-${{ matrix.goarch }}.exe.sha256
           else
-            sha256sum dc_util-${{ matrix.os }}-${{ matrix.goarch }} > dc_util-${{ matrix.os }}-${{ matrix.goarch }}.sha256
+            sha256sum dc_util-${{ matrix.goos }}-${{ matrix.goarch }} > dc_util-${{ matrix.goos }}-${{ matrix.goarch }}.sha256
           fi

       - name: Upload Binary and Checksum Artifacts
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
         with:
-          name: dc_util-${{ matrix.os }}-${{ matrix.goarch }}
-          path: build/dc_util-${{ matrix.os }}-${{ matrix.goarch }}*
+          name: dc_util-${{ matrix.goos }}-${{ matrix.goarch }}
+          path: build/dc_util-${{ matrix.goos }}-${{ matrix.goarch }}*

       - name: Clean Up Build Directory
         run: |
@@ -57,14 +55,14 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        os: [linux, windows, darwin]
+        goos: [linux]
         goarch: [amd64, arm64]

     steps:
       - name: Download Binary Artifact
-        uses: actions/download-artifact@v3
+        uses: actions/download-artifact@v4
         with:
-          name: dc_util-${{ matrix.os }}-${{ matrix.goarch }}
+          name: dc_util-${{ matrix.goos }}-${{ matrix.goarch }}
           path: ./release

       - name: Create GitHub Release
@@ -85,7 +83,7 @@ jobs:
         with:
           upload_url: ${{ steps.create_release.outputs.upload_url }}
           asset_path: ./release
-          asset_name: dc_util-${{ matrix.os }}-${{ matrix.goarch }}-${{ github.event.inputs.release_version }}
+          asset_name: dc_util-${{ matrix.goos }}-${{ matrix.goarch }}-${{ github.event.inputs.release_version }}
           asset_content_type: application/octet-stream

       - name: Clean Up Release Directory
2 changes: 2 additions & 0 deletions CHANGES.rst
@@ -5,6 +5,8 @@ Changelog
 Unreleased
 ----------

+* Fix hostname parsing and add tests in dc_util.
+
 * Remove grand-central tables when restoring a full snapshot or grand-central tables.

 2.53.0 (2025-09-25)
180 changes: 180 additions & 0 deletions utils/dc_util/CHANGES.md
@@ -0,0 +1,180 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Critical Fixes

- **CRITICAL: PostStart Hook Fail-Safe Implementation**
  - **Issue**: PostStart hooks that return non-zero exit codes cause Kubernetes to kill the container and restart it repeatedly
  - **Root Cause**: Multiple error conditions in `handleResetRouting()` returned errors instead of allowing pod startup
  - **Impact**: During cluster formation, or when the Kubernetes API is unavailable, pods would enter CrashLoopBackOff
  - **Solution**: Made the postStart hook completely fail-safe: all error conditions now log and return `nil` (success)
  - **Fail-Safe Points**:
    - Kubernetes config creation failures
    - Kubeconfig loading failures
    - Kubernetes client creation failures
    - Namespace retrieval failures
    - Invalid hostname format detection
    - StatefulSet retrieval failures
    - Cluster readiness check failures
    - SQL execution failures (was already safe)
  - **Behavior**: All failures log an "allowing pod startup" message and never prevent the container from starting
  - **Benefits**: Eliminates the circular dependency where pods can't start because the postStart hook waits for cluster formation, but the cluster can't form because pods won't start
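
The fail-safe pattern described above can be illustrated with a minimal Go sketch. It is not the actual `handleResetRouting()` code: `resetRouting` and `handleResetRoutingFailSafe` are hypothetical names, and the boolean parameter merely simulates whether the Kubernetes API is reachable.

```go
package main

import (
	"errors"
	"log"
)

// resetRouting stands in for the real work (Kubernetes lookups, SQL).
// Hypothetical helper: apiAvailable simulates an unreachable API server.
func resetRouting(apiAvailable bool) error {
	if !apiAvailable {
		return errors.New("kubernetes API unavailable")
	}
	return nil
}

// handleResetRoutingFailSafe logs any failure and still reports success,
// so a postStart hook built on it never exits with a non-zero code.
func handleResetRoutingFailSafe(apiAvailable bool) error {
	if err := resetRouting(apiAvailable); err != nil {
		log.Printf("reset routing failed: %v; allowing pod startup", err)
	}
	return nil
}

func main() {
	if err := handleResetRoutingFailSafe(false); err != nil {
		log.Fatal(err) // never reached in this sketch
	}
	log.Print("hook finished with success status")
}
```

The trade-off of this design is that a failed routing reset is only visible in the logs, which is why persistent logging matters for this hook.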

### Added

- **Persistent Logging**: Dual logging to both STDOUT and persistent file with automatic rotation
  - New `--log-file` CLI flag (default: `/resource/heapdump/dc_util.log`)
  - Automatic file rotation when approaching 1MB to prevent disk space issues
  - Failsafe design - continues STDOUT logging even if file logging fails
  - Essential for debugging Kubernetes lifecycle hooks where container logs may not be accessible
  - Creates directory structure if it doesn't exist

- **PostStart Hook Detection**: Intelligent detection of StatefulSet PostStart hooks
  - Automatically scans StatefulSet containers for PostStart hooks with `dc_util --reset-routing`
  - Prevents routing allocation changes when no PostStart hook exists to reset them
  - Solves historical issue where `NEW_PRIMARIES` routing allocation could not be reliably reset
  - Supports both single dash (`-reset-routing`) and double dash (`--reset-routing`) flag formats
  - Precise word boundary matching prevents false positives from similar flag names
  - Logs clear messages when PostStart hooks are found or missing

- **Single Node Cluster Detection**: Automatic detection and handling of single node clusters
  - Detects when StatefulSet has exactly 1 replica and skips decommission
  - Prevents unnecessary overhead and potential failures in single node deployments
  - Clear logging explains why decommission was skipped
  - Maintains existing behavior for multi-node clusters (≥2 replicas)

- **Configurable Lock File Path**: New `--lock-file` CLI flag
  - Default: `/resource/heapdump/dc_util.lock`
  - Allows customization for different deployment scenarios
  - All lock file operations now use configurable path

- **Enhanced Flag Support**: Improved command-line flag handling
  - Both `-reset-routing` and `--reset-routing` formats now supported
  - Maintains backward compatibility with existing deployments
  - Better error handling and validation

- **Multi-Architecture Support**: Automatic CPU architecture detection in hook configurations
  - Hook examples now include automatic detection of x86_64/amd64 and aarch64/arm64 architectures
  - Downloads appropriate binary based on detected architecture (`dc_util-linux-amd64` or `dc_util-linux-arm64`)
  - Eliminates need for separate configuration files for different node architectures
  - Graceful error handling for unsupported architectures
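
The word-boundary matching mentioned under PostStart Hook Detection could look roughly like the Go sketch below. It is an illustration only: `commandResetsRouting` is a hypothetical helper that works on a plain command slice rather than on a StatefulSet object, and the regular expression is an assumption, not the pattern used in `dc_util`.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// resetRoutingFlag matches -reset-routing or --reset-routing only as a whole
// token: preceded by start-of-text or whitespace, followed by whitespace,
// '=' or end-of-text. (Assumed pattern, for illustration.)
var resetRoutingFlag = regexp.MustCompile(`(^|\s)--?reset-routing(\s|=|$)`)

// commandResetsRouting reports whether a hook command mentions dc_util and
// carries the reset-routing flag as a standalone token.
func commandResetsRouting(command []string) bool {
	joined := strings.Join(command, " ")
	return strings.Contains(joined, "dc_util") && resetRoutingFlag.MatchString(joined)
}

func main() {
	examples := [][]string{
		{"/bin/sh", "-c", "./dc_util --reset-routing"},
		{"dc_util", "-reset-routing"},
		{"dc_util", "--reset-routing-now"}, // similar flag name, should not match
		{"echo", "--reset-routing"},        // flag present, but no dc_util
	}
	for _, cmd := range examples {
		fmt.Printf("%q -> %v\n", cmd, commandResetsRouting(cmd))
	}
}
```

Anchoring the flag on whitespace rather than using a plain substring search is what keeps flags such as `--reset-routing-now` from being treated as a match.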

### Changed

- **Routing Allocation Logic**: Enhanced PreStop process with PostStart hook detection
  - Routing allocation changes now only occur when corresponding PostStart hook exists
  - Prevents permanent cluster misconfiguration in deployments without PostStart hooks
  - More intelligent decision making based on actual StatefulSet configuration

- **Replica Count Handling**: Early replica count check moved to beginning of decommission process
  - Zero replicas (scaled down): Immediately skips all operations including SQL and lock file creation
  - Single replica: Immediately skips all operations to prevent failures and unnecessary overhead
  - Multiple replicas: Proceeds with normal decommission process
  - Prevents routing allocation SQL, lock file creation, and timeout calculations for single-node scenarios
  - Better log messages explaining the decision for each scenario

- **PostStart Hook Optimization**: Early replica count check in `handleResetRouting()`
  - Replica count check moved to very beginning of function (after lock file check)
  - Zero replicas: "cluster is scaled down/suspended" - removes lock file and exits
  - Single replica: "single node cluster" - removes lock file and exits
  - 2+ replicas: Proceeds with routing allocation reset operations
  - Prevents unnecessary cluster readiness checks and SQL operations for single-node deployments
  - Cleaner logic flow and more predictable behavior

- **Main Decommission Optimization**: Early replica count check in `run()` function
  - Replica count check moved immediately after StatefulSet retrieval
  - Zero replicas: Exits before any SQL operations, lock file creation, or timeout calculations
  - Single replica: Exits before any SQL operations, lock file creation, or timeout calculations
  - 2+ replicas: Proceeds with full decommission process
  - Eliminates unnecessary routing allocation changes and lock file operations for single-node scenarios
  - Significantly improves efficiency and reduces log noise for scaled-down or single-node clusters

- **Function Signatures**: Updated internal functions to support configurable paths
  - `createLockFile()` now accepts lock file path parameter
  - `removeLockFile()` now accepts lock file path parameter
  - `lockFileExists()` now accepts lock file path parameter
  - `handleResetRouting()` now accepts lock file path parameter
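
The three-way replica count branching described in this section can be summarized in a small Go sketch. The `action` type and `decideForReplicas` function are hypothetical names chosen for illustration; the real `run()` and `handleResetRouting()` functions read the count from the StatefulSet and perform additional work.

```go
package main

import "fmt"

// action is the decision taken after the early replica count check.
type action int

const (
	skipScaledDown action = iota // zero replicas: cluster scaled down/suspended
	skipSingleNode               // one replica: nothing to decommission to
	proceed                      // two or more replicas: normal processing
)

// String gives a readable label for log output.
func (a action) String() string {
	switch a {
	case skipScaledDown:
		return "skip (scaled down)"
	case skipSingleNode:
		return "skip (single node)"
	default:
		return "proceed"
	}
}

// decideForReplicas maps a StatefulSet replica count to an action before any
// SQL, lock file, or timeout work would be done.
func decideForReplicas(replicas int32) action {
	switch {
	case replicas <= 0:
		return skipScaledDown
	case replicas == 1:
		return skipSingleNode
	default:
		return proceed
	}
}

func main() {
	for _, n := range []int32{0, 1, 3} {
		fmt.Printf("replicas=%d: %s\n", n, decideForReplicas(n))
	}
}
```

Keeping the decision in one pure function makes it straightforward to cover all three cases with a table-driven test.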

### Improved

- **Logging Experience**: Comprehensive logging improvements
  - All log messages now appear in both STDOUT and persistent file
  - Better visibility into hook execution for debugging
  - Historical logs available even after pod restarts
  - Easier troubleshooting and operations monitoring

- **Documentation**: Extensively updated README.md
  - Added "Recent Updates" section highlighting new features
  - New "Replica Count Logic" section with examples
  - Updated CLI parameter table with new flags
  - Enhanced "PostStart Hook Detection" documentation
  - Added complete "Persistent Logging" section with usage examples
  - Updated sample logs sections to reflect new capabilities
  - All hook configuration examples now include automatic architecture detection
  - Clear separation between basic (preStop only) and complete (both hooks) configurations

- **Testing**: Comprehensive test coverage for all new features
  - `TestHasPostStartHookWithResetRouting`: PostStart hook detection with various scenarios
  - `TestPostStopRoutingAllocationIntegration`: Integration tests for routing allocation logic
  - `TestLoggingIntegration`: Dual logging functionality verification
  - `TestLogRotation`: File rotation behavior validation
  - `TestSingleNodeClusterBehavior`: Single node cluster detection tests
  - `TestReplicaCountBehavior`: Comprehensive replica count handling tests
  - All existing tests updated to work with new function signatures

### Technical Details

- **New CLI Flags**:
  - `--log-file string`: Path to persistent log file (default: `/resource/heapdump/dc_util.log`)
  - `--lock-file string`: Path to lock file (default: `/resource/heapdump/dc_util.lock`)

- **New Functions**:
  - `setupLogging(logFile string)`: Configures dual logging with rotation
  - `hasPostStartHookWithResetRouting(statefulSet *appsv1.StatefulSet)`: PostStart hook detection
  - Enhanced replica count logic in main decommission flow

- **Dependencies**: Added `path/filepath` import for directory handling

### Log Message Examples

```bash
# PostStart hook detection
Decommissioner: No postStart hook with dc_util --reset-routing or -reset-routing found, skipping pre-stop routing allocation change

# Single node cluster detection
Decommissioner: Single node cluster detected (replicas=1) -- Skipping decommission

# Architecture detection in hooks
ARCH=$(uname -m)
case $ARCH in
x86_64) BINARY_ARCH="amd64" ;;
aarch64) BINARY_ARCH="arm64" ;;
*) echo "Unsupported architecture: $ARCH"; exit 1 ;;
esac
curl -sLO https://example.com/dc_util-linux-${BINARY_ARCH}

# Persistent logging
Decommissioner: 2025/10/17 15:02:38 Using kubeconfig from /Users/walter/.kube/config
# (Same message appears in both STDOUT and /resource/heapdump/dc_util.log)
```
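
For comparison, the architecture mapping from the shell snippet above can be expressed in Go. This is a sketch only: `binaryArch` is a hypothetical helper, and the binary names follow the `dc_util-linux-amd64` / `dc_util-linux-arm64` pattern mentioned in this changelog.

```go
package main

import "fmt"

// binaryArch maps a machine name, as reported by `uname -m` or given in Go's
// GOARCH spelling, to the suffix used in the released binary names.
func binaryArch(machine string) (string, error) {
	switch machine {
	case "x86_64", "amd64":
		return "amd64", nil
	case "aarch64", "arm64":
		return "arm64", nil
	default:
		return "", fmt.Errorf("unsupported architecture: %s", machine)
	}
}

func main() {
	for _, machine := range []string{"x86_64", "aarch64", "riscv64"} {
		arch, err := binaryArch(machine)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("dc_util-linux-%s\n", arch)
	}
}
```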

### Backward Compatibility

- All existing CLI flags and behavior remain unchanged
- Existing StatefulSet configurations continue to work without modification
- New features are either configured through CLI flags or applied through automatic detection
- No breaking changes to existing functionality

### Benefits

- **For Operations**: Persistent logs make debugging Kubernetes hooks significantly easier
- **For Development**: Enhanced testing capabilities with better dry-run logging
- **For Reliability**: Prevents cluster misconfigurations and single node failures
- **For Maintenance**: Clear logging and automatic file rotation reduce operational overhead
- **For Multi-Architecture**: Automatic architecture detection ensures compatibility across heterogeneous Kubernetes clusters