A Kubernetes operator that enables zero-downtime live migration of applications using CRIU (Checkpoint/Restore In Userspace) with Object Storage integration. Designed specifically for Spot/Preemptible instances.
This operator provides:
- Automatic Migration: Detects spot instance interruptions and automatically migrates workloads
- Incremental Checkpoints: Regular pre-checkpoints with minimal overhead
- Object Storage Integration: Stores checkpoints in S3/MinIO/GCS for cross-node migration
- Lazy Page Loading: Fast restore with on-demand page fetching from object storage
- Kubernetes Native: CRD-based API with familiar kubectl workflows
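The incremental checkpoint feature, together with the maxCheckpointChainDepth setting described under checkpointPolicy, implies a scheduler that alternates between full and incremental checkpoints. The following is a minimal sketch of that decision, not the operator's actual scheduler; the chainState type and nextCheckpoint function are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// CheckpointKind distinguishes a full dump from an incremental pre-checkpoint.
type CheckpointKind string

const (
	Full        CheckpointKind = "full"
	Incremental CheckpointKind = "incremental"
)

// chainState is a hypothetical record of the current checkpoint chain.
// Depth is the number of checkpoints in the chain; 0 means no chain exists yet.
type chainState struct {
	Depth int
}

// nextCheckpoint decides which kind of checkpoint to take next and returns
// the updated chain state. A full checkpoint starts a new chain at depth 1;
// an incremental checkpoint extends the existing chain by one.
func nextCheckpoint(s chainState, maxDepth int) (CheckpointKind, chainState) {
	if s.Depth == 0 || s.Depth >= maxDepth {
		return Full, chainState{Depth: 1}
	}
	return Incremental, chainState{Depth: s.Depth + 1}
}

func main() {
	s := chainState{}
	for i := 1; i <= 5; i++ {
		var kind CheckpointKind
		kind, s = nextCheckpoint(s, 3)
		fmt.Printf("checkpoint %d: %s (chain depth %d)\n", i, kind, s.Depth)
	}
}
```

With a maximum depth of 3, the sequence is full, incremental, incremental, full, incremental. Capping the chain bounds how many incremental images a restore has to walk through.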
┌────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                     │
│                                                            │
│    ┌──────────────────────────────────────────────────┐    │
│    │ Migration Controller                             │    │
│    │ - Reconciles MigratableApp resources             │    │
│    │ - Orchestrates migrations                        │    │
│    │ - Manages Pod lifecycle                          │    │
│    └──────────────────────────────────────────────────┘    │
│                                                            │
│    ┌──────────────────────────────────────────────────┐    │
│    │ Node Monitor (DaemonSet)                         │    │
│    │ - Detects spot interruptions                     │    │
│    │ - Triggers migrations                            │    │
│    └──────────────────────────────────────────────────┘    │
│                                                            │
│    ┌──────────────────────────────────────────────────┐    │
│    │ Application Pod                                  │    │
│    │  ┌──────────────┐   ┌────────────────────┐       │    │
│    │  │ App Container│   │ CRIU Agent Sidecar │       │    │
│    │  │              │   │ - gRPC Server      │       │    │
│    │  │   your-app   │◄──│ - Checkpoint       │       │    │
│    │  │              │   │ - Restore          │       │    │
│    │  └──────────────┘   └────────────────────┘       │    │
│    └──────────────────────────────────────────────────┘    │
│                             │                              │
│                             ▼                              │
│                    Object Storage (S3)                     │
└────────────────────────────────────────────────────────────┘
- Go: 1.25.3+ (required for building)
- Docker: For building container images
- Protobuf Compiler: protoc (for generating gRPC code)
- controller-gen: For generating CRD manifests
- kubectl: For deploying to Kubernetes
- Kubernetes: v1.20+
- Container Runtime: containerd (with CRIU support) or CRI-O
- Object Storage: S3, MinIO, or GCS
- Linux Kernel: 4.x+ (with CRIU support)
# Install Go 1.25.3 or later
# Ubuntu/Debian example:
wget https://go.dev/dl/go1.25.3.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.25.3.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
export PATH=$PATH:$(go env GOPATH)/bin
source /etc/profile # Or add to ~/.bashrc
# Install protobuf compiler
sudo apt update && sudo apt install -y protobuf-compiler
# Install Go tools
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest
cd kubernetes_integration
go mod download
go mod tidy
# Generate protobuf code
./scripts/generate-proto.sh
# Or generate manually:
export PATH=$PATH:$(go env GOPATH)/bin
protoc \
  --go_out=. \
  --go_opt=paths=source_relative \
  --go-grpc_out=. \
  --go-grpc_opt=paths=source_relative \
  pkg/proto/agent.proto
# Generate CRD manifests
make manifests
# This creates:
# - config/crd/migration.io_migratableapps.yaml
# - config/rbac/role.yaml
# Build all binaries
make build
# Output:
# - bin/agent (CRIU Agent)
# - bin/controller (Migration Controller)
# - bin/node-monitor (Node Monitor)
# Download CRIU binary and build all images
make docker-build
# This will:
# 1. Download CRIU binary from S3
# 2. Build agent image (with CRIU)
# 3. Build controller image
# 4. Build node-monitor image
# Images created:
# - 192.168.0.253:5000/criu-agent:latest
# - 192.168.0.253:5000/criu-migration-controller:latest
# - 192.168.0.253:5000/criu-node-monitor:latest
# Push all images to registry
make docker-push
# Or customize registry:
make docker-push REGISTRY=your-registry.com/yourorg
# Full build from scratch
source /etc/profile
cd kubernetes_integration
# 1. Install dependencies
go mod tidy
# 2. Generate code
./scripts/generate-proto.sh
make manifests
# 3. Build binaries
make build
# 4. Build and push Docker images
make docker-push
- Kubernetes cluster running (v1.20+)
- kubectl configured to access the cluster
- Docker images pushed to registry (or use make docker-push)
# 1. Install CRDs
make install
# 2. Deploy namespace, RBAC, controller and monitor
make deploy
# Or with custom registry:
make deploy REGISTRY=192.168.0.253:5000
# 3. Create storage credentials (see below)
Note: The make deploy command will automatically substitute the correct image references based on the REGISTRY variable. If you built and pushed images with a custom registry (e.g., make docker-push REGISTRY=192.168.0.253:5000), use the same REGISTRY value when deploying.
kubectl apply -f config/crd/migration.io_migratableapps.yaml
# This will create:
# - Namespace: migration-system
# - ServiceAccount: migration-controller
# - ClusterRole and ClusterRoleBinding
# - Leader election Role and RoleBinding
kubectl apply -f config/rbac/rbac.yaml
# Deploy the controller deployment and node-monitor daemonset
kubectl apply -f config/manager/manager.yaml
# Check if pods are running
kubectl get pods -n migration-system
# Expected output:
# NAME                                    READY   STATUS    RESTARTS   AGE
# migration-controller-xxxxxxxxxx-xxxxx   1/1     Running   0          30s
# node-monitor-xxxxx                      1/1     Running   0          30s
# node-monitor-yyyyy                      1/1     Running   0          30s
Important: Create the secret in the migration-system namespace (where the controller is deployed). The controller will automatically inject these credentials into all MigratableApp pods, regardless of which namespace they run in.
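Kubernetes only lets a pod reference Secrets from its own namespace, so a controller serving MigratableApp pods in other namespaces has to carry the credential values across itself. How this operator does that is not stated here. The sketch below shows one possible approach, converting the secret's decoded data into environment variables for the sidecar. EnvVar is a simplified stand-in for the Kubernetes core/v1 type, and envFromSecret is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// EnvVar is a simplified stand-in for the Kubernetes core/v1 EnvVar type.
type EnvVar struct {
	Name  string
	Value string
}

// envFromSecret turns decoded secret data into environment variables,
// sorted by name so that repeated reconciles yield an identical pod spec.
func envFromSecret(data map[string][]byte) []EnvVar {
	names := make([]string, 0, len(data))
	for name := range data {
		names = append(names, name)
	}
	sort.Strings(names)

	env := make([]EnvVar, 0, len(names))
	for _, name := range names {
		env = append(env, EnvVar{Name: name, Value: string(data[name])})
	}
	return env
}

func main() {
	// Placeholder values, matching the keys used by the s3-credentials secret.
	secret := map[string][]byte{
		"AWS_SECRET_ACCESS_KEY": []byte("your-secret-key"),
		"AWS_ACCESS_KEY_ID":     []byte("your-access-key"),
	}
	for _, e := range envFromSecret(secret) {
		fmt.Println("inject:", e.Name) // print names only, never the values
	}
}
```

Injecting literal values makes them visible in the pod spec; copying the Secret into the application's namespace and referencing it there is an alternative with a different trade-off.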
For AWS S3:
kubectl create secret generic s3-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=your-access-key \
  --from-literal=AWS_SECRET_ACCESS_KEY=your-secret-key \
  -n migration-system
For MinIO:
kubectl create secret generic s3-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=minioadmin \
  --from-literal=AWS_SECRET_ACCESS_KEY=minioadmin \
  -n migration-system
Note: Only one secret in the migration-system namespace is needed. All MigratableApps in any namespace will use this secret.
# example-app.yaml
apiVersion: migration.io/v1alpha1
kind: MigratableApp
metadata:
  name: my-web-app
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: app
        image: python:3.9-slim
        command: ["python", "-c"]
        args:
        - |
          import time
          counter = 0
          while True:
              counter += 1
              print(f"Counter: {counter}")
              time.sleep(5)
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
  checkpointPolicy:
    interval: "30s"
    autoAdjust: true
    memoryThresholdMB: 100
    maxCheckpointChainDepth: 10
  migrationPolicy:
    autoMigrate: true
    preferOnDemand: true
    migrationTimeoutSeconds: 300
  storage:
    type: s3
    bucket: my-checkpoint-bucket
    endpoint: http://minio.default.svc.cluster.local:9000
    region: us-east-1
    credentialsSecret: s3-credentials
kubectl apply -f example-app.yaml
# Watch MigratableApp status
kubectl get mapp my-web-app -w
# Get detailed status
kubectl describe mapp my-web-app
# View logs
kubectl logs -l migration.io/app=my-web-app -c criu-agent
kubectl logs -l migration.io/app=my-web-app -c app
# View checkpoint information
kubectl get mapp my-web-app -o jsonpath='{.status.checkpointStatus}' | jq
# Output example:
# {
#   "lastCheckpointID": "abc123-1234567890",
#   "lastCheckpointTime": "2024-11-05T08:00:00Z",
#   "checkpointChainDepth": 3,
#   "checkpointChainRoot": "xyz789-1234567890"
# }
kubectl get mapp my-web-app -o jsonpath='{.status.migrationHistory}' | jq
# Output example:
# [
#   {
#     "fromNode": "node-1",
#     "toNode": "node-2",
#     "timestamp": "2024-11-05T08:05:00Z",
#     "reason": "spot-interrupt",
#     "duration": "15.2s",
#     "success": true
#   }
# ]
# Add migration trigger annotation
POD_NAME=$(kubectl get pod -l migration.io/app=my-web-app -o jsonpath='{.items[0].metadata.name}')
kubectl annotate pod $POD_NAME migration.io/trigger=requested
kubectl annotate pod $POD_NAME migration.io/reason=manual
checkpointPolicy:
  # Interval between pre-checkpoints
  interval: "30s"
  # Automatically adjust interval based on memory changes
  autoAdjust: true
  # Trigger checkpoint when memory changes exceed this threshold (MB)
  memoryThresholdMB: 100
  # Maximum checkpoint chain depth before full checkpoint
  maxCheckpointChainDepth: 10
migrationPolicy:
  # Enable automatic migration on spot interrupt
  autoMigrate: true
  # Node selector for migration target
  targetNodeSelector:
    node-type: on-demand
  # Prefer on-demand nodes over spot
  preferOnDemand: true
  # Migration timeout (seconds)
  migrationTimeoutSeconds: 300
AWS S3:
storage:
  type: s3
  bucket: my-bucket
  region: us-east-1
  credentialsSecret: aws-credentials
MinIO:
storage:
  type: minio
  bucket: my-bucket
  endpoint: http://minio.default.svc.cluster.local:9000
  region: us-east-1
  credentialsSecret: minio-credentials
GCS:
storage:
  type: gcs
  bucket: my-bucket
  credentialsSecret: gcs-credentials
# Development
make help # Show all available targets
make generate # Generate protobuf and deepcopy code
make fmt # Format Go code
make vet # Run Go vet
make test # Run tests
# Build
make build # Build binaries (agent, controller, node-monitor)
# Docker
make download-criu # Download CRIU binary from S3
make docker-build # Build Docker images (includes download-criu)
make docker-push # Build and push Docker images
# Deployment
make manifests # Generate CRD and RBAC manifests
make install # Install CRDs to cluster
make uninstall # Uninstall CRDs from cluster
make deploy # Deploy controller and monitor
make undeploy # Remove controller and monitor
# Dependencies
make controller-gen # Install controller-gen
make protoc-gen-go # Install protoc-gen-go
make protoc-gen-go-grpc # Install protoc-gen-go-grpc
# Use custom CRIU binary URL
make docker-build CRIU_URL=https://your-server.com/criu
# Build and push with custom registry
make docker-push REGISTRY=your-registry.com/yourorg
# Deploy with same custom registry
make deploy REGISTRY=your-registry.com/yourorg
# Complete workflow:
make docker-push REGISTRY=192.168.0.253:5000
make deploy REGISTRY=192.168.0.253:5000
The REGISTRY variable affects:
- docker-build/docker-push: Sets the image tags for building and pushing
- deploy: Automatically replaces image references in deployment YAML before applying to cluster
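To make the deploy-time substitution concrete, here is a small sketch of rewriting image references in a manifest. It is not how the Makefile does it (that mechanism is not shown here, and a Makefile would more likely use a shell tool). It assumes the manifests reference the default registry 192.168.0.253:5000 listed earlier, and rewriteImages is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultRegistry is assumed to be the registry baked into the manifests.
const defaultRegistry = "192.168.0.253:5000"

// rewriteImages replaces the default registry prefix on every `image:` line
// of a manifest and leaves all other lines untouched.
func rewriteImages(manifest, registry string) string {
	lines := strings.Split(manifest, "\n")
	for i, line := range lines {
		trimmed := strings.TrimPrefix(strings.TrimSpace(line), "- ")
		if !strings.HasPrefix(trimmed, "image:") {
			continue
		}
		lines[i] = strings.Replace(line, defaultRegistry+"/", registry+"/", 1)
	}
	return strings.Join(lines, "\n")
}

func main() {
	manifest := "containers:\n" +
		"- name: controller\n" +
		"  image: 192.168.0.253:5000/criu-migration-controller:latest\n"
	fmt.Print(rewriteImages(manifest, "your-registry.com/yourorg"))
}
```

Restricting the replacement to image lines avoids touching other fields that happen to contain the registry address.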
Edit the Makefile:
AGENT_IMG ?= $(REGISTRY)/criu-agent:v1.0.0
CONTROLLER_IMG ?= $(REGISTRY)/criu-migration-controller:v1.0.0
MONITOR_IMG ?= $(REGISTRY)/criu-node-monitor:v1.0.0
Problem: go: command not found
# Install Go and add to PATH
export PATH=$PATH:/usr/local/go/bin
export PATH=$PATH:$(go env GOPATH)/bin
source /etc/profile
Problem: protoc: command not found
# Install protobuf compiler
sudo apt install -y protobuf-compiler
Problem: controller-gen: command not found
# Install controller-gen
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest
Problem: CRIU download fails
# Check CRIU URL and download manually
curl -L -o criu/criu https://mhsong-criu-s3-data.s3.us-west-2.amazonaws.com/criu
chmod +x criu/criu
Problem: Go version mismatch in Docker
# Dockerfiles use golang:1.25.3-alpine
# The Go version required by go.mod (go >= 1.25.1) must not be newer than the image's Go version
Problem: Agent connection failed
# Check agent pod logs
kubectl logs <pod-name> -c criu-agent
# Verify agent is running
kubectl exec <pod-name> -c criu-agent -- ps aux | grep agent
Problem: Checkpoint failed
# Check CRIU logs in the pod
kubectl exec <pod-name> -c criu-agent -- ls /checkpoints
kubectl exec <pod-name> -c criu-agent -- cat /checkpoints/<dump-id>/criu.log
# Verify CRIU is available
kubectl exec <pod-name> -c criu-agent -- criu check --all
Problem: Migration timeout
# Increase migration timeout
kubectl edit mapp <app-name>
# Set spec.migrationPolicy.migrationTimeoutSeconds to a higher value
kubernetes_integration/
├── api/v1alpha1/ # CRD API definitions
├── cmd/ # Main applications
│ ├── agent/ # CRIU Agent
│ ├── controller/ # Migration Controller
│ └── node-monitor/ # Node Monitor
├── pkg/ # Libraries
│ ├── agent/ # Agent implementation
│ ├── controller/ # Controller implementation
│ ├── scheduler/ # Checkpoint scheduler
│ ├── monitor/ # Spot monitor
│ └── proto/ # gRPC definitions
├── config/ # Kubernetes manifests
│ ├── crd/ # CRD definitions
│ ├── rbac/ # RBAC configs
│ ├── manager/ # Controller deployment
│ └── samples/ # Example applications
├── deploy/ # Dockerfiles
│ ├── agent/
│ ├── controller/
│ └── node-monitor/
├── scripts/ # Build scripts
├── Makefile # Build automation
└── README.md # This file
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Apache License 2.0
For questions or support:
- GitHub: github.com/ddps-lab/criu-migration-operator
- Issues: GitHub Issues