ddps-lab/criu-migration-operator

CRIU Migration Operator for Kubernetes

A Kubernetes operator that enables zero-downtime live migration of applications using CRIU (Checkpoint/Restore In Userspace) with Object Storage integration. Designed specifically for Spot/Preemptible instances.

Overview

This operator provides:

  • Automatic Migration: Detects spot instance interruptions and automatically migrates workloads
  • Incremental Checkpoints: Regular pre-checkpoints with minimal overhead
  • Object Storage Integration: Stores checkpoints in S3/MinIO/GCS for cross-node migration
  • Lazy Page Loading: Fast restore with on-demand page fetching from object storage
  • Kubernetes Native: CRD-based API with familiar kubectl workflows

Architecture

┌─────────────────────────────────────────────────────────┐
│                 Kubernetes Cluster                       │
│                                                          │
│  ┌────────────────────────────────────────────────┐    │
│  │          Migration Controller                  │    │
│  │  - Reconciles MigratableApp resources          │    │
│  │  - Orchestrates migrations                     │    │
│  │  - Manages Pod lifecycle                       │    │
│  └────────────────────────────────────────────────┘    │
│                                                          │
│  ┌────────────────────────────────────────────────┐    │
│  │         Node Monitor (DaemonSet)               │    │
│  │  - Detects spot interruptions                  │    │
│  │  - Triggers migrations                         │    │
│  └────────────────────────────────────────────────┘    │
│                                                          │
│  ┌────────────────────────────────────────────────┐    │
│  │           Application Pod                      │    │
│  │  ┌──────────────┐   ┌────────────────────┐    │    │
│  │  │ App Container│   │ CRIU Agent Sidecar │    │    │
│  │  │              │   │ - gRPC Server      │    │    │
│  │  │ your-app     │◄──│ - Checkpoint       │    │    │
│  │  │              │   │ - Restore          │    │    │
│  │  └──────────────┘   └────────────────────┘    │    │
│  └────────────────────────────────────────────────┘    │
│                       │                                  │
│                       ▼                                  │
│               Object Storage (S3)                       │
└─────────────────────────────────────────────────────────┘

Prerequisites

Development Environment

  • Go: 1.25.3+ (required for building)
  • Docker: For building container images
  • Protobuf Compiler: protoc (for generating gRPC code)
  • controller-gen: For generating CRD manifests
  • kubectl: For deploying to Kubernetes

Kubernetes Cluster

  • Kubernetes: v1.20+
  • Container Runtime: containerd (with CRIU support) or CRI-O
  • Object Storage: S3, MinIO, or GCS
  • Linux Kernel: 4.x+ (with CRIU support)

Building from Source

Step 1: Install Go

# Install Go 1.25.3 or later
# Ubuntu/Debian example:
wget https://go.dev/dl/go1.25.3.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.25.3.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
export PATH=$PATH:$(go env GOPATH)/bin
# To persist these, append both export lines to /etc/profile (or ~/.bashrc),
# then reload them in a new shell with: source /etc/profile

Step 2: Install Build Tools

# Install protobuf compiler
sudo apt update && sudo apt install -y protobuf-compiler

# Install Go tools
go install google.golang.org/protobuf/cmd/protoc-gen-go@latest
go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@latest
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest

Step 3: Download Dependencies

cd kubernetes_integration
go mod download
go mod tidy

Step 4: Generate Code

# Generate protobuf code
./scripts/generate-proto.sh

# Or generate manually:
export PATH=$PATH:$(go env GOPATH)/bin
protoc \
  --go_out=. \
  --go_opt=paths=source_relative \
  --go-grpc_out=. \
  --go-grpc_opt=paths=source_relative \
  pkg/proto/agent.proto

Step 5: Generate Kubernetes Manifests

# Generate CRD manifests
make manifests

# This creates:
# - config/crd/migration.io_migratableapps.yaml
# - config/rbac/role.yaml

Step 6: Build Binaries

# Build all binaries
make build

# Output:
# - bin/agent         (CRIU Agent)
# - bin/controller    (Migration Controller)
# - bin/node-monitor  (Node Monitor)

Step 7: Build Docker Images

# Download CRIU binary and build all images
make docker-build

# This will:
# 1. Download CRIU binary from S3
# 2. Build agent image (with CRIU)
# 3. Build controller image
# 4. Build node-monitor image

# Images created:
# - 192.168.0.253:5000/criu-agent:latest
# - 192.168.0.253:5000/criu-migration-controller:latest
# - 192.168.0.253:5000/criu-node-monitor:latest

Step 8: Push to Registry (Optional)

# Push all images to registry
make docker-push

# Or customize registry:
make docker-push REGISTRY=your-registry.com/yourorg

Complete Build Workflow

# Full build from scratch
source /etc/profile
cd kubernetes_integration

# 1. Install dependencies
go mod tidy

# 2. Generate code
./scripts/generate-proto.sh
make manifests

# 3. Build binaries
make build

# 4. Build and push Docker images
make docker-push

Installation

Prerequisites

  • Kubernetes cluster running (v1.20+)
  • kubectl configured to access the cluster
  • Docker images pushed to your registry (run make docker-push if you have not pushed them yet)

Method 1: Using Makefile (Recommended)

# 1. Install CRDs
make install

# 2. Deploy namespace, RBAC, controller and monitor
make deploy

# Or with custom registry:
make deploy REGISTRY=192.168.0.253:5000

# 3. Create storage credentials (see below)

Note: The make deploy command will automatically substitute the correct image references based on the REGISTRY variable. If you built and pushed images with a custom registry (e.g., make docker-push REGISTRY=192.168.0.253:5000), use the same REGISTRY value when deploying.

Method 2: Manual Installation

Step 1: Install CRDs

kubectl apply -f config/crd/migration.io_migratableapps.yaml

Step 2: Create Namespace, ServiceAccount and RBAC

# This will create:
# - Namespace: migration-system
# - ServiceAccount: migration-controller
# - ClusterRole and ClusterRoleBinding
# - Leader election Role and RoleBinding
kubectl apply -f config/rbac/rbac.yaml

Step 3: Deploy Controller and Node Monitor

# Deploy the controller deployment and node-monitor daemonset
kubectl apply -f config/manager/manager.yaml

Step 4: Verify Installation

# Check if pods are running
kubectl get pods -n migration-system

# Expected output:
# NAME                                     READY   STATUS    RESTARTS   AGE
# migration-controller-xxxxxxxxxx-xxxxx    1/1     Running   0          30s
# node-monitor-xxxxx                       1/1     Running   0          30s
# node-monitor-yyyyy                       1/1     Running   0          30s

Step 5: Configure Object Storage Credentials

Important: Create the secret in the migration-system namespace (where the controller is deployed). The controller will automatically inject these credentials into all MigratableApp pods, regardless of which namespace they run in.

For AWS S3:

kubectl create secret generic s3-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=your-access-key \
  --from-literal=AWS_SECRET_ACCESS_KEY=your-secret-key \
  -n migration-system

For MinIO:

kubectl create secret generic s3-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=minioadmin \
  --from-literal=AWS_SECRET_ACCESS_KEY=minioadmin \
  -n migration-system

Note: Only one secret in the migration-system namespace is needed. MigratableApps in any namespace will use this secret.

Quick Start

1. Create a MigratableApp

# example-app.yaml
apiVersion: migration.io/v1alpha1
kind: MigratableApp
metadata:
  name: my-web-app
  namespace: default
spec:
  template:
    metadata:
      labels:
        app: my-web-app
    spec:
      containers:
      - name: app
        image: python:3.9-slim
        command: ["python", "-c"]
        args:
        - |
          import time
          counter = 0
          while True:
              counter += 1
              print(f"Counter: {counter}")
              time.sleep(5)
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"

  checkpointPolicy:
    interval: "30s"
    autoAdjust: true
    memoryThresholdMB: 100
    maxCheckpointChainDepth: 10

  migrationPolicy:
    autoMigrate: true
    preferOnDemand: true
    migrationTimeoutSeconds: 300

  storage:
    type: s3
    bucket: my-checkpoint-bucket
    endpoint: http://minio.default.svc.cluster.local:9000
    region: us-east-1
    credentialsSecret: s3-credentials

2. Deploy the Application

kubectl apply -f example-app.yaml

3. Monitor the Application

# Watch MigratableApp status
kubectl get mapp my-web-app -w

# Get detailed status
kubectl describe mapp my-web-app

# View logs
kubectl logs -l migration.io/app=my-web-app -c criu-agent
kubectl logs -l migration.io/app=my-web-app -c app

4. Check Checkpoint Status

# View checkpoint information
kubectl get mapp my-web-app -o jsonpath='{.status.checkpointStatus}' | jq

# Output example:
# {
#   "lastCheckpointID": "abc123-1234567890",
#   "lastCheckpointTime": "2024-11-05T08:00:00Z",
#   "checkpointChainDepth": 3,
#   "checkpointChainRoot": "xyz789-1234567890"
# }

5. View Migration History

kubectl get mapp my-web-app -o jsonpath='{.status.migrationHistory}' | jq

# Output example:
# [
#   {
#     "fromNode": "node-1",
#     "toNode": "node-2",
#     "timestamp": "2024-11-05T08:05:00Z",
#     "reason": "spot-interrupt",
#     "duration": "15.2s",
#     "success": true
#   }
# ]

6. Trigger Manual Migration

# Add the trigger and reason annotations in a single update so the
# controller never observes the trigger without its reason
POD_NAME=$(kubectl get pod -l migration.io/app=my-web-app -o jsonpath='{.items[0].metadata.name}')
kubectl annotate pod "$POD_NAME" migration.io/trigger=requested migration.io/reason=manual
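
On the controller side, a check for these annotations could look like the sketch below. This is an assumption about how the trigger might be read, not the operator's actual reconcile code; only the annotation keys and the values `requested` and `manual` come from the commands above, and the `unspecified` fallback is hypothetical.

```go
package main

import "fmt"

const (
	triggerAnnotation = "migration.io/trigger"
	reasonAnnotation  = "migration.io/reason"
)

// migrationRequested reports whether the trigger annotation is set to
// "requested". It also returns the stated reason, or "unspecified" when the
// reason annotation is missing.
func migrationRequested(annotations map[string]string) (string, bool) {
	if annotations[triggerAnnotation] != "requested" {
		return "", false
	}
	reason := annotations[reasonAnnotation]
	if reason == "" {
		reason = "unspecified"
	}
	return reason, true
}

func main() {
	pod := map[string]string{
		triggerAnnotation: "requested",
		reasonAnnotation:  "manual",
	}
	if reason, ok := migrationRequested(pod); ok {
		fmt.Println("migration requested, reason:", reason)
	}
}
```

A controller that reacts as soon as it sees the trigger would record `unspecified` if the reason annotation had not been written yet, which is why the two values are best applied together.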

Configuration

Checkpoint Policy

checkpointPolicy:
  # Interval between pre-checkpoints
  interval: "30s"

  # Automatically adjust interval based on memory changes
  autoAdjust: true

  # Trigger checkpoint when memory changes exceed this threshold (MB)
  memoryThresholdMB: 100

  # Maximum checkpoint chain depth before full checkpoint
  maxCheckpointChainDepth: 10

Migration Policy

migrationPolicy:
  # Enable automatic migration on spot interrupt
  autoMigrate: true

  # Node selector for migration target
  targetNodeSelector:
    node-type: on-demand

  # Prefer on-demand nodes over spot
  preferOnDemand: true

  # Migration timeout (seconds)
  migrationTimeoutSeconds: 300
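
The sketch below shows one plausible reading of how `targetNodeSelector` and `preferOnDemand` combine: the selector is a hard filter on node labels, while the on-demand preference only orders the remaining candidates. The `node` type, its `Spot` field, and `pickTarget` are hypothetical; in a real cluster the final placement is made by the Kubernetes scheduler, and the operator may express these rules differently.

```go
package main

import "fmt"

// node is a simplified stand-in for a Kubernetes node.
type node struct {
	Name   string
	Labels map[string]string
	Spot   bool
}

// matches reports whether labels contains every key/value pair in selector.
// An empty or nil selector matches every node.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// pickTarget returns a migration target other than the source node. With
// preferOnDemand set, a spot node is used only when no on-demand node matches.
func pickTarget(nodes []node, selector map[string]string, preferOnDemand bool, source string) (string, bool) {
	fallback := ""
	for _, n := range nodes {
		if n.Name == source || !matches(n.Labels, selector) {
			continue
		}
		if !preferOnDemand || !n.Spot {
			return n.Name, true
		}
		if fallback == "" {
			fallback = n.Name
		}
	}
	return fallback, fallback != ""
}

func main() {
	nodes := []node{
		{Name: "node-1", Labels: map[string]string{"node-type": "spot"}, Spot: true},
		{Name: "node-2", Labels: map[string]string{"node-type": "on-demand"}},
		{Name: "node-3", Labels: map[string]string{"node-type": "spot"}, Spot: true},
	}
	selector := map[string]string{"node-type": "on-demand"}

	if target, ok := pickTarget(nodes, selector, true, "node-1"); ok {
		fmt.Println("migrate to", target)
	}
}
```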

Storage Configuration

AWS S3:

storage:
  type: s3
  bucket: my-bucket
  region: us-east-1
  credentialsSecret: aws-credentials

MinIO:

storage:
  type: minio
  bucket: my-bucket
  endpoint: http://minio.default.svc.cluster.local:9000
  region: us-east-1
  credentialsSecret: minio-credentials

GCS:

storage:
  type: gcs
  bucket: my-bucket
  credentialsSecret: gcs-credentials

Makefile Targets

# Development
make help              # Show all available targets
make generate          # Generate protobuf and deepcopy code
make fmt               # Format Go code
make vet               # Run Go vet
make test              # Run tests

# Build
make build             # Build binaries (agent, controller, node-monitor)

# Docker
make download-criu     # Download CRIU binary from S3
make docker-build      # Build Docker images (includes download-criu)
make docker-push       # Build and push Docker images

# Deployment
make manifests         # Generate CRD and RBAC manifests
make install           # Install CRDs to cluster
make uninstall         # Uninstall CRDs from cluster
make deploy            # Deploy controller and monitor
make undeploy          # Remove controller and monitor

# Dependencies
make controller-gen    # Install controller-gen
make protoc-gen-go     # Install protoc-gen-go
make protoc-gen-go-grpc # Install protoc-gen-go-grpc

Customization

Custom CRIU Binary

# Use custom CRIU binary URL
make docker-build CRIU_URL=https://your-server.com/criu

Custom Registry

# Build and push with custom registry
make docker-push REGISTRY=your-registry.com/yourorg

# Deploy with same custom registry
make deploy REGISTRY=your-registry.com/yourorg

# Complete workflow:
make docker-push REGISTRY=192.168.0.253:5000
make deploy REGISTRY=192.168.0.253:5000

The REGISTRY variable affects:

  • docker-build/docker-push: Sets the image tags for building and pushing
  • deploy: Automatically replaces image references in deployment YAML before applying to cluster
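
The Makefile's substitution mechanism is not shown here. As a rough illustration of what replacing image references involves, the Go sketch below rewrites the registry part of every `image:` line in a manifest while keeping the image name and tag. `retarget` and `rewriteManifest` are hypothetical helpers, and the sketch deliberately ignores the `- image:` list-item form and rewrites all images, not only the operator's.

```go
package main

import (
	"fmt"
	"strings"
)

// retarget keeps the final path component of image (name:tag) and prefixes it
// with registry.
func retarget(image, registry string) string {
	i := strings.LastIndex(image, "/")
	return registry + "/" + image[i+1:]
}

// rewriteManifest applies retarget to every line whose key is "image:",
// preserving the line's indentation.
func rewriteManifest(manifest, registry string) string {
	lines := strings.Split(manifest, "\n")
	for i, line := range lines {
		trimmed := strings.TrimSpace(line)
		if !strings.HasPrefix(trimmed, "image:") {
			continue
		}
		indent := line[:len(line)-len(strings.TrimLeft(line, " \t"))]
		old := strings.TrimSpace(strings.TrimPrefix(trimmed, "image:"))
		lines[i] = indent + "image: " + retarget(old, registry)
	}
	return strings.Join(lines, "\n")
}

func main() {
	manifest := "containers:\n- name: manager\n  image: 192.168.0.253:5000/criu-migration-controller:latest"
	fmt.Println(rewriteManifest(manifest, "your-registry.com/yourorg"))
}
```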

Custom Image Tags

Edit the Makefile:

AGENT_IMG ?= $(REGISTRY)/criu-agent:v1.0.0
CONTROLLER_IMG ?= $(REGISTRY)/criu-migration-controller:v1.0.0
MONITOR_IMG ?= $(REGISTRY)/criu-node-monitor:v1.0.0

Troubleshooting

Build Issues

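Problem: go: command not found

# Install Go (see Step 1), then add it to PATH for the current shell
export PATH=$PATH:/usr/local/go/bin
export PATH=$PATH:$(go env GOPATH)/bin
# If these exports are already in /etc/profile, reload it instead:
source /etc/profile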

Problem: protoc: command not found

# Install protobuf compiler
sudo apt install -y protobuf-compiler

Problem: controller-gen: command not found

# Install controller-gen
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest

Docker Build Issues

Problem: CRIU download fails

# Check CRIU URL and download manually
curl -L -o criu/criu https://mhsong-criu-s3-data.s3.us-west-2.amazonaws.com/criu
chmod +x criu/criu

Problem: Go version mismatch in Docker

# Dockerfiles use golang:1.25.3-alpine
# Make sure the go directive in go.mod is not newer than 1.25.3;
# otherwise the build fails with "go.mod requires go >= ..."

Runtime Issues

Problem: Agent connection failed

# Check agent pod logs
kubectl logs <pod-name> -c criu-agent

# Verify agent is running
kubectl exec <pod-name> -c criu-agent -- ps aux | grep agent

Problem: Checkpoint failed

# Check CRIU logs in the pod
kubectl exec <pod-name> -c criu-agent -- ls /checkpoints
kubectl exec <pod-name> -c criu-agent -- cat /checkpoints/<dump-id>/criu.log

# Verify CRIU is available
kubectl exec <pod-name> -c criu-agent -- criu check --all

Problem: Migration timeout

# Increase migration timeout
kubectl edit mapp <app-name>
# Set spec.migrationPolicy.migrationTimeoutSeconds to a higher value

Project Structure

kubernetes_integration/
├── api/v1alpha1/              # CRD API definitions
├── cmd/                        # Main applications
│   ├── agent/                 # CRIU Agent
│   ├── controller/            # Migration Controller
│   └── node-monitor/          # Node Monitor
├── pkg/                        # Libraries
│   ├── agent/                 # Agent implementation
│   ├── controller/            # Controller implementation
│   ├── scheduler/             # Checkpoint scheduler
│   ├── monitor/               # Spot monitor
│   └── proto/                 # gRPC definitions
├── config/                     # Kubernetes manifests
│   ├── crd/                   # CRD definitions
│   ├── rbac/                  # RBAC configs
│   ├── manager/               # Controller deployment
│   └── samples/               # Example applications
├── deploy/                     # Dockerfiles
│   ├── agent/
│   ├── controller/
│   └── node-monitor/
├── scripts/                    # Build scripts
├── Makefile                    # Build automation
└── README.md                   # This file

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

Apache License 2.0

Contact

For questions or support:
