Commit 79fa202

Merge branch 'main' into chaos-palybook
2 parents 6d92014 + 2cbcbb9 commit 79fa202

2 files changed: 44 additions, 33 deletions

README.md

Lines changed: 23 additions & 17 deletions
@@ -1,4 +1,4 @@
-# Kubernetes Control Plane Chaos Testing Playbooks Guide
+# Kubernetes Chaos Testing Playbooks Guide

 [English](README.md) | [中文](README_zh.md)

@@ -10,7 +10,11 @@ Kubernetes' centralized architecture and declarative management model, while ena
 - Control plane overload: In a large OpenAI cluster, deploying a DaemonSet monitoring component triggered control plane failures and coredns overload. The coredns scaling depended on control plane recovery, affecting the data plane and causing global OpenAI service outages.
 - Data plane's strong dependency on control plane: In open-source Flink on Kubernetes scenarios, kube-apiserver outages may cause Flink task checkpoint failures and leader election anomalies. In severe cases, it may trigger abnormal exits of all existing task Pods, leading to complete data plane collapse and major incidents.

-These cases are not uncommon. The root cause lies in Kubernetes' architecture vulnerability chain - a single component failure or incorrect command can trigger global failures through centralized pathways. To proactively understand the impact duration and severity of control plane failures on services, we should conduct regular fault simulation and assessments to improve failure response capabilities, ensuring Kubernetes environment stability and reliability. This project provides Kubernetes chaos testing capabilities covering scenarios like node shutdown, accidental resource deletion, and control plane component (etcd, kube-apiserver, coredns, etc.) overload/outage.
+These cases are not uncommon. The root cause lies in Kubernetes' architectural vulnerability chain: a single component failure or incorrect command can trigger global failures through centralized pathways.
+
+To understand in advance how long and how severely control plane failures affect services, we should run regular fault simulations and assessments. These improve failure response and help keep Kubernetes environments stable and reliable.
+
+This project provides Kubernetes chaos testing capabilities covering scenarios such as node shutdown, accidental resource deletion, and overload or outage of control plane components (etcd, kube-apiserver, coredns, etc.). It helps you minimize the blast radius of cluster failures.

 ## Prerequisites

@@ -77,21 +81,23 @@ kubectl get workflow
 kubectl delete workflow {workflow-name}
 ```

-## Feature Roadmap
-
-| Supported Features | Priority | Status | Planned Release | Description |
-|----------------------------------|----------|-------------|-----------------|------------------------------|
-| apiserver overload | - | Completed | - | Simulate kube-apiserver high load |
-| etcd overload | - | Completed | - | Simulate etcd high load |
-| coredns outage | - | Completed | - | Simulate coredns service outage |
-| kubernetes-proxy outage | - | Completed | - | Simulate kubernetes-proxy outage |
-| accidental deletion scenario | - | Completed | - | Simulate accidental resource deletion |
-| kube-apiserver outage | P0 | In Progress | 2025-06-15 | Simulate kube-apiserver outage |
-| etcd outage | P0 | In Progress | 2025-06-15 | Simulate etcd cluster failure |
-| kube-scheduler outage | P0 | In Progress | 2025-06-15 | Test scheduling behavior during scheduler failure |
-| kube-controller-manager outage | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| cloud-controller-manager outage | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| master node shutdown | P1 | In Progress | 2025-06-15 | Simulate master node poweroff |
+## Roadmap
+
+| Supported Features | Priority | Status | Planned Release | Description |
+|--------------------------------------------|----------|-------------|-----------------|-----------------------------------------------------------------|
+| apiserver overload | - | Completed | - | Simulate kube-apiserver high load |
+| etcd overload | - | Completed | - | Simulate etcd high load |
+| apiserver overload (APF) | - | Completed | - | Add an Expensive List APF policy, then simulate kube-apiserver high load |
+| etcd overload (ReadCache/Consistent cache) | - | Completed | - | Add an etcd overload protection policy, then simulate etcd high load |
+| coredns outage | - | Completed | - | Simulate coredns service outage |
+| kubernetes-proxy outage | - | Completed | - | Simulate kubernetes-proxy outage |
+| accidental deletion scenario | - | Completed | - | Simulate accidental resource deletion |
+| kube-apiserver outage | P0 | In Progress | 2025-06-15 | Simulate kube-apiserver outage |
+| etcd outage | P0 | In Progress | 2025-06-15 | Simulate etcd cluster failure |
+| kube-scheduler outage | P0 | In Progress | 2025-06-15 | Test scheduling behavior during scheduler failure |
+| kube-controller-manager outage | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
+| cloud-controller-manager outage | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
+| master node shutdown | P1 | In Progress | 2025-06-15 | Simulate master node poweroff |

 ## FAQ
 1. Why use two clusters for fault simulation?

README_zh.md

Lines changed: 21 additions & 16 deletions
@@ -1,4 +1,4 @@
-# K8s Control Plane Chaos Testing Playbooks Guide
+# Kubernetes Chaos Testing Playbooks Guide

 [English](README.md) | [中文](README_zh.md)

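@@ -10,7 +10,11 @@
 - Control plane overload: after a `DaemonSet` monitoring component was deployed in a large `OpenAI` cluster, it triggered control plane failures and `coredns` overload. Scaling `coredns` in turn depended on control plane recovery, so the data plane was affected and `OpenAI` services were interrupted globally.
 - Strong data plane dependency on the control plane: in the open-source `Flink on K8s` scenario, a `kube-apiserver` outage may cause `flink` task `checkpoint` failures and leader election anomalies. In severe cases it may also cause all existing task `Pod`s to exit abnormally, collapsing the entire data plane and leading to a major incident.

-Such cases are not rare. At their core, these risks stem from the vulnerability propagation chain in the `K8s` architecture: a single component anomaly or a single incorrect command can trigger a global failure through centralized pathways. To understand in advance how long and how severely control plane failures affect services, we should run drills and assessments regularly, improving how well services cope with failures and providing strong assurance for the stability and reliability of `K8s` environments. Built around the scenarios above, this project provides K8s chaos testing capabilities and supports drill scenarios such as node shutdown, accidental resource deletion, and overload and outage of control plane components (such as etcd, kube-apiserver, and coredns).
+Such cases are not rare. At their core, these risks stem from the vulnerability propagation chain in the `K8s` architecture: a single component anomaly or a single incorrect command can trigger a global failure through centralized pathways.
+
+To understand in advance how long and how severely control plane failures affect services, we should run drills and assessments regularly, improving how well services cope with failures and providing strong assurance for the stability and reliability of `K8s` environments.
+
+Built around the scenarios above, this project provides K8s chaos testing capabilities and supports drill scenarios such as node shutdown, accidental resource deletion, and overload and outage of control plane components (such as etcd, kube-apiserver, and coredns), helping you reduce the blast radius of cluster failures.

 ## Prerequisites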

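@@ -80,20 +84,21 @@ kubectl delete workflow {workflow-name}

 ## Feature Roadmap

-| Supported Features | Priority | Status | Planned Release | Description |
-|---------------------------------|--------|------------|---------------|-----------------------------|
-| apiserver overload drill | - | Completed | - | Simulate high load on the kube-apiserver service |
-| etcd overload drill | - | Completed | - | Simulate high load on the etcd service |
-| coredns outage | - | Completed | - | Simulate a coredns service outage |
-| kubernetes-proxy outage | - | Completed | - | Simulate a kubernetes-proxy service outage |
-| accidental resource deletion scenario | - | Completed | - | Simulate accidental deletion of resources |
-| kube-apiserver outage drill | P0 | In Progress | 2025-06-15 | Simulate a kube-apiserver service outage |
-| etcd outage drill | P0 | In Progress | 2025-06-15 | Simulate an etcd cluster failure |
-| kube-scheduler outage drill | P0 | In Progress | 2025-06-15 | Test cluster scheduling behavior during a scheduler failure |
-| kube-controller-manager outage drill | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| cloud-controller-manager outage drill | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| master node shutdown | P1 | In Progress | 2025-06-15 | Simulate a master shutdown |
-
+| Supported Features | Priority | Status | Planned Release | Description |
+|------------------------------|--------|------------|---------------|---------------------------------------------------|
+| apiserver overload drill | - | Completed | - | Simulate high load on the kube-apiserver service |
+| etcd overload drill | - | Completed | - | Simulate high load on the etcd service |
+| apiserver overload drill (with APF policy protection) | - | Completed | - | Add an Expensive List APF overload protection policy, then simulate high load on the kube-apiserver service |
+| etcd overload drill (with etcd overload protection policy) | - | Completed | - | Add an etcd overload protection policy, then simulate high load on the etcd service |
+| coredns outage | - | Completed | - | Simulate a coredns service outage |
+| kubernetes-proxy outage | - | Completed | - | Simulate a kubernetes-proxy service outage |
+| accidental resource deletion scenario | - | Completed | - | Simulate accidental deletion of resources |
+| kube-apiserver outage drill | P0 | In Progress | 2025-06-15 | Simulate a kube-apiserver service outage |
+| etcd outage drill | P0 | In Progress | 2025-06-15 | Simulate an etcd cluster failure |
+| kube-scheduler outage drill | P0 | In Progress | 2025-06-15 | Test cluster scheduling behavior during a scheduler failure |
+| kube-controller-manager outage drill | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
+| cloud-controller-manager outage drill | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
+| master node shutdown | P1 | In Progress | 2025-06-15 | Simulate a master shutdown |

 ## FAQ
 1. Why use two clusters to run the drill tests?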
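
The roadmap rows added in this commit mention an APF (API Priority and Fairness) policy for expensive list requests, but the diff does not show the policy itself. As a rough illustration only, the sketch below writes a manifest pairing a `PriorityLevelConfiguration` with a `FlowSchema` that routes pod `list` requests into a small, rejecting concurrency pool. The object names, the share value, the precedence, and the choice of pods as the matched resource are all assumptions made for this example and are not taken from the project.

```shell
#!/bin/sh
# Sketch only: object names, share values, and the matched resource are
# illustrative assumptions, not the policy shipped by this project.
set -eu

# Output path; override with MANIFEST=/some/path if desired.
manifest="${MANIFEST:-expensive-list-apf.yaml}"

cat > "$manifest" <<'EOF'
# Priority level with a small concurrency share; excess requests are rejected.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: expensive-list            # hypothetical name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 5   # assumed value, tune per cluster
    limitResponse:
      type: Reject
---
# Route list requests for pods into the priority level above.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: expensive-list-pods       # hypothetical name
spec:
  priorityLevelConfiguration:
    name: expensive-list
  matchingPrecedence: 8500        # assumed value; lower numbers are evaluated first
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: Group
          group:
            name: system:authenticated
      resourceRules:
        - verbs: ["list"]
          apiGroups: [""]
          resources: ["pods"]
          clusterScope: true
          namespaces: ["*"]
EOF

echo "wrote $manifest"
```

On a disposable test cluster that serves the `flowcontrol.apiserver.k8s.io/v1` API (Kubernetes 1.29 or later), the generated file could be applied with `kubectl apply -f expensive-list-apf.yaml` and inspected with `kubectl get flowschema`. The script only writes the file, so it never touches a cluster by itself.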
