
Commit 3275c6b

Merge pull request #10 from SQxiaoxiaomeng/fix-stress-parameters
update README and fix resource-create-object-type error
2 parents 594fd36 + 6affda6

File tree: 7 files changed (+72, -80 lines)

README.md

Lines changed: 18 additions & 17 deletions
@@ -7,14 +7,14 @@
 Kubernetes' centralized architecture and declarative management model, while enabling efficient operations, also introduce critical risks of cascading failures. The open ecosystem (with third-party components like Flink and Rancher) and complex multi-service environments further exacerbate these risks:
 
 - Cascading deletion disaster: A customer using Rancher to manage Kubernetes clusters accidentally deleted a namespace, which subsequently deleted all core business workloads and Pods in the production cluster, causing service interruption.
-- Control plane overload: In a large OpenAI cluster, deploying a DaemonSet monitoring component triggered control plane failures and coredns overload. The coredns scaling depended on control plane recovery, affecting the data plane and causing global OpenAI service outages.
-- Data plane's strong dependency on control plane: In open-source Flink on Kubernetes scenarios, kube-apiserver outages may cause Flink task checkpoint failures and leader election anomalies. In severe cases, it may trigger abnormal exits of all existing task Pods, leading to complete data plane collapse and major incidents.
+- Control plane overload: In a large OpenAI cluster, deploying a DaemonSet monitoring component triggered control plane failures and coredns overload. The coredns scaling depended on control plane recovery, affecting the data plane and causing global OpenAI service disruption.
+- Data plane's strong dependency on control plane: In open-source Flink on Kubernetes scenarios, kube-apiserver disruption may cause Flink task checkpoint failures and leader election anomalies. In severe cases, it may trigger abnormal exits of all existing task Pods, leading to complete data plane collapse and major incidents.
 
 These cases are not uncommon. The root cause lies in Kubernetes' architecture vulnerability chain - a single component failure or incorrect command can trigger global failures through centralized pathways.
 
 To proactively understand the impact duration and severity of control plane failures on services, we should conduct regular fault simulation and assessments to improve failure response capabilities, ensuring Kubernetes environment stability and reliability.
 
-This project provides Kubernetes chaos testing capabilities covering scenarios like node shutdown, accidental resource deletion, and control plane component (etcd, kube-apiserver, coredns, etc.) overload/outage, it will help you minimize blast radius of cluster failures.
+This project provides Kubernetes chaos testing capabilities covering scenarios like node shutdown, accidental resource deletion, and control plane component (etcd, kube-apiserver, coredns, etc.) overload/disruption, it will help you minimize blast radius of cluster failures.
 
 ## Prerequisites
 
@@ -47,6 +47,9 @@ kubectl get po -n tke-chaos-test
 ```
 
 5. Enable public access for `tke-chaos-test/tke-chaos-argo-workflows-server Service` in Tencent Cloud TKE Console. Access Argo Server UI at `LoadBalancer IP:2746` using credentials obtained via:
+
+Note: If the cluster restricts public access, please configure the Service for internal access and connect via internal network.
+
 ```bash
 # Get Argo Server UI access token
 kubectl exec -it -n tke-chaos-test deployment/tke-chaos-argo-workflows-server -- argo auth token
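
# For reference (not part of this commit's diff): once the Service is reachable, the same
# token also works against the Argo Server REST API. A minimal sketch; ARGO_HOST is a
# placeholder for the LoadBalancer (or internal) IP, and the namespace should match
# wherever your workflows are created. Depending on the Argo version, the token printed
# by "argo auth token" may or may not already include the "Bearer " prefix.
ARGO_TOKEN=$(kubectl exec -n tke-chaos-test deployment/tke-chaos-argo-workflows-server -- argo auth token)
curl -sk -H "Authorization: ${ARGO_TOKEN}" "https://${ARGO_HOST}:2746/api/v1/workflows/tke-chaos-test"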
@@ -62,7 +65,8 @@ Using `kube-apiserver overload` as an example:
 
 - Create kube-apiserver overload workflow:
 ```bash
-kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-template.yaml && kubectl create -f playbook/workflow/apiserver-overload-scenario.yaml
+kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-template.yaml
+kubectl create -f playbook/workflow/apiserver-overload-scenario.yaml
 ```
 
 ![apiserver overload flowchart](./playbook/docs/chaos-flowchart-en.png)
@@ -74,11 +78,9 @@ kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-te
 - **Execute Testing**: During kube-apiserver overload testing, the system floods `dest cluster`'s kube-apiserver with List Pod requests to simulate high load. Monitor kube-apiserver metrics via Tencent Cloud TKE Console and observe your business Pod health during testing.
 - **Result Processing**: View testing results in Argo Server UI (recommended) or via `kubectl describe workflow {workflow-name}`.
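
If you prefer the command line to the Argo Server UI, the same results can be inspected with kubectl against the Workflow CRD; a minimal sketch (the workflow name is whatever `kubectl get workflow` reports, and the namespace should match wherever the workflow was created):

```bash
# List chaos workflows and inspect one of them
kubectl get workflow
kubectl describe workflow <workflow-name>   # phase, per-step status, failure messages
# Watch the overall phase until the workflow finishes
kubectl get workflow <workflow-name> -w
```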
 
-### Stopping Tests
+### Deleting Tests
 ```bash
-# Stop tests
-kubectl get workflow
-kubectl delete workflow {workflow-name}
+kubectl delete -f playbook/workflow/apiserver-overload-scenario.yaml
 ```
 
 ## Roadmap
@@ -89,20 +91,19 @@ kubectl delete workflow {workflow-name}
 | etcd overload | - | Completed | - | Simulate etcd high load |
 | apiserver overload (APF) | - | Completed | - | Add Expensive List APF Policy,Simulate kube-apiserver high load |
 | etcd overload (ReadCache/Consistent cache) | - | Completed | - | Add Etcd Overload Protect Policy, Simulate etcd high load |
-| coredns outage | - | Completed | - | Simulate coredns service outage |
-| kubernetes-proxy outage | - | Completed | - | Simulate kubernetes-proxy outage |
+| coredns disruption | - | Completed | - | Simulate coredns service disruption |
+| kubernetes-proxy disruption | - | Completed | - | Simulate kubernetes-proxy disruption |
 | accidental deletion scenario | - | Completed | - | Simulate accidental resource deletion |
-| kube-apiserver outage | P0 | In Progress | 2025-06-15 | Simulate kube-apiserver outage |
-| etcd outage | P0 | In Progress | 2025-06-15 | Simulate etcd cluster failure |
-| kube-scheduler outage | P0 | In Progress | 2025-06-15 | Test scheduling behavior during scheduler failure |
-| kube-controller-manager outage | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| cloud-controller-manager outage | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| master node shutdown | P1 | In Progress | 2025-06-15 | Simulate master node poweroff |
+| TKE managed cluster kube-apiserver disruption | - | Completed | - | Simulate kube-apiserver disruption |
+| TKE managed cluster kube-scheduler disruption | - | Completed | - | Test scheduling behavior during scheduler failure |
+| TKE managed cluster kube-controller-manager disruption | - | Completed | - | Validate controller component failure scenarios |
+| TKE Self-Maintenance Cluster master node shutdown | P1 | In Progress | 2025-06-30 | Simulate master node poweroff |
+| etcd disruption | P1 | In Progress | 2025-06-30 | Simulate etcd cluster failure |
 
 ## FAQ
 1. Why use two clusters for fault simulation?
 
-Testings are orchestrated using Argo Workflow, which follows a CRD-based pattern heavily dependent on kube-apiserver. Using a single cluster for fault simulation (especially apiserver/etcd overload or outage tests) would make kube-apiserver unavailable, preventing Argo Workflow Controller from functioning and halting the entire workflow.
+Testings are orchestrated using Argo Workflow, which follows a CRD-based pattern heavily dependent on kube-apiserver. Using a single cluster for fault simulation (especially apiserver/etcd overload or disruption tests) would make kube-apiserver unavailable, preventing Argo Workflow Controller from functioning and halting the entire workflow.
 
 2. How to track testing progress after starting?
 
README_zh.md

Lines changed: 14 additions & 14 deletions
@@ -46,7 +46,10 @@ kubectl create -f playbook/install-argo.yaml
 kubectl get po -n tke-chaos-test
 ```
 
-5. Enable public access for the `tke-chaos-test/tke-chaos-argo-workflows-server Service` in the Tencent Cloud `TKE Console`, open `LoadBalancer IP:2746` in a browser, and log in to the `Argo UI` with the `Argo Server UI` credentials obtained by the command below; the `Argo UI` shows the details of the drill process.
+5. Enable public access for the `tke-chaos-test/tke-chaos-argo-workflows-server Service` in the Tencent Cloud `TKE Console` and open `LoadBalancer IP:2746` in a browser. Log in to the `Argo UI` with the `Argo Server UI` credentials obtained by the command below; the `Argo UI` can be used to view the details of the drill process.
+
+Note: If the cluster restricts public access, configure the Service for internal access and connect over the internal network.
+
 ```bash
 # Get the Argo Server UI access token
 kubectl exec -it -n tke-chaos-test deployment/tke-chaos-argo-workflows-server -- argo auth token
@@ -62,11 +65,11 @@ kubectl exec -it -n tke-chaos-test deployment/tke-chaos-argo-workflows-server --
 
 - Create the `kube-apiserver` overload drill `workflow`:
 ```bash
-kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-template.yaml && kubectl create -f playbook/workflow/apiserver-overload-scenario.yaml
+kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-template.yaml
+kubectl create -f playbook/workflow/apiserver-overload-scenario.yaml
 ```
 
-![apiserver overload drill flowchart](./playbook/docs/chaos-flowchart-zh.png)
-
+![drill flowchart](./playbook/docs/chaos-flowchart-zh.png)
 
 **Core process description**
 
@@ -75,11 +78,9 @@ kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-te
 - **Execute the drill**: During the `kube-apiserver` overload drill, a flood of `List Pod` requests is sent to the `target cluster`'s `kube-apiserver` to simulate a high-load scenario. You can open the core component monitoring of the `target cluster` in the Tencent Cloud `TKE Console` to check the `kube-apiserver` load. You should also watch the health of your business Pods during the drill to verify whether `kube-apiserver` overload affects your business.
 - **Drill results**: You can view the results in the `Argo Server UI` (recommended) or by running `kubectl describe workflow {workflow-name}`.
 
-### Stopping Tests
+### Deleting the Drill
 ```bash
-# Stop tests
-kubectl get workflow
-kubectl delete worflow {workflow-name}
+kubectl delete -f playbook/workflow/apiserver-overload-scenario.yaml
 ```
 
 ## Roadmap
@@ -93,12 +94,11 @@ kubectl delete worflow {workflow-name}
 | coredns disruption | - | Completed | - | Simulate coredns service interruption |
 | kubernetes-proxy disruption | - | Completed | - | Simulate kubernetes-proxy service interruption |
 | accidental resource deletion | - | Completed | - | Simulate accidental resource deletion |
-| kube-apiserver disruption drill | P0 | In Progress | 2025-06-15 | Simulate kube-apiserver service interruption |
-| etcd disruption drill | P0 | In Progress | 2025-06-15 | Simulate etcd cluster failure |
-| kube-scheduler disruption drill | P0 | In Progress | 2025-06-15 | Test cluster scheduling behavior during scheduler failure |
-| kube-controller-manager disruption drill | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| cloud-controller-manager disruption drill | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| master node shutdown | P1 | In Progress | 2025-06-15 | Simulate master node poweroff |
+| TKE managed cluster kube-apiserver disruption drill | - | Completed | - | Simulate kube-apiserver service interruption |
+| TKE managed cluster kube-scheduler disruption drill | - | Completed | - | Test cluster scheduling behavior during scheduler failure |
+| TKE managed cluster kube-controller-manager disruption drill | - | Completed | - | Validate controller component failure scenarios |
+| TKE self-maintained cluster master node shutdown | P1 | In Progress | 2025-06-30 | Simulate master node poweroff |
+| etcd disruption drill | P1 | In Progress | 2025-06-30 | Simulate etcd cluster failure |
 
 ## FAQ
 1. Why use two clusters for the drill tests?

playbook/README.md

Lines changed: 11 additions & 16 deletions
@@ -100,20 +100,11 @@ This scenario tests Tencent Cloud TKE's namespace deletion block policy with the
 
 Tencent Cloud TKE supports various resource protection policies, such as CRD deletion protection, PV deletion protection, etc. You can refer to the official Tencent Cloud documentation for more details: [Policy Management](https://cloud.tencent.com/document/product/457/103179)
 
-## TKE Self-maintenance of Master cluster's kube-apiserver Disruption
-TODO
-
-## TKE Self-maintenance of Master cluster's etcd Disruption
-TODO
-
-## TKE Self-maintenance of Master cluster's kube-controller-manager Disruption
-TODO
-
-## TKE Self-maintenance of Master cluster's kube-scheduler Disruption
-TODO
-
 ## Managed Cluster Master Component Disruption
 
+1. Your cluster name must contain the words `Chaos Experiment` or `混沌演练` and the cluster size must be smaller than `L1000`, otherwise the Tencent Cloud API call will fail
+2. You need to modify the `region`, `secret-id`, `secret-key`, and `cluster-id` parameters in the YAML file ([Parameter Explanation](#managed-cluster-master-component-parameters))
+
 **playbooks**:
 1. kube-apiserver disruption: `workflow/managed-cluster-apiserver-shutdown-scenario.yaml`
 2. kube-controller-manager disruption: `workflow/managed-cluster-controller-manager-shutdown-scenario.yaml`
@@ -144,16 +135,20 @@ kubectl create -f workflow/managed-cluster-master-component/shutdown-apiserver.y
 kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.yaml
 ```
 
+<a id="managed-cluster-master-component-parameters"></a>
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `region` | `string` | <REGION> | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
-| `secret-id` | `string` | <SECRET_ID> | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
-| `secret-key` | `string` | <SECRET_KEY> | Tencent Cloud API secret key |
-| `cluster-id` | `string` | <CLUSTER_ID> | Target cluster ID |
+| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
+| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
+| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
+| `cluster-id` | `string` | `<CLUSTER_ID>` | Target cluster ID |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Secret name containing target cluster kubeconfig |
 
 **Notes**
 1. Will affect master component availability during test
 2. Recommended to execute in non-production environments or maintenance windows
+
+## Self-Maintenance Cluster Master Component Disruption
+TODO
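
For orientation, the parameter table above maps onto the values the shutdown/restore manifests expect. A minimal sketch of how a filled-in parameter section might look, assuming the manifest exposes them as Argo Workflow arguments; the surrounding structure is illustrative, not copied from the actual file:

```yaml
# Sketch only: how the managed-cluster shutdown workflow's parameters might be filled in.
# The real structure of workflow/managed-cluster-master-component/shutdown-apiserver.yaml
# may differ; parameter names are taken from the table above.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: shutdown-apiserver-
spec:
  arguments:
    parameters:
      - name: region
        value: "ap-guangzhou"            # Tencent Cloud region of the target cluster
      - name: secret-id
        value: "<SECRET_ID>"             # Tencent Cloud API secret ID
      - name: secret-key
        value: "<SECRET_KEY>"            # Tencent Cloud API secret key
      - name: cluster-id
        value: "<CLUSTER_ID>"            # target cluster ID
      - name: kubeconfig-secret-name
        value: "dest-cluster-kubeconfig" # Secret holding the target cluster kubeconfig
```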

playbook/README_zh.md

Lines changed: 11 additions & 19 deletions
@@ -100,20 +100,11 @@
 
 Tencent Cloud TKE supports a large number of resource protection policies, such as `CRD` deletion protection and `PV` deletion protection. See the official Tencent Cloud documentation for details: [Policy Management](https://cloud.tencent.com/document/product/457/103179)
 
-## TKE Self-maintained Master Cluster kube-apiserver Disruption
-TODO
-
-## TKE Self-maintained Master Cluster etcd Disruption
-TODO
-
-## TKE Self-maintained Master Cluster kube-controller-manager Disruption
-TODO
-
-## TKE Self-maintained Master Cluster kube-scheduler Disruption
-TODO
-
 ## Managed Cluster master Component Disruption
 
+1. Your cluster name must contain `Chaos Experiment` or `混沌演练` and the cluster size must be smaller than `L1000`, otherwise the Tencent Cloud API call will fail
+2. You need to modify the `region`, `secret-id`, `secret-key`, and `cluster-id` parameters in the drill `YAML` file ([Parameter Explanation](#托管集群master组件停服参数说明))
+
 **playbook**:
 1. kube-apiserver shutdown & restore: `workflow/managed-cluster-apiserver-shutdown-scenario.yaml`
 2. kube-controller-manager shutdown & restore: `workflow/managed-cluster-controller-manager-shutdown-scenario.yaml`
@@ -143,20 +134,21 @@ kubectl create -f workflow/managed-cluster-master-component/shutdown-apiserver.y
 ```bash
 kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.yaml
 ```
-
+<a id="托管集群master组件停服参数说明"></a>
 **Parameters**
 
-You need to modify the `region`, `secret-id`, `secret-key`, and `cluster-id` parameters in the drill `YAML` file; the parameters are described below:
-
 | Parameter | Type | Default | Description |
 |---------|------|--------|------|
-| `region` | `string` | <REGION> | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
-| `secret-id` | `string` | <SECRET_ID> | Tencent Cloud API secret ID; it can be obtained from [API Key Management](https://console.cloud.tencent.com/cam/capi) in the console |
-| `secret-key` | `string` | <SECRET_KEY> | Tencent Cloud API secret key |
-| `cluster-id` | `string` | <CLUSTER_ID> | Drill cluster ID |
+| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
+| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID; it can be obtained from [API Key Management](https://console.cloud.tencent.com/cam/capi) in the console |
+| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
+| `cluster-id` | `string` | `<CLUSTER_ID>` | Drill cluster ID |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
 **Notes**
 
 2. The drill affects the availability of the cluster `master` components
 3. Recommended to execute in non-production environments or maintenance windows
+
+## Self-maintained Cluster master Component Disruption
+TODO

playbook/all-in-one-template.yaml

Lines changed: 3 additions & 3 deletions
@@ -979,7 +979,7 @@ spec:
 name: resource-archestrate
 template: resource-create
 clusterScope: true
-when: "'{{inputs.parameters.enable-resource-create}}' == 'true'"
+when: "{{steps.precheck.status}} == Succeeded && '{{inputs.parameters.enable-resource-create}}' == 'true'"
 
 - - name: notify-inject-stress # notification: start injecting the fault
 continueOn:
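
The `when` change above gates the resource-creation step on the `precheck` step having succeeded, in addition to the existing `enable-resource-create` flag. A minimal standalone sketch of this Argo Workflows guard pattern, with illustrative step and template names rather than the template's real ones:

```yaml
# Sketch: a later step runs only if an earlier step succeeded AND a flag is "true".
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: when-guard-demo-
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: enable-resource-create
        value: "true"
  templates:
    - name: main
      steps:
        - - name: precheck          # illustrative stand-in for the precheck step
            template: noop
        - - name: resource-create   # guarded step, same condition shape as the commit
            template: noop
            when: "{{steps.precheck.status}} == Succeeded && '{{workflow.parameters.enable-resource-create}}' == 'true'"
    - name: noop
      container:
        image: busybox
        command: [sh, -c, "echo ok"]
```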
@@ -1110,14 +1110,14 @@ spec:
 arguments:
 parameters:
 - name: cmd
-value: "delete -n {{inputs.parameters.resource-create-namespace}} {{inputs.parameters.resource-create-object-type}} --all"
+value: "delete -n {{inputs.parameters.resource-create-namespace}} {{inputs.parameters.resource-create-object-type}} -l kubestress"
 - name: kubeconfig-secret-name
 value: "{{inputs.parameters.kubeconfig-secret-name}}"
 templateRef:
 name: kubectl-cmd
 template: kubectl-cmd
 clusterScope: true
-when: "'{{inputs.parameters.enable-resource-create}}' == 'true'"
+when: "{{steps.precheck.status}} == Succeeded && '{{inputs.parameters.enable-resource-create}}' == 'true'"
 
 
 - name: metrics-collect-then-notify-to-wechat
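
The `cmd` change above narrows the cleanup from `--all` to a label selector, so the delete only touches objects carrying a `kubestress` label key (the label the stress-created objects are assumed to carry) instead of every object of that type in the namespace. A hedged illustration of the difference, with a made-up namespace and object type:

```bash
# Before: removes every ConfigMap in the namespace, including unrelated ones.
kubectl delete -n stress-test configmaps --all

# After: removes only objects that carry the "kubestress" label key,
# i.e. the ones created for the stress test.
kubectl delete -n stress-test configmaps -l kubestress
```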
