
Commit 3275c6b

Merge pull request #10 from SQxiaoxiaomeng/fix-stress-parameters
update README and fix resource-create-object-type error
2 parents 594fd36 + 6affda6

File tree: 7 files changed (+72, -80 lines)

README.md

Lines changed: 18 additions & 17 deletions
@@ -7,14 +7,14 @@
 Kubernetes' centralized architecture and declarative management model, while enabling efficient operations, also introduce critical risks of cascading failures. The open ecosystem (with third-party components like Flink and Rancher) and complex multi-service environments further exacerbate these risks:
 
 - Cascading deletion disaster: A customer using Rancher to manage Kubernetes clusters accidentally deleted a namespace, which subsequently deleted all core business workloads and Pods in the production cluster, causing service interruption.
-- Control plane overload: In a large OpenAI cluster, deploying a DaemonSet monitoring component triggered control plane failures and coredns overload. The coredns scaling depended on control plane recovery, affecting the data plane and causing global OpenAI service outages.
-- Data plane's strong dependency on control plane: In open-source Flink on Kubernetes scenarios, kube-apiserver outages may cause Flink task checkpoint failures and leader election anomalies. In severe cases, it may trigger abnormal exits of all existing task Pods, leading to complete data plane collapse and major incidents.
+- Control plane overload: In a large OpenAI cluster, deploying a DaemonSet monitoring component triggered control plane failures and coredns overload. The coredns scaling depended on control plane recovery, affecting the data plane and causing global OpenAI service disruption.
+- Data plane's strong dependency on control plane: In open-source Flink on Kubernetes scenarios, kube-apiserver disruption may cause Flink task checkpoint failures and leader election anomalies. In severe cases, it may trigger abnormal exits of all existing task Pods, leading to complete data plane collapse and major incidents.
 
 These cases are not uncommon. The root cause lies in Kubernetes' architecture vulnerability chain - a single component failure or incorrect command can trigger global failures through centralized pathways.
 
 To proactively understand the impact duration and severity of control plane failures on services, we should conduct regular fault simulation and assessments to improve failure response capabilities, ensuring Kubernetes environment stability and reliability.
 
-This project provides Kubernetes chaos testing capabilities covering scenarios like node shutdown, accidental resource deletion, and control plane component (etcd, kube-apiserver, coredns, etc.) overload/outage, it will help you minimize blast radius of cluster failures.
+This project provides Kubernetes chaos testing capabilities covering scenarios like node shutdown, accidental resource deletion, and control plane component (etcd, kube-apiserver, coredns, etc.) overload/disruption, it will help you minimize blast radius of cluster failures.
 
 ## Prerequisites
 
@@ -47,6 +47,9 @@ kubectl get po -n tke-chaos-test
 ```
 
 5. Enable public access for `tke-chaos-test/tke-chaos-argo-workflows-server Service` in Tencent Cloud TKE Console. Access Argo Server UI at `LoadBalancer IP:2746` using credentials obtained via:
+
+Note: If the cluster restricts public access, please configure the Service for internal access and connect via internal network.
+
 ```bash
 # Get Argo Server UI access token
 kubectl exec -it -n tke-chaos-test deployment/tke-chaos-argo-workflows-server -- argo auth token
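
# For reference (not part of this commit's diff): once the Service is reachable, the same
# token also works against the Argo Server REST API. A minimal sketch; ARGO_HOST is a
# placeholder for the LoadBalancer (or internal) IP, and the namespace should match
# wherever your workflows are created. Depending on the Argo version, the token printed
# by "argo auth token" may or may not already include the "Bearer " prefix.
ARGO_TOKEN=$(kubectl exec -n tke-chaos-test deployment/tke-chaos-argo-workflows-server -- argo auth token)
curl -sk -H "Authorization: ${ARGO_TOKEN}" "https://${ARGO_HOST}:2746/api/v1/workflows/tke-chaos-test"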
@@ -62,7 +65,8 @@ Using `kube-apiserver overload` as an example:
 
 - Create kube-apiserver overload workflow:
 ```bash
-kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-template.yaml && kubectl create -f playbook/workflow/apiserver-overload-scenario.yaml
+kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-template.yaml
+kubectl create -f playbook/workflow/apiserver-overload-scenario.yaml
 ```
 
 ![apiserver overload flowchart](./playbook/docs/chaos-flowchart-en.png)
@@ -74,11 +78,9 @@ kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-te
 - **Execute Testing**: During kube-apiserver overload testing, the system floods `dest cluster`'s kube-apiserver with List Pod requests to simulate high load. Monitor kube-apiserver metrics via Tencent Cloud TKE Console and observe your business Pod health during testing.
 - **Result Processing**: View testing results in Argo Server UI (recommended) or via `kubectl describe workflow {workflow-name}`.
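
If you prefer the command line to the Argo Server UI, the same results can be inspected with kubectl against the Workflow CRD; a minimal sketch (the workflow name is whatever `kubectl get workflow` reports, and the namespace should match wherever the workflow was created):

```bash
# List chaos workflows and inspect one of them
kubectl get workflow
kubectl describe workflow <workflow-name>   # phase, per-step status, failure messages
# Watch the overall phase until the workflow finishes
kubectl get workflow <workflow-name> -w
```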
 
-### Stopping Tests
+### Deleting Tests
 ```bash
-# Stop tests
-kubectl get workflow
-kubectl delete workflow {workflow-name}
+kubectl delete -f playbook/workflow/apiserver-overload-scenario.yaml
 ```
 
 ## Roadmap
@@ -89,20 +91,19 @@ kubectl delete workflow {workflow-name}
 | etcd overload | - | Completed | - | Simulate etcd high load |
 | apiserver overload (APF) | - | Completed | - | Add Expensive List APF Policy,Simulate kube-apiserver high load |
 | etcd overload (ReadCache/Consistent cache) | - | Completed | - | Add Etcd Overload Protect Policy, Simulate etcd high load |
-| coredns outage | - | Completed | - | Simulate coredns service outage |
-| kubernetes-proxy outage | - | Completed | - | Simulate kubernetes-proxy outage |
+| coredns disruption | - | Completed | - | Simulate coredns service disruption |
+| kubernetes-proxy disruption | - | Completed | - | Simulate kubernetes-proxy disruption |
 | accidental deletion scenario | - | Completed | - | Simulate accidental resource deletion |
-| kube-apiserver outage | P0 | In Progress | 2025-06-15 | Simulate kube-apiserver outage |
-| etcd outage | P0 | In Progress | 2025-06-15 | Simulate etcd cluster failure |
-| kube-scheduler outage | P0 | In Progress | 2025-06-15 | Test scheduling behavior during scheduler failure |
-| kube-controller-manager outage | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| cloud-controller-manager outage | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| master node shutdown | P1 | In Progress | 2025-06-15 | Simulate master node poweroff |
+| TKE managed cluster kube-apiserver disruption | - | Completed | - | Simulate kube-apiserver disruption |
+| TKE managed cluster kube-scheduler disruption | - | Completed | - | Test scheduling behavior during scheduler failure |
+| TKE managed cluster kube-controller-manager disruption | - | Completed | - | Validate controller component failure scenarios |
+| TKE Self-Maintenance Cluster master node shutdown | P1 | In Progress | 2025-06-30 | Simulate master node poweroff |
+| etcd disruption | P1 | In Progress | 2025-06-30 | Simulate etcd cluster failure |
 
 ## FAQ
 1. Why use two clusters for fault simulation?
 
-Testings are orchestrated using Argo Workflow, which follows a CRD-based pattern heavily dependent on kube-apiserver. Using a single cluster for fault simulation (especially apiserver/etcd overload or outage tests) would make kube-apiserver unavailable, preventing Argo Workflow Controller from functioning and halting the entire workflow.
+Testings are orchestrated using Argo Workflow, which follows a CRD-based pattern heavily dependent on kube-apiserver. Using a single cluster for fault simulation (especially apiserver/etcd overload or disruption tests) would make kube-apiserver unavailable, preventing Argo Workflow Controller from functioning and halting the entire workflow.
 
 2. How to track testing progress after starting?
 
README_zh.md

Lines changed: 14 additions & 14 deletions
@@ -46,7 +46,10 @@ kubectl create -f playbook/install-argo.yaml
 kubectl get po -n tke-chaos-test
 ```
 
-5. Enable public access for the `tke-chaos-test/tke-chaos-argo-workflows-server Service` in the Tencent Cloud `TKE Console`, open `LoadBalancer IP:2746` in a browser, and log in to the `Argo UI` with the `Argo Server UI` credentials obtained by the command below; the `Argo UI` shows the details of the drill process.
+5. Enable public access for the `tke-chaos-test/tke-chaos-argo-workflows-server Service` in the Tencent Cloud `TKE Console` and open `LoadBalancer IP:2746` in a browser. Log in to the `Argo UI` with the `Argo Server UI` credentials obtained by the command below; the `Argo UI` can be used to view the details of the drill process.
+
+Note: If the cluster restricts public access, configure the Service for internal access and connect over the internal network.
+
 ```bash
 # Get the Argo Server UI access token
 kubectl exec -it -n tke-chaos-test deployment/tke-chaos-argo-workflows-server -- argo auth token
@@ -62,11 +65,11 @@ kubectl exec -it -n tke-chaos-test deployment/tke-chaos-argo-workflows-server --
 
 - Create the `kube-apiserver` overload drill `workflow`:
 ```bash
-kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-template.yaml && kubectl create -f playbook/workflow/apiserver-overload-scenario.yaml
+kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-template.yaml
+kubectl create -f playbook/workflow/apiserver-overload-scenario.yaml
 ```
 
-![apiserver overload drill flowchart](./playbook/docs/chaos-flowchart-zh.png)
-
+![drill flowchart](./playbook/docs/chaos-flowchart-zh.png)
 
 **Core process description**
 
@@ -75,11 +78,9 @@ kubectl create -f playbook/rbac.yaml && kubectl create -f playbook/all-in-one-te
 - **Execute the drill**: During the `kube-apiserver` overload drill, a flood of `List Pod` requests is sent to the `target cluster`'s `kube-apiserver` to simulate a high-load scenario. You can open the core component monitoring of the `target cluster` in the Tencent Cloud `TKE Console` to check the `kube-apiserver` load. You should also watch the health of your business Pods during the drill to verify whether `kube-apiserver` overload affects your business.
 - **Drill results**: You can view the results in the `Argo Server UI` (recommended) or by running `kubectl describe workflow {workflow-name}`.
 
-### Stopping Tests
+### Deleting the Drill
 ```bash
-# Stop tests
-kubectl get workflow
-kubectl delete worflow {workflow-name}
+kubectl delete -f playbook/workflow/apiserver-overload-scenario.yaml
 ```
 
 ## Roadmap
@@ -93,12 +94,11 @@ kubectl delete worflow {workflow-name}
 | coredns disruption | - | Completed | - | Simulate coredns service interruption |
 | kubernetes-proxy disruption | - | Completed | - | Simulate kubernetes-proxy service interruption |
 | accidental resource deletion | - | Completed | - | Simulate accidental resource deletion |
-| kube-apiserver disruption drill | P0 | In Progress | 2025-06-15 | Simulate kube-apiserver service interruption |
-| etcd disruption drill | P0 | In Progress | 2025-06-15 | Simulate etcd cluster failure |
-| kube-scheduler disruption drill | P0 | In Progress | 2025-06-15 | Test cluster scheduling behavior during scheduler failure |
-| kube-controller-manager disruption drill | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| cloud-controller-manager disruption drill | P0 | In Progress | 2025-06-15 | Validate controller component failure scenarios |
-| master node shutdown | P1 | In Progress | 2025-06-15 | Simulate master node poweroff |
+| TKE managed cluster kube-apiserver disruption drill | - | Completed | - | Simulate kube-apiserver service interruption |
+| TKE managed cluster kube-scheduler disruption drill | - | Completed | - | Test cluster scheduling behavior during scheduler failure |
+| TKE managed cluster kube-controller-manager disruption drill | - | Completed | - | Validate controller component failure scenarios |
+| TKE self-maintained cluster master node shutdown | P1 | In Progress | 2025-06-30 | Simulate master node poweroff |
+| etcd disruption drill | P1 | In Progress | 2025-06-30 | Simulate etcd cluster failure |
 
 ## FAQ
 1. Why use two clusters for the drill tests?

playbook/README.md

Lines changed: 11 additions & 16 deletions
@@ -100,20 +100,11 @@ This scenario tests Tencent Cloud TKE's namespace deletion block policy with the
 
 Tencent Cloud TKE supports various resource protection policies, such as CRD deletion protection, PV deletion protection, etc. You can refer to the official Tencent Cloud documentation for more details: [Policy Management](https://cloud.tencent.com/document/product/457/103179)
 
-## TKE Self-maintenance of Master cluster's kube-apiserver Disruption
-TODO
-
-## TKE Self-maintenance of Master cluster's etcd Disruption
-TODO
-
-## TKE Self-maintenance of Master cluster's kube-controller-manager Disruption
-TODO
-
-## TKE Self-maintenance of Master cluster's kube-scheduler Disruption
-TODO
-
 ## Managed Cluster Master Component Disruption
 
+1. Your cluster name must contain the words `Chaos Experiment` or `混沌演练` and the cluster size must be smaller than `L1000`, otherwise the Tencent Cloud API call will fail
+2. You need to modify the `region`, `secret-id`, `secret-key`, and `cluster-id` parameters in the YAML file ([Parameter Explanation](#managed-cluster-master-component-parameters))
+
 **playbooks**:
 1. kube-apiserver disruption: `workflow/managed-cluster-apiserver-shutdown-scenario.yaml`
 2. kube-controller-manager disruption: `workflow/managed-cluster-controller-manager-shutdown-scenario.yaml`
@@ -144,16 +135,20 @@ kubectl create -f workflow/managed-cluster-master-component/shutdown-apiserver.y
 kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.yaml
 ```
 
+<a id="managed-cluster-master-component-parameters"></a>
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `region` | `string` | <REGION> | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
-| `secret-id` | `string` | <SECRET_ID> | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
-| `secret-key` | `string` | <SECRET_KEY> | Tencent Cloud API secret key |
-| `cluster-id` | `string` | <CLUSTER_ID> | Target cluster ID |
+| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
+| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
+| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
+| `cluster-id` | `string` | `<CLUSTER_ID>` | Target cluster ID |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Secret name containing target cluster kubeconfig |
 
 **Notes**
 1. Will affect master component availability during test
 2. Recommended to execute in non-production environments or maintenance windows
+
+## Self-Maintenance Cluster Master Component Disruption
+TODO
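
For orientation, the parameter table above maps onto the values the shutdown/restore manifests expect. A minimal sketch of how a filled-in parameter section might look, assuming the manifest exposes them as Argo Workflow arguments; the surrounding structure is illustrative, not copied from the actual file:

```yaml
# Sketch only: how the managed-cluster shutdown workflow's parameters might be filled in.
# The real structure of workflow/managed-cluster-master-component/shutdown-apiserver.yaml
# may differ; parameter names are taken from the table above.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: shutdown-apiserver-
spec:
  arguments:
    parameters:
      - name: region
        value: "ap-guangzhou"            # Tencent Cloud region of the target cluster
      - name: secret-id
        value: "<SECRET_ID>"             # Tencent Cloud API secret ID
      - name: secret-key
        value: "<SECRET_KEY>"            # Tencent Cloud API secret key
      - name: cluster-id
        value: "<CLUSTER_ID>"            # target cluster ID
      - name: kubeconfig-secret-name
        value: "dest-cluster-kubeconfig" # Secret holding the target cluster kubeconfig
```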

playbook/README_zh.md

Lines changed: 11 additions & 19 deletions
@@ -100,20 +100,11 @@
 
 Tencent Cloud TKE supports a large number of resource protection policies, such as `CRD` deletion protection and `PV` deletion protection. See the official Tencent Cloud documentation for details: [Policy Management](https://cloud.tencent.com/document/product/457/103179)
 
-## TKE Self-maintained Master Cluster kube-apiserver Disruption
-TODO
-
-## TKE Self-maintained Master Cluster etcd Disruption
-TODO
-
-## TKE Self-maintained Master Cluster kube-controller-manager Disruption
-TODO
-
-## TKE Self-maintained Master Cluster kube-scheduler Disruption
-TODO
-
 ## Managed Cluster master Component Disruption
 
+1. Your cluster name must contain `Chaos Experiment` or `混沌演练` and the cluster size must be smaller than `L1000`, otherwise the Tencent Cloud API call will fail
+2. You need to modify the `region`, `secret-id`, `secret-key`, and `cluster-id` parameters in the drill `YAML` file ([Parameter Explanation](#托管集群master组件停服参数说明))
+
 **playbook**:
 1. kube-apiserver shutdown & restore: `workflow/managed-cluster-apiserver-shutdown-scenario.yaml`
 2. kube-controller-manager shutdown & restore: `workflow/managed-cluster-controller-manager-shutdown-scenario.yaml`
@@ -143,20 +134,21 @@ kubectl create -f workflow/managed-cluster-master-component/shutdown-apiserver.y
 ```bash
 kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.yaml
 ```
-
+<a id="托管集群master组件停服参数说明"></a>
 **Parameters**
 
-You need to modify the `region`, `secret-id`, `secret-key`, and `cluster-id` parameters in the drill `YAML` file; the parameters are described below:
-
 | Parameter | Type | Default | Description |
 |---------|------|--------|------|
-| `region` | `string` | <REGION> | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
-| `secret-id` | `string` | <SECRET_ID> | Tencent Cloud API secret ID; it can be obtained from [API Key Management](https://console.cloud.tencent.com/cam/capi) in the console |
-| `secret-key` | `string` | <SECRET_KEY> | Tencent Cloud API secret key |
-| `cluster-id` | `string` | <CLUSTER_ID> | Drill cluster ID |
+| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
+| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID; it can be obtained from [API Key Management](https://console.cloud.tencent.com/cam/capi) in the console |
+| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
+| `cluster-id` | `string` | `<CLUSTER_ID>` | Drill cluster ID |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
 **Notes**
 
 2. The drill affects the availability of the cluster `master` components
 3. Recommended to execute in non-production environments or maintenance windows
+
+## Self-maintained Cluster master Component Disruption
+TODO

playbook/all-in-one-template.yaml

Lines changed: 3 additions & 3 deletions
@@ -979,7 +979,7 @@ spec:
 name: resource-archestrate
 template: resource-create
 clusterScope: true
-when: "'{{inputs.parameters.enable-resource-create}}' == 'true'"
+when: "{{steps.precheck.status}} == Succeeded && '{{inputs.parameters.enable-resource-create}}' == 'true'"
 
 - - name: notify-inject-stress # notification: start injecting the fault
 continueOn:
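
The `when` change above gates the resource-creation step on the `precheck` step having succeeded, in addition to the existing `enable-resource-create` flag. A minimal standalone sketch of this Argo Workflows guard pattern, with illustrative step and template names rather than the template's real ones:

```yaml
# Sketch: a later step runs only if an earlier step succeeded AND a flag is "true".
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: when-guard-demo-
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: enable-resource-create
        value: "true"
  templates:
    - name: main
      steps:
        - - name: precheck          # illustrative stand-in for the precheck step
            template: noop
        - - name: resource-create   # guarded step, same condition shape as the commit
            template: noop
            when: "{{steps.precheck.status}} == Succeeded && '{{workflow.parameters.enable-resource-create}}' == 'true'"
    - name: noop
      container:
        image: busybox
        command: [sh, -c, "echo ok"]
```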
@@ -1110,14 +1110,14 @@ spec:
 arguments:
 parameters:
 - name: cmd
-value: "delete -n {{inputs.parameters.resource-create-namespace}} {{inputs.parameters.resource-create-object-type}} --all"
+value: "delete -n {{inputs.parameters.resource-create-namespace}} {{inputs.parameters.resource-create-object-type}} -l kubestress"
 - name: kubeconfig-secret-name
 value: "{{inputs.parameters.kubeconfig-secret-name}}"
 templateRef:
 name: kubectl-cmd
 template: kubectl-cmd
 clusterScope: true
-when: "'{{inputs.parameters.enable-resource-create}}' == 'true'"
+when: "{{steps.precheck.status}} == Succeeded && '{{inputs.parameters.enable-resource-create}}' == 'true'"
 
 
 - name: metrics-collect-then-notify-to-wechat
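
The `cmd` change above narrows the cleanup from `--all` to a label selector, so the delete only touches objects carrying a `kubestress` label key (the label the stress-created objects are assumed to carry) instead of every object of that type in the namespace. A hedged illustration of the difference, with a made-up namespace and object type:

```bash
# Before: removes every ConfigMap in the namespace, including unrelated ones.
kubectl delete -n stress-test configmaps --all

# After: removes only objects that carry the "kubestress" label key,
# i.e. the ones created for the stress test.
kubectl delete -n stress-test configmaps -l kubestress
```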
