
Commit 8a30aa3

Author: northjhuang
Commit message: add suspend in workflow disruption template
1 parent 3275c6b commit 8a30aa3

11 files changed: +314 −54 lines changed

playbook/README.md

Lines changed: 33 additions & 12 deletions
@@ -33,6 +33,19 @@ Supports enabling `etcd Overload Protection` and `APF Flow Control` [APF Rate Li
 | `inject-stress-list-qps` | `int` | "100" | QPS per stress test Pod |
 | `inject-stress-total-duration` | `string` | "30s" | Total test duration (e.g. 30s, 5m) |
 
+**Recommended Parameters for TKE Clusters**
+
+| Cluster Level | resource-create-object-size-bytes | resource-create-object-count | resource-create-qps | inject-stress-concurrency | inject-stress-list-qps |
+|---------------|-----------------------------------|------------------------------|---------------------|---------------------------|------------------------|
+| L5    | 10000  | 100   | 10  | 6  | 200 |
+| L50   | 10000  | 300   | 10  | 6  | 200 |
+| L100  | 50000  | 500   | 20  | 6  | 200 |
+| L200  | 100000 | 1000  | 50  | 9  | 200 |
+| L500  | 100000 | 1000  | 50  | 12 | 200 |
+| L1000 | 100000 | 3000  | 50  | 12 | 300 |
+| L3000 | 100000 | 6000  | 500 | 18 | 500 |
+| L5000 | 100000 | 10000 | 500 | 21 | 500 |
+
 **etcd Overload Protection & Enhanced APF**
 
 Tencent Cloud TKE team has developed these core protection features:
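The recommended-parameters table above is a lookup from cluster level to stress-test settings, which can be scripted. A minimal sketch, not part of the playbook: the `recommended_params` helper name is hypothetical, and the `argo submit` command at the end is only printed, not executed.

```shell
#!/usr/bin/env bash
# Hypothetical helper encoding the recommended-parameters table above:
# prints "size count qps concurrency list-qps" for a given cluster level.
recommended_params() {
  case "$1" in
    L5)    echo "10000 100 10 6 200" ;;
    L50)   echo "10000 300 10 6 200" ;;
    L100)  echo "50000 500 20 6 200" ;;
    L200)  echo "100000 1000 50 9 200" ;;
    L500)  echo "100000 1000 50 12 200" ;;
    L1000) echo "100000 3000 50 12 300" ;;
    L3000) echo "100000 6000 500 18 500" ;;
    L5000) echo "100000 10000 500 21 500" ;;
    *)     echo "unknown cluster level: $1" >&2; return 1 ;;
  esac
}

# Turn the recommendation for an L100 cluster into `argo submit` flags
# (printed here rather than run, so no cluster access is needed).
read -r size count qps conc list_qps <<<"$(recommended_params L100)"
echo "argo submit workflow/apiserver-overload-scenario.yaml \
  -p resource-create-object-size-bytes=$size \
  -p resource-create-object-count=$count \
  -p resource-create-qps=$qps \
  -p inject-stress-concurrency=$conc \
  -p inject-stress-list-qps=$list_qps"
```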
@@ -56,31 +69,39 @@ Supported versions:
 **playbook**: `workflow/coredns-disruption-scenario.yaml`
 
 This scenario simulates coredns service disruption by:
-1. Scaling coredns Deployment replicas to 0
-2. Maintaining zero replicas for specified duration
-3. Restoring original replica count
+
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to confirm the cluster is available for testing
+
+2. **Component Shutdown**: Log in to the Argo Web UI, open the `coredns-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the coredns Deployment down to 0 replicas
+
+3. **Service Validation**: While coredns is down, verify whether your services are affected by the disruption
+
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the coredns Deployment to its original replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
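During the Service Validation step you might poll cluster DNS in a loop while coredns is scaled down. A minimal sketch, not part of the playbook: the default probe command (`nslookup kubernetes.default`) is a stand-in for your own service check and only resolves from inside the cluster, so the probe command is overridable.

```shell
#!/usr/bin/env bash
# Probe DNS once and report the result; pass your own probe command to
# replace the default in-cluster lookup.
check_dns() {
  local probe_cmd="${1:-nslookup kubernetes.default}"
  if $probe_cmd >/dev/null 2>&1; then
    echo "dns ok"
  else
    echo "dns FAILED"
  fi
}
```

For example, `while true; do check_dns; sleep 5; done` run from a Pod during the disruption window shows when resolution starts failing and when it recovers.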
 ## kubernetes-proxy Disruption
 
 **playbook**: `workflow/kubernetes-proxy-disruption-scenario.yaml`
 
 This scenario simulates kubernetes-proxy service disruption by:
-1. Scaling kubernetes-proxy Deployment replicas to 0
-2. Maintaining zero replicas for specified duration
-3. Restoring original replica count
+
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to confirm the cluster is available for testing
+
+2. **Component Shutdown**: Log in to the Argo Web UI, open the `kubernetes-proxy-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the kubernetes-proxy Deployment down to 0 replicas
+
+3. **Service Validation**: While kubernetes-proxy is down, verify whether your services are affected by the disruption
+
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the kubernetes-proxy Deployment to its original replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
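After the Component Recovery step you can confirm the Deployment is back at its desired size. A sketch, not part of the playbook: the `KUBECTL` variable is overridable so the logic can be exercised without a live cluster, and the jsonpath queries read the Deployment's desired vs. ready replica counts.

```shell
#!/usr/bin/env bash
# Compare a Deployment's desired replicas against its ready replicas.
KUBECTL="${KUBECTL:-kubectl}"

replicas_restored() {
  local ns="$1" name="$2" want ready
  want=$($KUBECTL -n "$ns" get deployment "$name" -o jsonpath='{.spec.replicas}')
  ready=$($KUBECTL -n "$ns" get deployment "$name" -o jsonpath='{.status.readyReplicas}')
  if [ -n "$ready" ] && [ "$ready" -ge "$want" ]; then
    echo "restored ($ready/$want)"
  else
    echo "not ready (${ready:-0}/$want)"
  fi
}
```

For example, `replicas_restored default kubernetes-proxy` after clicking `RESUME` on `suspend-2` should eventually report the Deployment as restored.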
## Namespace Deletion Protection
@@ -140,10 +161,10 @@ kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.ya
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
-| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
-| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
-| `cluster-id` | `string` | `<CLUSTER_ID>` | Target cluster ID |
+| `region` | `string` | "" | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
+| `secret-id` | `string` | "" | Tencent Cloud API secret ID, obtain from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
+| `secret-key` | `string` | "" | Tencent Cloud API secret key |
+| `cluster-id` | `string` | "" | Target cluster ID |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Secret name containing target cluster kubeconfig |
 
 **Notes**

playbook/README_zh.md

Lines changed: 27 additions & 12 deletions
@@ -33,6 +33,19 @@
 | `inject-stress-list-qps` | `int` | "100" | QPS per stress test `Pod` |
 | `inject-stress-total-duration` | `string` | "30s" | Total stress test duration (e.g. 30s, 5m) |
 
+**Recommended Stress Test Parameters for TKE Clusters**
+
+| Cluster Level | resource-create-object-size-bytes | resource-create-object-count | resource-create-qps | inject-stress-concurrency | inject-stress-list-qps |
+|---------------|-----------------------------------|------------------------------|---------------------|---------------------------|------------------------|
+| L5    | 10000  | 100   | 10  | 6  | 200 |
+| L50   | 10000  | 300   | 10  | 6  | 200 |
+| L100  | 50000  | 500   | 20  | 6  | 200 |
+| L200  | 100000 | 1000  | 50  | 9  | 200 |
+| L500  | 100000 | 1000  | 50  | 12 | 200 |
+| L1000 | 100000 | 3000  | 50  | 12 | 300 |
+| L3000 | 100000 | 6000  | 500 | 18 | 500 |
+| L5000 | 100000 | 10000 | 500 | 21 | 500 |
+
 **etcd Overload Protection & Enhanced APF Rate Limiting**
 
 On top of the community version, the Tencent Cloud TKE team has developed the following core protection features:
@@ -56,31 +69,33 @@
 **playbook**: `workflow/coredns-disruption-scenario.yaml`
 
 This scenario simulates a `coredns` service disruption as follows:
-1. Scale the `coredns Deployment` down to `0` replicas
-2. Keep the replica count at `0` for the specified duration
-3. Restore the original replica count
+
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to confirm the cluster is available for the drill
+2. **Component Shutdown**: Log in to the Argo Web UI, open the `coredns-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the `coredns Deployment` down to `0` replicas
+3. **Service Validation**: While `coredns` is down, verify whether your workloads are affected by the `coredns` outage
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the `coredns Deployment` replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name; if empty, the drill runs against the current cluster |
 
 ## kubernetes-proxy Disruption
 
 **playbook**: `workflow/kubernetes-proxy-disruption-scenario.yaml`
 
 This scenario simulates a `kubernetes-proxy` service disruption as follows:
-1. Scale the `kubernetes-proxy` `Deployment` down to `0` replicas
-2. Keep the replica count at `0` for the specified duration
-3. Restore the original replica count
+
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to confirm the cluster is available for the drill
+2. **Component Shutdown**: Log in to the Argo Web UI, open the `kubernetes-proxy-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the `kubernetes-proxy Deployment` down to `0` replicas
+3. **Service Validation**: While `kubernetes-proxy` is down, verify whether your workloads are affected by the `kubernetes-proxy` outage
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the `kubernetes-proxy Deployment` replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name; if empty, the drill runs against the current cluster |
 
 ## Namespace Deletion Protection
@@ -139,10 +154,10 @@ kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.ya
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
-| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID; obtain it from the console under [API Key Management](https://console.cloud.tencent.com/cam/capi) |
-| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
-| `cluster-id` | `string` | `<CLUSTER_ID>` | Drill cluster ID |
+| `region` | `string` | "" | Tencent Cloud region, e.g. `ap-guangzhou` [Region List](https://www.tencentcloud.com/zh/document/product/213/6091) |
+| `secret-id` | `string` | "" | Tencent Cloud API secret ID; obtain it from the console under [API Key Management](https://console.cloud.tencent.com/cam/capi) |
+| `secret-key` | `string` | "" | Tencent Cloud API secret key |
+| `cluster-id` | `string` | "" | Drill cluster ID |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
 **Notes**

playbook/all-in-one-template.yaml

Lines changed: 3 additions & 3 deletions
@@ -1392,8 +1392,6 @@ spec:
   - name: main
     inputs:
       parameters:
-      - name: disruption-duration
-        description: "Disruption duration"
       - name: workload-type
         description: "Workload type to test; allowed values: daemonset/deployment/statefulset"
       - name: workload-name
@@ -1409,6 +1407,8 @@ spec:
         default: "tke-chaos-test"
         description: "Namespace of the pre-check ConfigMap"
     steps:
+    - - name: suspend-1
+        template: suspend
     - - name: precheck
         arguments:
           parameters:
@@ -1437,7 +1437,7 @@ spec:
           - name: kubeconfig-secret-name
             value: "{{inputs.parameters.kubeconfig-secret-name}}"
         template: scale-down-workload
-    - - name: suspend
+    - - name: suspend-2
         template: suspend
     - - name: scale-up-workload
         arguments:
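The `suspend-1`/`suspend-2` steps added above pause the workflow until someone resumes it. Besides the `RESUME` button in the Argo Web UI, the Argo CLI can resume a suspended workflow. A sketch, not part of the playbook: the `ARGO` variable is overridable so the logic can be exercised without a cluster, and the workflow name is your workflow's generated name.

```shell
#!/usr/bin/env bash
# Resume a suspended Argo workflow from the CLI and report success.
ARGO="${ARGO:-argo}"

resume_suspended() {
  local workflow="$1"
  $ARGO resume "$workflow" && echo "resumed $workflow"
}
```

Calling this once should resume the workflow at `suspend-1`; calling it again once the workflow reaches `suspend-2` should resume that node as well.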

playbook/template/workload-disruption-template.yaml

Lines changed: 3 additions & 3 deletions
@@ -13,8 +13,6 @@ spec:
   - name: main
     inputs:
       parameters:
-      - name: disruption-duration
-        description: "Disruption duration"
       - name: workload-type
         description: "Workload type to test; allowed values: daemonset/deployment/statefulset"
       - name: workload-name
@@ -30,6 +28,8 @@ spec:
         default: "tke-chaos-test"
         description: "Namespace of the pre-check ConfigMap"
     steps:
+    - - name: suspend-1
+        template: suspend
     - - name: precheck
         arguments:
           parameters:
@@ -58,7 +58,7 @@ spec:
           - name: kubeconfig-secret-name
             value: "{{inputs.parameters.kubeconfig-secret-name}}"
         template: scale-down-workload
-    - - name: suspend
+    - - name: suspend-2
         template: suspend
     - - name: scale-up-workload
         arguments:

playbook/workflow/apiserver-overload-scenario.yaml

Lines changed: 26 additions & 0 deletions
@@ -153,6 +153,10 @@ spec:
   templates:
   - name: main
     steps:
+    - - name: validate-params
+        template: validate-params
+    - - name: suspend
+        template: suspend
     - - name: create-apf  # create the APF rate limit before the drill starts
         arguments:
           parameters:
@@ -272,3 +276,25 @@ spec:
         template: etcd-protect-cm-orchestrate
         clusterScope: true
       when: "'{{workflow.parameters.enable-etcd-overload-protect}}' == 'true'"
+
+  - name: suspend
+    suspend: {}
+
+  - name: validate-params
+    script:
+      image: bitnami/kubectl:1.32.4
+      command: [bash]
+      source: |
+        #!/bin/bash
+        set -e
+        if [[ -z "{{workflow.parameters.kubeconfig-secret-name}}" ]]; then
+          echo "[ERROR] kubeconfig-secret-name parameter cannot be empty" > /tmp/validate_result
+          exit 1
+        fi
+        echo "Parameter validation passed"
+    outputs:
+      parameters:
+      - name: result
+        valueFrom:
+          default: "null"
+          path: /tmp/validate_result
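The `validate-params` script step added above boils down to an emptiness check. Here it is as a standalone function you can run locally, with the parameter value passed explicitly instead of substituted by Argo via `{{workflow.parameters.kubeconfig-secret-name}}`; the function name is illustrative, not from the template.

```shell
#!/usr/bin/env bash
# Reject an empty kubeconfig-secret-name, mirroring the workflow's
# validate-params script step.
validate_kubeconfig_secret() {
  if [[ -z "$1" ]]; then
    echo "[ERROR] kubeconfig-secret-name parameter cannot be empty"
    return 1
  fi
  echo "Parameter validation passed"
}
```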

playbook/workflow/coredns-disruption-scenario.yaml

Lines changed: 39 additions & 6 deletions
@@ -11,8 +11,6 @@ spec:
   serviceAccountName: tke-chaos
   arguments:
     parameters:
-    - name: disruption-duration
-      value: "30s"
     - name: workload-type
       value: "deployment"
     - name: workload-name
@@ -21,7 +19,42 @@ spec:
       value: "kube-system"
     - name: kubeconfig-secret-name
       value: "dest-cluster-kubeconfig"
-  serviceAccountName: tke-chaos
-  workflowTemplateRef:
-    name: workload-disruption-template
-    clusterScope: true
+  templates:
+  - name: main
+    steps:
+    - - name: validate-params
+        template: validate-params
+    - - name: run-coredns-disruption
+        templateRef:
+          name: workload-disruption-template
+          template: main
+          clusterScope: true
+        arguments:
+          parameters:
+          - name: workload-type
+            value: "{{workflow.parameters.workload-type}}"
+          - name: workload-name
+            value: "{{workflow.parameters.workload-name}}"
+          - name: workload-namespace
+            value: "{{workflow.parameters.workload-namespace}}"
+          - name: kubeconfig-secret-name
+            value: "{{workflow.parameters.kubeconfig-secret-name}}"
+
+  - name: validate-params
+    script:
+      image: bitnami/kubectl:1.32.4
+      command: [bash]
+      source: |
+        #!/bin/bash
+        set -e
+        if [[ -z "{{workflow.parameters.kubeconfig-secret-name}}" ]]; then
+          echo "[ERROR] kubeconfig-secret-name parameter cannot be empty" > /tmp/validate_result
+          exit 1
+        fi
+        echo "Parameter validation passed"
+    outputs:
+      parameters:
+      - name: result
+        valueFrom:
+          default: "null"
+          path: /tmp/validate_result

playbook/workflow/kubernetes-proxy-disruption-scenario.yaml

Lines changed: 39 additions & 6 deletions
@@ -11,8 +11,6 @@ spec:
   serviceAccountName: tke-chaos
   arguments:
     parameters:
-    - name: disruption-duration
-      value: "30s"
     - name: workload-type
       value: "deployment"
     - name: workload-name
@@ -21,7 +19,42 @@ spec:
       value: "default"
     - name: kubeconfig-secret-name
       value: "dest-cluster-kubeconfig"
-  serviceAccountName: tke-chaos
-  workflowTemplateRef:
-    name: workload-disruption-template
-    clusterScope: true
+  templates:
+  - name: main
+    steps:
+    - - name: validate-params
+        template: validate-params
+    - - name: run-kubernetes-proxy-disruption
+        templateRef:
+          name: workload-disruption-template
+          template: main
+          clusterScope: true
+        arguments:
+          parameters:
+          - name: workload-type
+            value: "{{workflow.parameters.workload-type}}"
+          - name: workload-name
+            value: "{{workflow.parameters.workload-name}}"
+          - name: workload-namespace
+            value: "{{workflow.parameters.workload-namespace}}"
+          - name: kubeconfig-secret-name
+            value: "{{workflow.parameters.kubeconfig-secret-name}}"
+
+  - name: validate-params
+    script:
+      image: bitnami/kubectl:1.32.4
+      command: [bash]
+      source: |
+        #!/bin/bash
+        set -e
+        if [[ -z "{{workflow.parameters.kubeconfig-secret-name}}" ]]; then
+          echo "[ERROR] kubeconfig-secret-name parameter cannot be empty" > /tmp/validate_result
+          exit 1
+        fi
+        echo "Parameter validation passed"
+    outputs:
+      parameters:
+      - name: result
+        valueFrom:
+          default: "null"
+          path: /tmp/validate_result
