
Commit 6af8fd6

Author: northjhuang
Message: add suspend in workflow disruption template
1 parent: 3275c6b

File tree

6 files changed: +58 -26 lines

playbook/README.md

Lines changed: 29 additions & 8 deletions
```diff
@@ -33,6 +33,19 @@ Supports enabling `etcd Overload Protection` and `APF Flow Control` [APF Rate Li
 | `inject-stress-list-qps` | `int` | "100" | QPS per stress test Pod |
 | `inject-stress-total-duration` | `string` | "30s" | Total test duration (e.g. 30s, 5m) |
 
+**Recommended Parameters for TKE Clusters**
+
+| Cluster Level | resource-create-object-size-bytes | resource-create-object-count | resource-create-qps | inject-stress-concurrency | inject-stress-list-qps |
+|---------------|-----------------------------------|------------------------------|---------------------|---------------------------|------------------------|
+| L5    | 10000  | 100   | 10  | 6  | 200 |
+| L50   | 10000  | 300   | 10  | 6  | 200 |
+| L100  | 50000  | 500   | 20  | 6  | 200 |
+| L200  | 100000 | 1000  | 50  | 9  | 200 |
+| L500  | 100000 | 1000  | 50  | 12 | 200 |
+| L1000 | 100000 | 3000  | 50  | 12 | 300 |
+| L3000 | 100000 | 6000  | 500 | 18 | 500 |
+| L5000 | 100000 | 10000 | 500 | 21 | 500 |
+
 **etcd Overload Protection & Enhanced APF**
 
 Tencent Cloud TKE team has developed these core protection features:
```
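The recommended values above map directly onto workflow arguments. A minimal sketch for an L100 cluster, using the parameter names from the table; the placement of the `arguments` block inside the full workflow spec is assumed:

```yaml
# Sketch: stress-test arguments for an L100 cluster, values taken
# from the recommendation table above. Surrounding spec fields
# (metadata, entrypoint, etc.) are assumed and omitted.
arguments:
  parameters:
  - name: resource-create-object-size-bytes
    value: "50000"
  - name: resource-create-object-count
    value: "500"
  - name: resource-create-qps
    value: "20"
  - name: inject-stress-concurrency
    value: "6"
  - name: inject-stress-list-qps
    value: "200"
```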
```diff
@@ -56,31 +69,39 @@ Supported versions:
 **playbook**: `workflow/coredns-disruption-scenario.yaml`
 
 This scenario simulates coredns service disruption by:
-1. Scaling coredns Deployment replicas to 0
-2. Maintaining zero replicas for specified duration
-3. Restoring original replica count
+
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to ensure the cluster is available for testing
+
+2. **Component Shutdown**: Log in to the Argo Web UI, open the `coredns-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the coredns Deployment down to 0 replicas
+
+3. **Service Validation**: While coredns is down, verify whether your services are affected by the disruption
+
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the coredns Deployment replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
 ## kubernetes-proxy Disruption
 
 **playbook**: `workflow/kubernetes-proxy-disruption-scenario.yaml`
 
 This scenario simulates kubernetes-proxy service disruption by:
-1. Scaling kubernetes-proxy Deployment replicas to 0
-2. Maintaining zero replicas for specified duration
-3. Restoring original replica count
+
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster to ensure the cluster is available for testing
+
+2. **Component Shutdown**: Log in to the Argo Web UI, open the `kubernetes-proxy-disruption-scenario` workflow, then click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the kubernetes-proxy Deployment down to 0 replicas
+
+3. **Service Validation**: While kubernetes-proxy is down, verify whether your services are affected by the disruption
+
+4. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the kubernetes-proxy Deployment replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name |
 
 ## Namespace Deletion Protection
```
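The manual `RESUME` steps described for both scenarios rely on Argo's suspend template. A minimal sketch of the template that the `suspend-1`/`suspend-2` steps reference; only the `suspend` field is load-bearing, and its placement in the full template list is assumed:

```yaml
# Sketch: the suspend template referenced by suspend-1 and suspend-2.
# A step using this template blocks until an operator resumes it from
# the Argo UI or CLI; with no duration set, it waits indefinitely.
templates:
- name: suspend
  suspend: {}
```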

playbook/README_zh.md

Lines changed: 23 additions & 8 deletions
```diff
@@ -33,6 +33,19 @@
 | `inject-stress-list-qps` | `int` | "100" | QPS per stress-test Pod |
 | `inject-stress-total-duration` | `string` | "30s" | Total stress duration (e.g. 30s, 5m) |
 
+**Recommended Stress-Test Parameters for TKE Clusters**
+
+| Cluster Level | resource-create-object-size-bytes | resource-create-object-count | resource-create-qps | inject-stress-concurrency | inject-stress-list-qps |
+|---------------|-----------------------------------|------------------------------|---------------------|---------------------------|------------------------|
+| L5    | 10000  | 100   | 10  | 6  | 200 |
+| L50   | 10000  | 300   | 10  | 6  | 200 |
+| L100  | 50000  | 500   | 20  | 6  | 200 |
+| L200  | 100000 | 1000  | 50  | 9  | 200 |
+| L500  | 100000 | 1000  | 50  | 12 | 200 |
+| L1000 | 100000 | 3000  | 50  | 12 | 300 |
+| L3000 | 100000 | 6000  | 500 | 18 | 500 |
+| L5000 | 100000 | 10000 | 500 | 21 | 500 |
+
 **etcd Overload Protection & Enhanced APF Flow Control**
 
 On top of the community version, the Tencent Cloud TKE team has developed the following core protection features:
```
```diff
@@ -56,31 +69,33 @@
 **playbook**: `workflow/coredns-disruption-scenario.yaml`
 
 This scenario constructs a `coredns` service disruption as follows:
-1. Scale the `coredns` Deployment down to `0` replicas
-2. Hold the replica count at `0` for the specified duration
-3. Restore the original replica count
+
+1. **Pre-check**: verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster, ensuring the cluster is available for the drill
+2. **Component Shutdown**: log in to the Argo Web UI, open the `coredns-disruption-scenario` workflow, and click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the `coredns` Deployment down to `0` replicas
+3. **Service Validation**: while `coredns` is down, verify whether your workloads are affected by the outage
+4. **Component Recovery**: click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the `coredns` Deployment replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name; if empty, the drill runs against the current cluster |
 
 ## kubernetes-proxy Disruption
 
 **playbook**: `workflow/kubernetes-proxy-disruption-scenario.yaml`
 
 This scenario constructs a `kubernetes-proxy` service disruption as follows:
-1. Scale the `kubernetes-proxy` Deployment down to `0` replicas
-2. Hold the replica count at `0` for the specified duration
-3. Restore the original replica count
+
+1. **Pre-check**: verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster, ensuring the cluster is available for the drill
+2. **Component Shutdown**: log in to the Argo Web UI, open the `kubernetes-proxy-disruption-scenario` workflow, and click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to scale the `kubernetes-proxy` Deployment down to `0` replicas
+3. **Service Validation**: while `kubernetes-proxy` is down, verify whether your workloads are affected by the outage
+4. **Component Recovery**: click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to restore the `kubernetes-proxy` Deployment replica count
 
 **Parameters**
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `disruption-duration` | `string` | `30s` | Disruption duration (e.g. 30s, 5m) |
 | `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Target cluster kubeconfig secret name; if empty, the drill runs against the current cluster |
 
 ## Namespace Deletion Protection
```

playbook/all-in-one-template.yaml

Lines changed: 3 additions & 3 deletions
```diff
@@ -1392,8 +1392,6 @@ spec:
     - name: main
       inputs:
         parameters:
-        - name: disruption-duration
-          description: "Service disruption duration"
         - name: workload-type
           description: "Workload type to test; options: daemonset/deployment/statefulset"
         - name: workload-name
@@ -1409,6 +1407,8 @@ spec:
           default: "tke-chaos-test"
           description: "Namespace of the pre-check ConfigMap"
       steps:
+      - - name: suspend-1
+          template: suspend
       - - name: precheck
           arguments:
             parameters:
@@ -1437,7 +1437,7 @@ spec:
           - name: kubeconfig-secret-name
             value: "{{inputs.parameters.kubeconfig-secret-name}}"
           template: scale-down-workload
-      - - name: suspend
+      - - name: suspend-2
           template: suspend
      - - name: scale-up-workload
          arguments:
```
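Condensed, the reordered `main` template now gates both the start of the drill and the recovery behind manual resumes. A sketch of the resulting step order, with argument lists elided; the `precheck` and `scale-up-workload` template names are assumptions, the rest appear in the diff:

```yaml
# Sketch: step order in the main template after this change.
steps:
- - name: suspend-1             # new: operator confirms before the drill starts
    template: suspend
- - name: precheck              # verify the pre-check ConfigMap exists
    template: precheck          # template name assumed
- - name: scale-down-workload   # replicas -> 0
    template: scale-down-workload
- - name: suspend-2             # renamed from "suspend": operator confirms recovery
    template: suspend
- - name: scale-up-workload     # restore the original replica count
    template: scale-up-workload # template name assumed
```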

playbook/template/workload-disruption-template.yaml

Lines changed: 3 additions & 3 deletions
```diff
@@ -13,8 +13,6 @@ spec:
   - name: main
     inputs:
       parameters:
-      - name: disruption-duration
-        description: "Service disruption duration"
       - name: workload-type
         description: "Workload type to test; options: daemonset/deployment/statefulset"
       - name: workload-name
@@ -30,6 +28,8 @@ spec:
         default: "tke-chaos-test"
         description: "Namespace of the pre-check ConfigMap"
     steps:
+    - - name: suspend-1
+        template: suspend
     - - name: precheck
        arguments:
          parameters:
@@ -58,7 +58,7 @@ spec:
        - name: kubeconfig-secret-name
          value: "{{inputs.parameters.kubeconfig-secret-name}}"
        template: scale-down-workload
-    - - name: suspend
+    - - name: suspend-2
        template: suspend
    - - name: scale-up-workload
      arguments:
```

playbook/workflow/coredns-disruption-scenario.yaml

Lines changed: 0 additions & 2 deletions
```diff
@@ -11,8 +11,6 @@ spec:
   serviceAccountName: tke-chaos
   arguments:
     parameters:
-    - name: disruption-duration
-      value: "30s"
     - name: workload-type
       value: "deployment"
     - name: workload-name
```

playbook/workflow/kubernetes-proxy-disruption-scenario.yaml

Lines changed: 0 additions & 2 deletions
```diff
@@ -11,8 +11,6 @@ spec:
   serviceAccountName: tke-chaos
   arguments:
     parameters:
-    - name: disruption-duration
-      value: "30s"
     - name: workload-type
       value: "deployment"
     - name: workload-name
```
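With `disruption-duration` gone, both scenario workflows carry only the remaining arguments, and the outage window is bounded entirely by when the operator resumes `suspend-1` and `suspend-2`. A minimal sketch of the resulting parameter block; the `workload-name` value is elided in the diff and assumed here:

```yaml
# Sketch: scenario arguments after removing disruption-duration.
# The disruption window is now controlled by manually resuming the
# suspend-1 and suspend-2 nodes rather than by a fixed duration.
spec:
  serviceAccountName: tke-chaos
  arguments:
    parameters:
    - name: workload-type
      value: "deployment"
    - name: workload-name
      value: "coredns"   # assumed; actual value not shown in the diff
```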
