Commit 594fd36

Merge pull request #9 from SQxiaoxiaomeng/apiserver-shutdown
add tke managed cluster master component shutdown scenario.
2 parents 9ad6fd5 + 863fbd0 commit 594fd36

13 files changed: 1218 additions & 12 deletions

playbook/README.md

Lines changed: 43 additions & 6 deletions
@@ -112,11 +112,48 @@ TODO
 ## TKE Self-maintenance of Master cluster's kube-scheduler Disruption
 TODO

-## Managed Cluster's kube-apiserver Disruption
-TODO
+## Managed Cluster Master Component Disruption

-## Managed Cluster's kube-controller-manager Disruption
-TODO
+**Playbooks**:
+1. kube-apiserver disruption: `workflow/managed-cluster-apiserver-shutdown-scenario.yaml`
+2. kube-controller-manager disruption: `workflow/managed-cluster-controller-manager-shutdown-scenario.yaml`
+3. kube-scheduler disruption: `workflow/managed-cluster-scheduler-shutdown-scenario.yaml`

-## Managed Cluster's kube-scheduler Disruption
-TODO
+This scenario disrupts the master components of a managed cluster through the Tencent Cloud API. The workflow runs as follows:
+
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster, which marks the cluster as ready for testing
+2. **Component Shutdown**: Log in to the Argo Web UI and click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node; the workflow then calls the Tencent Cloud API to stop the master component
+3. **Status Verification**: After a 20-second delay, check the master component status to confirm that the shutdown succeeded
+4. **Business Verification**: While the apiserver is shut down, verify whether your workloads are affected
+5. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node; the workflow then calls the Tencent Cloud API to restore the master component
+6. **Final Verification**: After another 20-second delay, check the component status again to confirm that the restoration succeeded; this ends the test
+
+**Atomic Operations Library**
+
+The `workflow/managed-cluster-master-component/` directory contains atomic operations for command-line environments. Each YAML file corresponds to one minimal operation (e.g. shutdown or recovery) and has no UI dependency.
+
+In Kubernetes environments without Argo Web UI access, run the component tests directly from the CLI. Example for the apiserver:
+
+1. apiserver shutdown:
+```bash
+kubectl create -f workflow/managed-cluster-master-component/shutdown-apiserver.yaml
+```
+
+2. apiserver recovery:
+```bash
+kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.yaml
+```
+
+**Parameters**
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou`; see the [Region List](https://www.tencentcloud.com/document/product/213/6091?lang=en&pg=) |
+| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID; obtain it from [API Key Management](https://console.cloud.tencent.com/cam/capi) |
+| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
+| `cluster-id` | `string` | `<CLUSTER_ID>` | Target cluster ID |
+| `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Name of the Secret that contains the target cluster's kubeconfig |
+
+**Notes**
+1. The test affects master component availability while it runs
+2. Run it in a non-production environment or during a maintenance window
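The two CLI commands in the README above could be chained into one drill. The sketch below is only an illustration: the function name `run_apiserver_drill` and the `KUBECTL` / `DRILL_DIR` variables are hypothetical and not part of the repository. It also assumes that pausing between the two submissions is an acceptable stand-in for the manual business verification step; `kubectl create` only submits each workflow and does not wait for it to finish.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# KUBECTL and DRILL_DIR are overridable; KUBECTL=echo gives a dry run.
KUBECTL="${KUBECTL:-kubectl}"
DRILL_DIR="${DRILL_DIR:-workflow/managed-cluster-master-component}"

# run_apiserver_drill [HOLD_SECONDS]
# Submits the shutdown workflow, waits HOLD_SECONDS (default 20),
# then submits the restore workflow.
run_apiserver_drill() {
  "$KUBECTL" create -f "$DRILL_DIR/shutdown-apiserver.yaml" || return 1
  sleep "${1:-20}"
  "$KUBECTL" create -f "$DRILL_DIR/restore-apiserver.yaml"
}
```

Calling `run_apiserver_drill 60` would keep the apiserver down for roughly a minute between the two submissions. Aborting when the shutdown submission fails avoids submitting a restore for a drill that never started.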

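playbook/README_zh.md

Lines changed: 46 additions & 6 deletions
@@ -112,11 +112,51 @@ TODO
 ## TKE Self-maintenance of Master cluster's kube-scheduler Disruption
 TODO

-## Managed Cluster's kube-apiserver Component Disruption
-TODO
+## Managed Cluster Master Component Disruption

-## Managed Cluster's kube-controller-manager Disruption
-TODO
+**playbook**
+1. kube-apiserver shutdown & recovery: `workflow/managed-cluster-apiserver-shutdown-scenario.yaml`
+2. kube-controller-manager shutdown & recovery: `workflow/managed-cluster-controller-manager-shutdown-scenario.yaml`
+3. kube-scheduler shutdown & recovery: `workflow/managed-cluster-scheduler-shutdown-scenario.yaml`

-## Managed Cluster's kube-scheduler Disruption
-TODO
+This scenario uses the Tencent Cloud API to run a shutdown drill against the `master` components of a managed cluster. The main steps are:
+
+1. **Pre-check**: Verify that the `tke-chaos-test/tke-chaos-precheck-resource` ConfigMap exists in the target cluster, to make sure the cluster can be used for the drill
+2. **Component Shutdown**: Log in to the Argo Web UI and click the `RESUME` button under the `SUMMARY` tab of the `suspend-1` node to call the Tencent Cloud API and stop the `master` component
+3. **Status Verification**: After a 20-second delay, check the `master` status to make sure the component was shut down
+4. **Business Verification**: While the `apiserver` is shut down, you can verify whether your workloads are affected by the `apiserver` outage
+5. **Component Recovery**: Click the `RESUME` button under the `SUMMARY` tab of the `suspend-2` node to call the Tencent Cloud API and restore the `master` component
+6. **Final Verification**: After a 20-second delay, check the component status again to make sure the recovery succeeded; the drill then ends
+
+**Atomic Operations Library**
+
+The `workflow/managed-cluster-master-component/` directory is the atomic operations library for `Master` component shutdown drills. It is designed for command-line environments and provides independent, reversible control units. Each `YAML` file corresponds to one minimal operation (such as shutdown or recovery) and depends on neither a UI nor complex orchestration.
+
+If your Kubernetes environment cannot reach the Argo Web UI, you can run the component drills from the command line by invoking the atomic workflows directly. Taking the `apiserver` shutdown drill as an example:
+
+1. Shut down the apiserver component
+```bash
+kubectl create -f workflow/managed-cluster-master-component/shutdown-apiserver.yaml
+```
+
+2. Restore the apiserver component
+```bash
+kubectl create -f workflow/managed-cluster-master-component/restore-apiserver.yaml
+```
+
+**Parameters**
+
+You need to set the `region`, `secret-id`, `secret-key`, and `cluster-id` parameters in the drill `YAML` file. The parameters are described below:
+
+| Parameter | Type | Default | Description |
+|---------|------|--------|------|
+| `region` | `string` | `<REGION>` | Tencent Cloud region, e.g. `ap-guangzhou`; see the [region list](https://www.tencentcloud.com/zh/document/product/213/6091) |
+| `secret-id` | `string` | `<SECRET_ID>` | Tencent Cloud API secret ID; obtain it from [API Key Management](https://console.cloud.tencent.com/cam/capi) in the console |
+| `secret-key` | `string` | `<SECRET_KEY>` | Tencent Cloud API secret key |
+| `cluster-id` | `string` | `<CLUSTER_ID>` | ID of the cluster under test |
+| `kubeconfig-secret-name` | `string` | `dest-cluster-kubeconfig` | Name of the Secret that holds the target cluster's kubeconfig |
+
+**Notes**
+
+1. The drill affects the availability of the cluster's `master` components while it runs
+2. Run it in a non-production environment or during a maintenance window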
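The parameter note above asks for the placeholder values to be edited in the scenario file. One way to avoid editing the file in place is to substitute the placeholders on the fly. The function name `render_scenario` is hypothetical, and the sketch assumes the values contain no `|`, `&`, or backslash characters, which `sed` would interpret.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# render_scenario REGION SECRET_ID SECRET_KEY CLUSTER_ID
# Reads a scenario YAML on stdin and writes it to stdout with the
# <REGION>, <SECRET_ID>, <SECRET_KEY> and <CLUSTER_ID> placeholders replaced.
# Assumption: the values contain no '|', '&' or backslash characters.
render_scenario() {
  sed -e "s|<REGION>|$1|g" \
      -e "s|<SECRET_ID>|$2|g" \
      -e "s|<SECRET_KEY>|$3|g" \
      -e "s|<CLUSTER_ID>|$4|g"
}
```

The rendered manifest can then be piped straight into `kubectl create -f -`, for example `render_scenario ap-guangzhou "$SECRET_ID" "$SECRET_KEY" cls-12345678 < workflow/managed-cluster-apiserver-shutdown-scenario.yaml | kubectl create -f -`, so the credentials never land in the working tree.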

playbook/all-in-one-template.yaml

Lines changed: 49 additions & 0 deletions
@@ -340,6 +340,55 @@ spec:
 - key: config
   path: config

+---
+# Description: Template for interacting with Tencent Cloud TKE API
+# API Documentation: https://cloud.tencent.com/document/api
+#
+# Parameters:
+#   args: JSON format parameters for TKE API call. Required fields:
+#     - secretId: Tencent Cloud API secret ID (Manage at: https://console.cloud.tencent.com/cam/capi)
+#     - secretKey: Tencent Cloud API secret key
+#     - region: Cloud region (e.g. ap-guangzhou)
+#     - clusterId: TKE cluster ID
+#     - component: Kubernetes component name (e.g. kube-apiserver, kube-controller-manager, kube-scheduler)
+#     - action: API action name (e.g. describe, shutdown, restore)
+#
+# Example args value:
+# {
+#   "secretId": "<SECRET_ID>",
+#   "secretKey": "<SECRET_KEY>",
+#   "region": "ap-qingyuan",
+#   "clusterId": "cls-12345678",
+#   "component": "kube-apiserver",
+#   "action": "describe"
+# }
+apiVersion: argoproj.io/v1alpha1
+kind: ClusterWorkflowTemplate
+metadata:
+  name: tke-master-manager-template
+spec:
+  entrypoint: caller
+  templates:
+  - name: caller
+    inputs:
+      parameters:
+      - name: image
+        default: "ccr.ccs.tencentyun.com/tkeimages/tke-chaos:v0.0.2"
+      - name: args
+    container:
+      image: "{{inputs.parameters.image}}"
+      command:
+      - /kubestress
+      - mastermanager
+      - --provider=tke
+      - --args={{inputs.parameters.args}}
+    outputs:
+      parameters:
+      - name: response
+        valueFrom:
+          default: "null"
+          path: /tmp/response.txt
+
 ---
 # Description: WeCom group notification template
 # generate-apiserver-overload-test-notify-message: generates the Markdown message
Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+---
+# Description: Template for interacting with Tencent Cloud TKE API
+# API Documentation: https://cloud.tencent.com/document/api
+#
+# Parameters:
+#   args: JSON format parameters for TKE API call. Required fields:
+#     - secretId: Tencent Cloud API secret ID (Manage at: https://console.cloud.tencent.com/cam/capi)
+#     - secretKey: Tencent Cloud API secret key
+#     - region: Cloud region (e.g. ap-guangzhou)
+#     - clusterId: TKE cluster ID
+#     - component: Kubernetes component name (e.g. kube-apiserver, kube-controller-manager, kube-scheduler)
+#     - action: API action name (e.g. describe, shutdown, restore)
+#
+# Example args value:
+# {
+#   "secretId": "<SECRET_ID>",
+#   "secretKey": "<SECRET_KEY>",
+#   "region": "ap-qingyuan",
+#   "clusterId": "cls-12345678",
+#   "component": "kube-apiserver",
+#   "action": "describe"
+# }
+apiVersion: argoproj.io/v1alpha1
+kind: ClusterWorkflowTemplate
+metadata:
+  name: tke-master-manager-template
+spec:
+  entrypoint: caller
+  templates:
+  - name: caller
+    inputs:
+      parameters:
+      - name: image
+        default: "ccr.ccs.tencentyun.com/tkeimages/tke-chaos:v0.0.2"
+      - name: args
+    container:
+      image: "{{inputs.parameters.image}}"
+      command:
+      - /kubestress
+      - mastermanager
+      - --provider=tke
+      - --args={{inputs.parameters.args}}
+    outputs:
+      parameters:
+      - name: response
+        valueFrom:
+          default: "null"
+          path: /tmp/response.txt
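The template above takes a single `args` parameter holding a JSON object with six fields. A small helper could assemble that object so the field names stay consistent across calls. The function name `build_master_args` is hypothetical, and the sketch assumes no value contains a double quote or backslash, since it does no JSON escaping.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# build_master_args SECRET_ID SECRET_KEY REGION CLUSTER_ID COMPONENT ACTION
# Prints the JSON object expected by the template's `args` parameter.
# Assumption: no value contains '"' or '\' (no JSON escaping is done).
build_master_args() {
  printf '{"secretId":"%s","secretKey":"%s","region":"%s","clusterId":"%s","component":"%s","action":"%s"}\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}
```

For example, `build_master_args "$SECRET_ID" "$SECRET_KEY" ap-guangzhou cls-12345678 kube-apiserver describe` prints a payload equivalent to the example in the template's header comment, apart from the region value.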
Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
+---
+# Description: TKE Managed Cluster kube-apiserver Shutdown Test Scenario
+#
+# This workflow simulates and tests the kube-apiserver outage scenario with the following steps:
+# 1. Performing pre-checks
+# 2. Shutting down the kube-apiserver component
+# 3. Verifying the shutdown status
+# 4. Restoring the kube-apiserver
+# 5. Verifying the restoration
+apiVersion: argoproj.io/v1alpha1
+kind: Workflow
+metadata:
+  labels:
+    apiserver-shutdown-scenario: "true"
+  name: apiserver-shutdown-scenario
+  namespace: tke-chaos-test
+spec:
+  entrypoint: main
+  serviceAccountName: tke-chaos
+  arguments:
+    parameters:
+    - name: region # Tencent Cloud region (e.g. ap-qingyuan)
+      value: "<REGION>"
+    - name: cluster-id # Cluster ID
+      value: "<CLUSTER_ID>"
+    - name: secret-id # Tencent Cloud API secret ID
+      value: "<SECRET_ID>"
+    - name: secret-key # Tencent Cloud API secret key
+      value: "<SECRET_KEY>"
+    - name: kubeconfig-secret-name # Secret name containing target cluster's kubeconfig
+      value: "dest-cluster-kubeconfig"
+    - name: precheck-configmap-name # ConfigMap name for pre-check validation
+      value: "tke-chaos-precheck-resource"
+    - name: precheck-configmap-namespace # Namespace of pre-check ConfigMap
+      value: "tke-chaos-test"
+  templates:
+  - name: main
+    steps:
+    - - name: precheck
+        arguments:
+          parameters:
+          - name: kubeconfig-secret-name
+            value: "{{workflow.parameters.kubeconfig-secret-name}}"
+          - name: precheck-configmap-name
+            value: "{{workflow.parameters.precheck-configmap-name}}"
+          - name: precheck-configmap-namespace
+            value: "{{workflow.parameters.precheck-configmap-namespace}}"
+          - name: source
+            value: |
+              kubectl get -n {{workflow.parameters.precheck-configmap-namespace}} configmap {{workflow.parameters.precheck-configmap-name}}
+        templateRef:
+          name: kubectl-cmd
+          template: kubectl-script
+          clusterScope: true
+    - - name: suspend-1
+        template: suspend
+    - - name: shutdown-apiserver
+        arguments:
+          parameters:
+          - name: args
+            value: |
+              {
+                "secretId": "{{workflow.parameters.secret-id}}",
+                "secretKey": "{{workflow.parameters.secret-key}}",
+                "region": "{{workflow.parameters.region}}",
+                "clusterId": "{{workflow.parameters.cluster-id}}",
+                "component": "kube-apiserver",
+                "action": "shutdown"
+              }
+        templateRef:
+          name: tke-master-manager-template
+          template: caller
+          clusterScope: true
+    - - name: delay-1
+        template: delay
+        arguments:
+          parameters:
+          - name: duration
+            value: "20s"
+    - - name: get-apiserver-status-1
+        arguments:
+          parameters:
+          - name: args
+            value: |
+              {
+                "secretId": "{{workflow.parameters.secret-id}}",
+                "secretKey": "{{workflow.parameters.secret-key}}",
+                "region": "{{workflow.parameters.region}}",
+                "clusterId": "{{workflow.parameters.cluster-id}}",
+                "component": "kube-apiserver",
+                "action": "describe"
+              }
+        templateRef:
+          name: tke-master-manager-template
+          template: caller
+          clusterScope: true
+    - - name: suspend-2
+        template: suspend
+    - - name: restore-apiserver
+        arguments:
+          parameters:
+          - name: args
+            value: |
+              {
+                "secretId": "{{workflow.parameters.secret-id}}",
+                "secretKey": "{{workflow.parameters.secret-key}}",
+                "region": "{{workflow.parameters.region}}",
+                "clusterId": "{{workflow.parameters.cluster-id}}",
+                "component": "kube-apiserver",
+                "action": "restore"
+              }
+        templateRef:
+          name: tke-master-manager-template
+          template: caller
+          clusterScope: true
+    - - name: delay-2
+        template: delay
+        arguments:
+          parameters:
+          - name: duration
+            value: "20s"
+    - - name: get-apiserver-status-2
+        arguments:
+          parameters:
+          - name: args
+            value: |
+              {
+                "secretId": "{{workflow.parameters.secret-id}}",
+                "secretKey": "{{workflow.parameters.secret-key}}",
+                "region": "{{workflow.parameters.region}}",
+                "clusterId": "{{workflow.parameters.cluster-id}}",
+                "component": "kube-apiserver",
+                "action": "describe"
+              }
+        templateRef:
+          name: tke-master-manager-template
+          template: caller
+          clusterScope: true
+
+  - name: suspend
+    suspend: {}
+
+  - name: delay
+    inputs:
+      parameters:
+      - name: duration
+    suspend:
+      duration: "{{inputs.parameters.duration}}"
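The workflow above pauses at the `suspend-1` and `suspend-2` gates until someone resumes it, which the README does through the Argo Web UI. Assuming the Argo CLI is installed and pointed at the same cluster, the same gates could probably be released with `argo resume`. The wrapper name `resume_scenario` and the `ARGO` variable are hypothetical; the default workflow name and namespace are taken from the manifest's `metadata`.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# Assumption: the Argo CLI is installed; ARGO=echo gives a dry run.
ARGO="${ARGO:-argo}"

# resume_scenario [WORKFLOW_NAME] [NAMESPACE]
# Resumes the workflow at whichever suspend gate it is currently waiting on.
resume_scenario() {
  "$ARGO" resume "${1:-apiserver-shutdown-scenario}" -n "${2:-tke-chaos-test}"
}
```

The gates run in sequence, so one call would be needed at `suspend-1` to start the shutdown and a second at `suspend-2` to start the restore.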
