Commit 2cbcbb9

Merge pull request #6 from SQxiaoxiaomeng/chaos-palybook

modify workflow namespace to tke-chaos-test

2 parents: 54ee21d + 5c99cda

17 files changed: +72 / -58 lines

README.md (15 additions, 12 deletions)

````diff
@@ -22,31 +22,34 @@ This project provides Kubernetes chaos testing capabilities covering scenarios l
 
 **dest cluster**
 
-2. Create `default/tke-chaos-precheck-resource ConfigMap` in `dest cluster` as a marker for testing eligibility, and create `tke-chaos-test-ns namespace`:
+2. Create `tke-chaos-test/tke-chaos-precheck-resource ConfigMap` in `dest cluster` as a marker for testing eligibility:
 ```bash
-kubectl create -n default configmap tke-chaos-precheck-resource --from-literal=empty="" && kubectl create ns tke-chaos-test-ns
+kubectl create ns tke-chaos-test && kubectl create -n tke-chaos-test configmap tke-chaos-precheck-resource --from-literal=empty=""
 ```
 
 **src cluster**
 
 3. Obtain `dest cluster`'s internal kubeconfig from Tencent Cloud TKE Console, save to `dest-cluster-kubeconfig` file, then create secret in `src cluster`:
 ```bash
-kubectl create secret generic dest-cluster-kubeconfig --from-file=config=./dest-cluster-kubeconfig
+kubectl create ns tke-chaos-test && kubectl create -n tke-chaos-test secret generic dest-cluster-kubeconfig --from-file=config=./dest-cluster-kubeconfig
 ```
 
-4. Deploy Argo Workflow and Workflow templates in `src cluster` (skip if Argo is already deployed, [**Argo Documentation**](https://argo-workflows.readthedocs.io/en/latest/)):
+4. Clone this project and then deploy Argo Workflow in `src cluster` (skip if Argo is already deployed, [**Argo Documentation**](https://argo-workflows.readthedocs.io/en/latest/)):
 ```bash
+# Clone this project
+git clone https://github.com/tkestack/tke-chaos-playbook.git && cd tke-chaos-playbook
+
 # Deploy Argo Workflow
-kubectl create namespace tke-chaos-argo && kubectl create -f playbook/install-argo.yaml
+kubectl create -f playbook/install-argo.yaml
 
 # Verify Argo Workflow Pod status
-kubectl get po -n tke-chaos-argo
+kubectl get po -n tke-chaos-test
 ```
 
-5. Enable public access for `tke-chaos-argo/tke-chaos-argo-workflows-server Service` in Tencent Cloud TKE Console. Access Argo Server UI at `LoadBalancer IP:2746` using credentials obtained via:
+5. Enable public access for `tke-chaos-test/tke-chaos-argo-workflows-server Service` in Tencent Cloud TKE Console. Access Argo Server UI at `LoadBalancer IP:2746` using credentials obtained via:
 ```bash
 # Get Argo Server UI access token
-kubectl exec -it -n tke-chaos-argo deployment/tke-chaos-argo-workflows-server -- argo auth token
+kubectl exec -it -n tke-chaos-test deployment/tke-chaos-argo-workflows-server -- argo auth token
 ```
 
 ![Argo Server UI](./playbook/docs/argo-server-ui.png)
@@ -67,7 +70,7 @@ kubectl create -f playbook/rabc.yaml && kubectl create -f playbook/all-in-one-te
 **Core Workflow Explanation**
 
 - **Testing Configuration**: Before execution, you may need to configure parameters like `webhook-url` for notifications. Default values are provided so testings can run without modification. See [Scenario Parameters](playbook/README.md) for details.
-- **Precheck**: Before execution, `dest cluster` health is validated by checking Node and Pod health ratios. Testings are blocked if below thresholds (adjustable via `precheck-pods-health-ratio` and `precheck-nodes-health-ratio`). Also verifies existence of `default/tke-chaos-precheck-resource ConfigMap`.
+- **Precheck**: Before execution, `dest cluster` health is validated by checking Node and Pod health ratios. Testings are blocked if below thresholds (adjustable via `precheck-pods-health-ratio` and `precheck-nodes-health-ratio`). Also verifies existence of `tke-chaos-test/tke-chaos-precheck-resource ConfigMap`.
 - **Execute Testing**: During kube-apiserver overload testing, the system floods `dest cluster`'s kube-apiserver with List Pod requests to simulate high load. Monitor kube-apiserver metrics via Tencent Cloud TKE Console and observe your business Pod health during testing.
 - **Result Processing**: View testing results in Argo Server UI (recommended) or via `kubectl describe workflow {workflow-name}`.
 
@@ -103,16 +106,16 @@ kubectl delete workflow {workflow-name}
 
 2. How to track testing progress after starting?
 
-Monitor testing progress via Argo Server UI or `kubectl get workflow`. By default, testings run in the default namespace. You can also watch fault simulation Pods via `kubectl get po -w` - Error-state Pods typically indicate testing failures that can be investigated via Pod logs.
+Monitor testing progress via Argo Server UI or `kubectl get -n tke-chaos-test workflow`. By default, testings run in the `tke-chaos-test` namespace. You can also watch fault simulation Pods via `kubectl get -n tke-chaos-test po -w` - Error-state Pods typically indicate testing failures that can be investigated via Pod logs.
 
 3. What are common failure reasons?
 
-Typical issues include: insufficient RBAC permissions for fault simulation Pods, missing `default/tke-chaos-precheck-resource ConfigMap` in target cluster, missing `tke-chaos-test-ns namespace`, or Argo workflow controller anomalies. Check fault simulation Pod or Argo Workflow Controller logs for details.
+Typical issues include: insufficient RBAC permissions for fault simulation Pods, missing `tke-chaos-test/tke-chaos-precheck-resource ConfigMap` in target cluster, missing `tke-chaos-test namespace`, or Argo workflow controller anomalies. Check fault simulation Pod or Argo Workflow Controller logs for details.
 
 4. How to troubleshoot Argo Workflow Controller issues?
 
 When workflows show no status after creation via `kubectl get workflow`, the Argo workflow-controller is likely malfunctioning. Check controller logs via:
 ```bash
-kubectl logs -n tke-chaos-argo deployment/tke-chaos-argo-workflows-workflow-controller --tail 50 -f
+kubectl logs -n tke-chaos-test deployment/tke-chaos-argo-workflows-workflow-controller --tail 50 -f
 ```
 Many cases involve insufficient RBAC permissions - modify the corresponding ClusterRole to add required permissions.
````
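The setup steps changed above (README.md steps 2-4) can be sketched as one sequence under the new single `tke-chaos-test` namespace. This is a hypothetical consolidation, not a script shipped by the repo; `KUBECTL` defaults to `echo kubectl` so the flow can be dry-run without a live cluster (in practice the dest-cluster and src-cluster commands run against different kubeconfigs):

```shell
#!/bin/sh
# Dry-run sketch of the post-change setup flow. Set KUBECTL=kubectl to apply
# for real; by default each command is only echoed.
KUBECTL="${KUBECTL:-echo kubectl}"
NS=tke-chaos-test   # the single namespace used everywhere after this commit

# dest cluster: namespace plus the precheck marker ConfigMap
$KUBECTL create ns "$NS"
$KUBECTL create -n "$NS" configmap tke-chaos-precheck-resource --from-literal=empty=""

# src cluster: namespace, dest-cluster kubeconfig secret, then Argo Workflow
$KUBECTL create ns "$NS"
$KUBECTL create -n "$NS" secret generic dest-cluster-kubeconfig --from-file=config=./dest-cluster-kubeconfig
$KUBECTL create -f playbook/install-argo.yaml
$KUBECTL get po -n "$NS"
```

Running it unmodified just prints the kubectl invocations, which is a cheap way to review the sequence before pointing it at a cluster.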

README_zh.md (15 additions, 12 deletions)

````diff
@@ -22,31 +22,34 @@
 
 **Target cluster**
 
-2. In the `target cluster`, create the `default/tke-chaos-precheck-resource ConfigMap`, which marks the `target cluster` as eligible for chaos testing, and also create the `tke-chaos-test-ns namespace` in the `target cluster`:
+2. In the `target cluster`, create the `tke-chaos-test/tke-chaos-precheck-resource ConfigMap`, which marks the `target cluster` as eligible for chaos testing:
 ```bash
-kubectl create -n default configmap tke-chaos-precheck-resource --from-literal=empty="" && kubectl create ns tke-chaos-test-ns
+kubectl create ns tke-chaos-test && kubectl create -n tke-chaos-test configmap tke-chaos-precheck-resource --from-literal=empty=""
 ```
 
 **Source cluster**
 
 3. Obtain the `target cluster`'s internal-network `kubeconfig` credentials from the Tencent Cloud `TKE Console`, write them to the `dest-cluster-kubeconfig` file, then run the following in the `source cluster` to create a `secret` holding the `target cluster`'s `kubeconfig`:
 ```bash
-kubectl create secret generic dest-cluster-kubeconfig --from-file=config=./dest-cluster-kubeconfig
+kubectl create ns tke-chaos-test && kubectl create -n tke-chaos-test secret generic dest-cluster-kubeconfig --from-file=config=./dest-cluster-kubeconfig
 ```
 
-4. Deploy `Argo Workflow` and the testing templates in the `source cluster` (no need to redeploy `Argo` if it is already deployed, [**Argo Documentation**](https://argo-workflows.readthedocs.io/en/latest/)):
+4. Clone this project and deploy `Argo Workflow` in the `source cluster` (no need to redeploy `Argo` if it is already deployed, [**Argo Documentation**](https://argo-workflows.readthedocs.io/en/latest/)):
 ```bash
+# Clone the project
+git clone https://github.com/tkestack/tke-chaos-playbook.git && cd tke-chaos-playbook
+
 # Deploy Argo Workflow
-kubectl create namespace tke-chaos-argo && kubectl create -f playbook/install-argo.yaml
+kubectl create -f playbook/install-argo.yaml
 
 # Verify that the Argo Workflow Pods are running
-kubectl get po -n tke-chaos-argo
+kubectl get po -n tke-chaos-test
 ```
 
-5. In the Tencent Cloud `TKE Console`, enable public access for the `tke-chaos-argo/tke-chaos-argo-workflows-server Service`, open `LoadBalancer IP:2746` in a browser, and log in to the `Argo UI` with the access token obtained via the command below; the `Argo UI` shows the details of testing workflows.
+5. In the Tencent Cloud `TKE Console`, enable public access for the `tke-chaos-test/tke-chaos-argo-workflows-server Service`, open `LoadBalancer IP:2746` in a browser, and log in to the `Argo UI` with the access token obtained via the command below; the `Argo UI` shows the details of testing workflows.
 ```bash
 # Get the Argo Server UI access token
-kubectl exec -it -n tke-chaos-argo deployment/tke-chaos-argo-workflows-server -- argo auth token
+kubectl exec -it -n tke-chaos-test deployment/tke-chaos-argo-workflows-server -- argo auth token
 ```
 
 ![Argo Server UI](./playbook/docs/argo-server-ui.png)
@@ -68,7 +71,7 @@ kubectl create -f playbook/rabc.yaml && kubectl create -f playbook/all-in-one-te
 **Core workflow explanation**
 
 - **Testing configuration**: Before running a test you may need to set some parameters, e.g. `webhook-url` for WeCom group notifications. All parameters have defaults, so tests can run without modification. See [Scenario parameter reference](playbook/README.md).
-- **Pre-check**: Before a test runs, the `target cluster` is health-checked: the health ratios of its `Node`s and `Pod`s are verified and the test is blocked below threshold, adjustable via the `precheck-pods-health-ratio` and `precheck-nodes-health-ratio` parameters. The check also verifies that the `default/tke-chaos-precheck-resource ConfigMap` exists in the `target cluster`; if not, the test is blocked.
+- **Pre-check**: Before a test runs, the `target cluster` is health-checked: the health ratios of its `Node`s and `Pod`s are verified and the test is blocked below threshold, adjustable via the `precheck-pods-health-ratio` and `precheck-nodes-health-ratio` parameters. The check also verifies that the `tke-chaos-test/tke-chaos-precheck-resource ConfigMap` exists in the `target cluster`; if not, the test is blocked.
 - **Execute testing**: During the `kube-apiserver` overload test, a flood of `List Pod` requests is sent to the `target cluster`'s `kube-apiserver` to simulate high load. You can watch `kube-apiserver` load in the `target cluster`'s core-component monitoring in the Tencent Cloud `TKE Console`. You should also watch your business Pods' health during the test to verify whether `kube-apiserver` overload affects your workloads.
 - **Testing results**: View results in the `Argo Server UI` (recommended) or via `kubectl describe workflow {workflow-name}`.
 
@@ -105,12 +108,12 @@ kubectl delete worflow {workflow-name}
 
 2. After a test starts, how do I know which step it has reached?
 
-View the workflow in the `Argo Server UI`, or check its status with `kubectl get workflow`. Tests run in the default namespace by default. You can also watch the testing `Pod`s with `kubectl get po -w`; an `Error`-state `Pod` usually means the test failed, and you can inspect that `Pod`'s logs to troubleshoot.
+View the workflow in the `Argo Server UI`, or check its status with `kubectl get -n tke-chaos-test workflow`. Tests run in the `tke-chaos-test` namespace. You can also watch the testing `Pod`s with `kubectl get -n tke-chaos-test po -w`; an `Error`-state `Pod` usually means the test failed, and you can inspect that `Pod`'s logs to troubleshoot.
 
 3. What are the specific causes of test failures?
 
-Common errors include: insufficient `RBAC` permissions for the testing Pods, a missing `default/tke-chaos-precheck-resource ConfigMap` (used for validation) in the cluster under test, a missing `tke-chaos-test-ns namespace` (used for resource creation) in the cluster under test, and `Argo workflow controller` anomalies. Check the logs of the testing `Pod`s or the `Argo Workflow Controller` to troubleshoot.
+Common errors include: insufficient `RBAC` permissions for the testing Pods, a missing `tke-chaos-test/tke-chaos-precheck-resource ConfigMap` (used for validation) in the cluster under test, a missing `tke-chaos-test namespace` (used for resource creation) in the cluster under test, and `Argo workflow controller` anomalies. Check the logs of the testing `Pod`s or the `Argo Workflow Controller` to troubleshoot.
 
 4. How to troubleshoot `Argo Workflow Controller` anomalies?
 
-If a workflow shows no status in `kubectl get workflow` after creation, the `Argo workflow-controller` is most likely not working. Check its errors via `kubectl logs -n tke-chaos-argo deployment/tke-chaos-argo-workflows-workflow-controller --tail 50 -f`; in many cases the cause is insufficient RBAC permissions, and you should add the required resource permissions to the corresponding ClusterRole.
+If a workflow shows no status in `kubectl get workflow` after creation, the `Argo workflow-controller` is most likely not working. Check its errors via `kubectl logs -n tke-chaos-test deployment/tke-chaos-argo-workflows-workflow-controller --tail 50 -f`; in many cases the cause is insufficient RBAC permissions, and you should add the required resource permissions to the corresponding ClusterRole.
````

playbook/README.md (1 addition, 1 deletion)

````diff
@@ -7,7 +7,7 @@
 **playbook**: `workflow/apiserver-overload-scenario.yaml`
 
 This scenario simulates high load on `kube-apiserver` with the following workflow:
-- **Pre-check**: Performs health checks on the target cluster, verifying the health ratio of Nodes and Pods. If below threshold, the test will be aborted. You can adjust thresholds via `precheck-pods-health-ratio` and `precheck-nodes-health-ratio` parameters. Also checks for existence of `default/tke-chaos-precheck-resource ConfigMap`.
+- **Pre-check**: Performs health checks on the target cluster, verifying the health ratio of Nodes and Pods. If below threshold, the test will be aborted. You can adjust thresholds via `precheck-pods-health-ratio` and `precheck-nodes-health-ratio` parameters. Also checks for existence of `tke-chaos-test/tke-chaos-precheck-resource ConfigMap`.
 - **Resource Warm-up**: Creates resources (`pods/configmaps`) to simulate production environment scale.
 - **Fault Injection**: Floods apiserver with `list pod/configmaps` requests to simulate high load.
 - **Cleanup**: Cleans up resources created during the test.
````
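The pre-check's health-ratio gate described above can be sketched as a small shell helper. This is a hypothetical illustration, not the playbook's actual implementation; it assumes `kubectl get po -A --no-headers`-style input where STATUS is the fourth column, and the 0.9 default mirrors `precheck-pods-health-ratio`:

```shell
# pods_health_ratio "<pod listing>" [threshold]
# Prints the healthy ratio and exits non-zero when it is below the threshold.
pods_health_ratio() {
  printf '%s\n' "$1" | awk -v thr="${2:-0.9}" '
    NF { total++; if ($4 == "Running" || $4 == "Completed") healthy++ }
    END { ratio = (total ? healthy / total : 1); print ratio
          exit (ratio >= thr ? 0 : 1) }'
}

# Canned example: 3 of 4 pods healthy -> ratio 0.75, below the 0.9 threshold,
# so the gate would block the test.
sample="ns1 a 1/1 Running 0 2d
ns1 b 1/1 Running 0 2d
ns2 c 0/1 CrashLoopBackOff 7 2d
ns2 d 1/1 Completed 0 2d"
pods_health_ratio "$sample" 0.9 || echo "precheck would block the test"
```

In a live check the first argument would come from `kubectl get po -A --no-headers` against the target cluster; the node-ratio gate works the same way over `kubectl get nodes`.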

playbook/README_zh.md (1 addition, 1 deletion)

````diff
@@ -7,7 +7,7 @@
 **playbook**: `workflow/apiserver-overload-scenario.yaml`
 
 This scenario creates high `kube-apiserver` load; the main steps are:
-- **Pre-check**: Health-checks the `target cluster` under test, verifying the health ratios of its `Node`s and `Pod`s; the test is blocked below threshold, adjustable via the `precheck-pods-health-ratio` and `precheck-nodes-health-ratio` parameters. Also verifies that the `default/tke-chaos-precheck-resource ConfigMap` exists in the `target cluster`; if not, the test is blocked.
+- **Pre-check**: Health-checks the `target cluster` under test, verifying the health ratios of its `Node`s and `Pod`s; the test is blocked below threshold, adjustable via the `precheck-pods-health-ratio` and `precheck-nodes-health-ratio` parameters. Also verifies that the `tke-chaos-test/tke-chaos-precheck-resource ConfigMap` exists in the `target cluster`; if not, the test is blocked.
 - **Resource warm-up**: Creates resources (`pods/configmaps`) in the cluster to simulate production-scale resource counts.
 - **Fault injection**: Floods the apiserver with `list pod/configmaps` requests to simulate high `kube-apiserver` load.
 - **Cleanup**: Cleans up resources created during the test after it completes.
````

playbook/all-in-one-template.yaml (3 additions, 3 deletions)

````diff
@@ -522,7 +522,7 @@ spec:
   default: "tke-chaos-precheck-resource"
   description: "Name of the ConfigMap to check in the cluster under test"
 - name: check-configmap-namespace
-  default: "default"
+  default: "tke-chaos-test"
   description: "Namespace of the ConfigMap to check in the cluster under test"
 - name: pods-health-ratio
   default: "0.9"
@@ -719,7 +719,7 @@ spec:
   default: "tke-chaos-precheck-resource"
   description: "ConfigMap name checked by the pre-check"
 - name: check-configmap-namespace
-  default: "default"
+  default: "tke-chaos-test"
   description: "ConfigMap namespace checked by the pre-check"
 - name: pods-health-ratio
   default: "0.9"
@@ -733,7 +733,7 @@ spec:
   default: "ccr.ccs.tencentyun.com/tkeimages/tke-chaos:v0.0.1"
   description: "Image of the resource-creation tool"
 - name: resource-create-namespace
-  default: "tke-chaos-test-ns"
+  default: "tke-chaos-test"
   description: "Namespace in which resources are created"
 - name: resource-create-object-type
   default: "pods"
````
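The renamed defaults above are only defaults; they can still be overridden per run with the Argo CLI's `-p name=value` flags. A hypothetical invocation (the parameter names come from the template above, but the exact submit command is an assumption; `ARGO` defaults to `echo argo` so it can be dry-run without the CLI installed):

```shell
# Dry-run sketch of overriding the template's namespace parameters at submit
# time. Set ARGO=argo to submit for real.
ARGO="${ARGO:-echo argo}"
cmd=$($ARGO submit -n tke-chaos-test playbook/all-in-one-template.yaml \
  -p check-configmap-namespace=tke-chaos-test \
  -p resource-create-namespace=tke-chaos-test)
echo "$cmd"
```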

playbook/install-argo.yaml (13 additions, 13 deletions)

````diff
@@ -4,7 +4,7 @@ apiVersion: v1
 kind: ServiceAccount
 metadata:
   name: tke-chaos-argo-workflows-workflow-controller
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
   labels:
     helm.sh/chart: argo-workflows-0.45.14
     app.kubernetes.io/name: argo-workflows-workflow-controller
@@ -19,7 +19,7 @@ apiVersion: v1
 kind: ServiceAccount
 metadata:
   name: tke-chaos-argo-workflows-server
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
   labels:
     helm.sh/chart: argo-workflows-0.45.14
     app.kubernetes.io/name: argo-workflows-server
@@ -34,7 +34,7 @@ apiVersion: v1
 kind: ConfigMap
 metadata:
   name: tke-chaos-argo-workflows-workflow-controller-configmap
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
   labels:
     helm.sh/chart: argo-workflows-0.45.14
     app.kubernetes.io/name: argo-workflows-cm
@@ -3188,7 +3188,7 @@ roleRef:
 subjects:
 - kind: ServiceAccount
   name: tke-chaos-argo-workflows-workflow-controller
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
 ---
 # Source: argo-workflows/templates/controller/workflow-controller-crb.yaml
 apiVersion: rbac.authorization.k8s.io/v1
@@ -3210,7 +3210,7 @@ roleRef:
 subjects:
 - kind: ServiceAccount
   name: tke-chaos-argo-workflows-workflow-controller
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
 ---
 # Source: argo-workflows/templates/server/server-crb.yaml
 apiVersion: rbac.authorization.k8s.io/v1
@@ -3232,7 +3232,7 @@ roleRef:
 subjects:
 - kind: ServiceAccount
   name: tke-chaos-argo-workflows-server
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
 ---
 # Source: argo-workflows/templates/server/server-crb.yaml
 apiVersion: rbac.authorization.k8s.io/v1
@@ -3254,7 +3254,7 @@ roleRef:
 subjects:
 - kind: ServiceAccount
   name: tke-chaos-argo-workflows-server
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
 ---
 # Source: argo-workflows/templates/controller/workflow-role.yaml
 apiVersion: rbac.authorization.k8s.io/v1
@@ -3292,7 +3292,7 @@ metadata:
     app: workflow-controller
     app.kubernetes.io/managed-by: Helm
     app.kubernetes.io/part-of: argo-workflows
-  namespace: tke-chaos-argo
+  namespace: tke-chaos-test
 rules:
 - apiGroups:
   - argoproj.io
@@ -3338,22 +3338,22 @@ metadata:
     app: workflow-controller
     app.kubernetes.io/managed-by: Helm
     app.kubernetes.io/part-of: argo-workflows
-  namespace: tke-chaos-argo
+  namespace: tke-chaos-test
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: Role
   name: tke-chaos-argo-workflows-workflow
 subjects:
 - kind: ServiceAccount
   name: argo-workflow
-  namespace: tke-chaos-argo
+  namespace: tke-chaos-test
 ---
 # Source: argo-workflows/templates/server/server-service.yaml
 apiVersion: v1
 kind: Service
 metadata:
   name: tke-chaos-argo-workflows-server
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
   labels:
     helm.sh/chart: argo-workflows-0.45.14
     app.kubernetes.io/name: argo-workflows-server
@@ -3378,7 +3378,7 @@ apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: tke-chaos-argo-workflows-workflow-controller
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
   labels:
     helm.sh/chart: argo-workflows-0.45.14
     app.kubernetes.io/name: argo-workflows-workflow-controller
@@ -3466,7 +3466,7 @@ apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: tke-chaos-argo-workflows-server
-  namespace: "tke-chaos-argo"
+  namespace: "tke-chaos-test"
   labels:
     helm.sh/chart: argo-workflows-0.45.14
     app.kubernetes.io/name: argo-workflows-server
````