
Commit a7eddb5

author
Tianyu Justin Yang
committedApr 5, 2022
on computing resource limits, and workflow composition
1 parent 1029cb0 commit a7eddb5

File tree

1 file changed

+23
-0
lines changed


‎manual operations.md

+23
@@ -78,6 +78,27 @@ Neutrino and MinBias are the major categories, and MinBias is particularly large
7878

7979
To check if a workflow has secondary inputs, look for `minbias` in the error report, or check `MCPileup` on ReqMgr.
8080

81+
# workflow composition
82+
Workflows are made of **tasks**.
83+
Each task represents one step in the workflow production chain (e.g. GEN, SIM, DIGI).
84+
85+
Tasks are divided into **jobs**, which run on a batch system (e.g. condor).
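
The hierarchy above can be pictured with a small data model. This is only an illustrative sketch: the class names `Workflow`, `Task` and `Job`, their fields, and the example workflow name are hypothetical and do not mirror any ReqMgr type.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """One unit of work submitted to the batch system (hypothetical model)."""
    job_id: int


@dataclass
class Task:
    """One step of the production chain, e.g. GEN, SIM or DIGI."""
    name: str
    jobs: List[Job] = field(default_factory=list)


@dataclass
class Workflow:
    """A workflow is an ordered chain of tasks."""
    name: str
    tasks: List[Task] = field(default_factory=list)

    def total_jobs(self) -> int:
        # Sum the jobs over every task in the chain.
        return sum(len(task.jobs) for task in self.tasks)


workflow = Workflow(
    name="example_workflow",  # made-up name, for illustration only
    tasks=[
        Task("GEN", [Job(1), Job(2)]),
        Task("SIM", [Job(3), Job(4), Job(5)]),
        Task("DIGI", [Job(6)]),
    ],
)
print(workflow.total_jobs())
```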
86+
87+
# Computing Resources
88+
Each workflow can use only a limited amount of computing resources before hitting PerformanceKill walls (exit codes `50664` and `50660`).
89+
90+
To better understand how computing resource limitations apply to workflows,
91+
one first needs to understand the [[manual operations#workflow composition|components of a workflow]].
92+
93+
The limits are imposed on condor "slots".
94+
Each condor slot hosts one job from a task of the workflow.
95+
A job is killed if it exceeds the **CPU time** or **memory usage** designated for its condor slot.
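
The kill rule can be sketched as a simple per-slot check. The names `SlotLimits` and `kill_reason` and the numeric limits are hypothetical placeholders chosen for illustration; the real limits come from the workflow and site configuration and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotLimits:
    """Resource limits for one slot (hypothetical field names)."""
    max_cpu_seconds: float
    max_memory_mb: float


def kill_reason(cpu_seconds: float, memory_mb: float,
                limits: SlotLimits) -> Optional[str]:
    """Return which limit the job exceeded, or None if it stays within both."""
    if memory_mb > limits.max_memory_mb:
        return "memory usage"
    if cpu_seconds > limits.max_cpu_seconds:
        return "CPU time"
    return None


# Placeholder limits, not real production values.
limits = SlotLimits(max_cpu_seconds=3600.0, max_memory_mb=2000.0)
print(kill_reason(cpu_seconds=1200.0, memory_mb=2500.0, limits=limits))
```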
96+
To overcome these limitations, one could:
97+
- split the jobs into smaller pieces so that each one needs fewer resources to run to completion.
98+
- n.b. you can check the splitting algorithm used by a task by looking at `SplittingAlgo` on the ReqMgr/JSON page of the workflow. If it is `EventAwareLumiBased`, splitting will work; if it is `EventBased`, it won't.
99+
- increase the number of CPU cores and the amount of memory per core.
100+
- n.b. changing nCore does not work in the console as of 2022-04-05.
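
As a rough sketch of the `SplittingAlgo` check, the helper below walks a workflow description that has already been loaded as a Python dict and reports, per task, whether re-splitting is expected to help. The layout of the dict (task entries keyed `Task1`, `Task2`, each with a `TaskName` field) and the function name `splitting_report` are assumptions made for illustration; only the `SplittingAlgo` key and its two values come from the note above.

```python
from typing import Any, Dict, Optional

# Per the note above: splitting works for EventAwareLumiBased, not for EventBased.
SPLITTABLE = {"EventAwareLumiBased": True, "EventBased": False}


def splitting_report(request: Dict[str, Any]) -> Dict[str, Optional[bool]]:
    """Map each task-like entry to True, False, or None.

    None means the algorithm is not one of the two covered by the note.
    """
    report: Dict[str, Optional[bool]] = {}
    for key, value in request.items():
        # Treat any nested dict that carries a SplittingAlgo as a task entry.
        if isinstance(value, dict) and "SplittingAlgo" in value:
            label = value.get("TaskName", key)
            report[label] = SPLITTABLE.get(value["SplittingAlgo"])
    return report


# Made-up request, shaped according to the assumed layout.
example_request = {
    "RequestName": "example_workflow",
    "Task1": {"TaskName": "GEN", "SplittingAlgo": "EventBased"},
    "Task2": {"TaskName": "DIGI", "SplittingAlgo": "EventAwareLumiBased"},
}
print(splitting_report(example_request))
```

In practice the dict would come from the workflow's JSON page rather than a literal; how that page is fetched is left out on purpose.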
101+
81102
# Unified status
82103
## agentfilemismatch
83104
- a grace period of 2 days before it moves to filemismatch
@@ -96,3 +117,5 @@ To check if a workflow has secondary inputs, look for `minbias` in the error rep
96117
# Standard procedure for errors
97118
## `8021-FileReadError` and `8028-FallbackFileOpenError`
98119
In general, you should check DBS to track down the root cause. It might just be opportunistic, so ACDC without excluding the error site might already work.
120+
121+
## `50660`
