Commit bf3e522

author
Tianyu Justin Yang
committed
updates
1 parent a7eddb5 commit bf3e522

2 files changed

+60
-11
lines changed

ACDC automation ideas.md

+7-7
@@ -1,10 +1,3 @@
1-
# HLT and xrootd
2-
3-
- For a given failed task, if there is a necessary site which is in drain (marked in red and bold letters in the console):
4-
- Pre-select primary xrootd (don't touch the secondary xrootd at any step)
5-
- If either `T2_CH_CERN` or `T2_CH_CERN_HLT` is among the pre-selected sites, unselect them
6-
- It would be good to display a small message if console does this procedure in order to make it clear for new operators etc. It could be something like `There is a necessary site in drain, therefore enabling xrootd. Since xrootd is enabled, unselecting CERN and HLT since HLT cannot perform remote reads. We unselect CERN too, since jobs sent to CERN can be redirected to HLT internally`
7-
81
# filter workflows for group ACDC
92

103
Sometimes the same type of error could affect multiple workflows, and the steps for manual ACDC are identical.
@@ -24,6 +17,13 @@ for example, we could say:
2417
- run ACDC without xrootd and splitting for all workflows in assistance-manual that failed at CERN sites with error 91009
2518
- kill and clone all workflows that failed at TIFR and have < 90% statistics and < 1M events
2619
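The two example rules above could be expressed as predicates over workflow records. The following is a minimal sketch, not an existing console API: the `Workflow` record, its field names, and the `select` helper are hypothetical, and treating any site whose name contains `_CH_CERN` as a CERN site is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    # Hypothetical record; the field names are illustrative only.
    name: str
    unified_status: str
    failed_sites: set   # site names where jobs failed
    error_codes: set    # integer exit codes seen in the error report
    statistics: float   # completed fraction, 0.0 to 1.0
    events: int         # requested number of events


def failed_at_cern_with_91009(wf):
    # Assumption: a CERN site is any site whose name contains "_CH_CERN".
    return (
        wf.unified_status == "assistance-manual"
        and 91009 in wf.error_codes
        and any("_CH_CERN" in site for site in wf.failed_sites)
    )


def small_incomplete_at_tifr(wf):
    # "< 90% statistics and < 1M events" for workflows that failed at TIFR.
    return (
        "T2_IN_TIFR" in wf.failed_sites
        and wf.statistics < 0.90
        and wf.events < 1_000_000
    )


def select(workflows, rule):
    """Return the names of the workflows matched by a rule."""
    return [wf.name for wf in workflows if rule(wf)]
```

Each rule only selects workflows; the action to apply (ACDC without xrootd and splitting, or kill and clone) would still be chosen by the operator for the whole group.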

20+
# HLT and xrootd
21+
22+
- For a given failed task, if there is a necessary site which is in drain (marked in red and bold letters in the console):
23+
- Pre-select primary xrootd (don't touch the secondary xrootd at any step)
24+
- If either `T2_CH_CERN` or `T2_CH_CERN_HLT` is among the pre-selected sites, unselect them
25+
- It would be good to display a small message if console does this procedure in order to make it clear for new operators etc. It could be something like `There is a necessary site in drain, therefore enabling xrootd. Since xrootd is enabled, unselecting CERN and HLT since HLT cannot perform remote reads. We unselect CERN too, since jobs sent to CERN can be redirected to HLT internally`
26+
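The pre-selection rule in this section can be sketched as a single function. This is only an illustration under stated assumptions: the function name, its arguments, and the returned tuple are hypothetical and do not describe the console's actual code. Secondary xrootd is deliberately not touched.

```python
# Sites that must be unselected once primary xrootd is enabled.
CERN_SITES = {"T2_CH_CERN", "T2_CH_CERN_HLT"}

DRAIN_MESSAGE = (
    "There is a necessary site in drain, therefore enabling xrootd. "
    "Since xrootd is enabled, unselecting CERN and HLT since HLT cannot "
    "perform remote reads. We unselect CERN too, since jobs sent to CERN "
    "can be redirected to HLT internally"
)


def preselect_for_task(necessary_sites, draining_sites, preselected_sites):
    """Return (primary_xrootd, sites, message) for one failed task.

    Hypothetical helper: all argument names are assumptions.
    """
    if not set(necessary_sites) & set(draining_sites):
        # No necessary site is in drain: leave the pre-selection unchanged.
        return False, list(preselected_sites), None
    # A necessary site is in drain: enable primary xrootd and drop CERN and HLT.
    sites = [s for s in preselected_sites if s not in CERN_SITES]
    return True, sites, DRAIN_MESSAGE
```

Returning the message together with the decision would let the console display it whenever it applies the procedure.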
2727
# Automatic site reassignment for draining sites
2828
When sites go into drain, we have to manually reassign workflows to other sites. The rules for which sites to reassign the workflows to could be automated:
2929

manual operations.md

+53-4
@@ -1,14 +1,15 @@
11
# Manual operation procedures
22

3-
## Things to check before submitting ACDC
3+
## Permanent procedures
4+
### Things to check before submitting ACDC
45

56
- check if ReqMgr status is `completed`
67
- check if unified status is indeed `assistance-manual`, as sometimes there are delays in the cms-unified page
78
- look for existing jira tickets, if there is ongoing discussion, make sure the issues have been resolved before submitting ACDC
89
- check for solutions for specific errors in this [cheat sheet](https://docs.google.com/spreadsheets/d/12JBANxwzN0KWAV4o-yYxnA4Fe2juGXie5uiFXgF3bXo/edit#gid=0)
910
- check if there is an ongoing ACDC (Action: Pending on console)
1011
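The checklist above could be collected into one pre-submission check. This is a rough sketch only: the function and its arguments are hypothetical, the values are assumed to be looked up by the operator (or by some tool) beforehand, and the cheat-sheet lookup is left out because it is not a yes/no condition.

```python
def acdc_blockers(reqmgr_status, unified_status, jira_discussion_open, acdc_pending):
    """Return the reasons not to submit ACDC yet; an empty list means go ahead."""
    blockers = []
    if reqmgr_status != "completed":
        blockers.append("ReqMgr status is not completed")
    if unified_status != "assistance-manual":
        blockers.append("unified status is not assistance-manual")
    if jira_discussion_open:
        blockers.append("unresolved Jira discussion")
    if acdc_pending:
        blockers.append("ACDC already pending")
    return blockers
```

For example, `acdc_blockers("completed", "assistance-manual", False, False)` returns an empty list, meaning none of the listed conditions blocks the submission.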

11-
## Creating Jira tickets
12+
### Creating Jira tickets
1213

1314
- copy the full url of the workflow on Dima's site into description.
1415
for example, `https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_TOP-RunIISummer20UL18NanoAODv2-00233`
@@ -17,7 +18,7 @@
1718
- tag interested parties in the description and component field. (e.g. pdmv-conveners)
1819
- **set labels** (e.g. T2_IN_TIFR, 8028-FallbackFileOpenError)
1920

20-
## dealing with irrecoverable errors
21+
### dealing with irrecoverable errors
2122

2223
- if workflow not `ReReco`, aiming 90% stats,
2324
- if there are recoverable errors
@@ -27,6 +28,34 @@
2728
- else, **bypass**
2829
- else, aiming 100% stats, same procedure as the above branch
2930

31+
## Transient procedures
32+
### `50664` and `71304` workflows
33+
- Put (`50664-PerformanceKill` OR `71304-JobKilled`) AND (label for campaign e.g. ) as jira labels
34+
- wait for response
35+
- if no response for some time, reject
36+
37+
### `8028` for empty files
38+
When input files were produced okay, but later became empty, one will see this error message:
39+
```
40+
Fatal Exception (Exit code: 8028)
41+
An exception of category 'FallbackFileOpenError' occurred while
42+
[0] Constructing the EventProcessor
43+
[1] Constructing input source of type PoolSource
44+
[2] Calling RootFileSequenceBase::initTheFile()
45+
Additional Info:
46+
[a] XrdCl::File::Open(name='[root://cmsxrootd.hep.wisc.edu//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root](root://cmsxrootd.hep.wisc.edu//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root)', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] Too many attempts to gain dfs read access to the file
47+
' (errno=3011, code=400). No additional data servers were found.
48+
[b] Last URL tried: [root://cmsxrootd.hep.wisc.edu:1094//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root?tried=](root://cmsxrootd.hep.wisc.edu:1094//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root?tried=)
49+
[c] Problematic data server: cmsxrootd.hep.wisc.edu:1094
50+
[d] Disabled source: cmsxrootd.hep.wisc.edu:1094
51+
[e] Input file [root://cmsxrootd.hep.wisc.edu//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root](root://cmsxrootd.hep.wisc.edu//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root) could not be opened.
52+
Fallback Input file [root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root?source=glow](root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root?source=glow) also could not be opened.
53+
Original exception info is above; fallback exception info is below.
54+
[f] Fatal Root Error: @SUB=TStorageFactoryFile::ReadBuffer
55+
read from Storage::xread returned 0. Asked to read n bytes: 300 from offset: 0 with file size: 0
56+
```
57+
- don't ACDC, instead apply `EmptyInputFile` as the jira label
58+
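Recognising this case from an error report could be sketched as below. This is an assumption-laden illustration, not an existing tool: the function name is hypothetical, and the two markers it searches for (`Exit code: 8028` and `with file size: 0`) are simply taken from the example message above.

```python
import re

# Matches "with file size: 0" but not sizes that merely start with 0.
EMPTY_FILE_PATTERN = re.compile(r"with file size: 0\b")


def jira_label_for_8028(error_report):
    """Return "EmptyInputFile" if the report looks like an empty input file.

    Hypothetical helper: returns None when the report does not match,
    in which case the usual 8028 handling applies.
    """
    if "Exit code: 8028" in error_report and EMPTY_FILE_PATTERN.search(error_report):
        return "EmptyInputFile"
    return None
```

When the function returns `EmptyInputFile`, the workflow should not be sent to ACDC and the label goes on the Jira ticket instead.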
3059
# Sites
3160

3261
## site peculiarities
@@ -78,12 +107,32 @@ Neutrino and MinBias are the major categories, and MinBias is particularly large
78107

79108
To check if a workflow has secondary inputs, look for `minbias` in the error report, or check `MCPileup` on ReqMgr.
80109

81-
# workflow composition
110+
WMCore-agent will only look for input blocks that are protected by their account.
111+
112+
# workflow information
113+
114+
## workflow composition
115+
82116
workflows are made of **tasks**.
83117
Each task represents one step in the workflow production chain (e.g. GEN, SIM, DIGI...)
84118

85119
Tasks are divided into **jobs** to be run on a batching system (e.g. condor).
86120

121+
## checking workflow information
122+
123+
### prepid and request ID (workflow ID)
124+
125+
For MC workflows, prepid is contained in the request ID
126+
For `ReReco` workflows, enter the request ID on `ReqMgr` to get prepid in the JSON file
127+
128+
### worker node
129+
130+
To check the specific node (specific machine within a site) at which the workflow was run:
131+
- enter the workflow ID on [wmstats](https://cmsweb.cern.ch/wmstats/index.html),
132+
- click on the box under the "L" column
133+
- In the list of tasks under the workflow, click the "L" column box again for the task you want to check
134+
- In the information shown at the bottom of the page, search for "worker node"
135+
87136
# Computing Resources
88137
Each workflow can only take up a limited amount of computing resources before hitting PerformanceKill walls (exit codes `50664` and `50660`).
89138
