|
1 | 1 | # Manual operation procedures
|
2 | 2 |
|
3 |
| -## Things to check before submitting ACDC |
| 3 | +## Permanent procedures |
| 4 | +### Things to check before submitting ACDC |
4 | 5 |
|
5 | 6 | - check if ReqMgr status is `completed`
|
6 | 7 | - check if unified status is indeed `assistance-manual`, as sometimes there are delays in the cms-unified page
|
7 | 8 | - look for existing Jira tickets; if there is an ongoing discussion, make sure the issues have been resolved before submitting ACDC
|
8 | 9 | - check for solutions for specific errors in this [cheat sheet](https://docs.google.com/spreadsheets/d/12JBANxwzN0KWAV4o-yYxnA4Fe2juGXie5uiFXgF3bXo/edit#gid=0)
|
9 | 10 | - check if there is an ongoing ACDC (Action: Pending on console)
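The checks above can be sketched as a small gate function. This is only an illustration: the `WorkflowState` fields are hypothetical names, the values are assumed to be filled in by hand from ReqMgr, the cms-unified page, Jira and the console, and the cheat-sheet lookup is left out because it needs human judgement.

```python
from dataclasses import dataclass


@dataclass
class WorkflowState:
    """Snapshot of the manual checks (all field names are hypothetical)."""
    reqmgr_status: str          # status shown on ReqMgr
    unified_status: str         # status shown on the cms-unified page
    open_jira_discussion: bool  # True if an existing ticket still has unresolved issues
    acdc_pending: bool          # True if the console shows "Action: Pending"


def blockers_before_acdc(state: WorkflowState) -> list:
    """Return the reasons not to submit an ACDC yet; an empty list means none were found."""
    blockers = []
    if state.reqmgr_status != "completed":
        blockers.append("ReqMgr status is not 'completed'")
    if state.unified_status != "assistance-manual":
        blockers.append("unified status is not 'assistance-manual'")
    if state.open_jira_discussion:
        blockers.append("unresolved discussion in an existing Jira ticket")
    if state.acdc_pending:
        blockers.append("an ACDC is already pending")
    return blockers
```

An empty list only means that none of the encoded checks failed; it does not replace looking at the cheat sheet.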
|
10 | 11 |
|
11 |
| -## Creating Jira tickets |
| 12 | +### Creating Jira tickets |
12 | 13 |
|
13 | 14 | - copy the full url of the workflow on Dima's site into description.
|
14 | 15 | for example, `https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_TOP-RunIISummer20UL18NanoAODv2-00233`
|
|
17 | 18 | - tag interested parties in the description and component field (e.g. pdmv-conveners).
|
18 | 19 | - **set labels** (e.g. T2_IN_TIFR, 8028-FallbackFileOpenError)
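As a rough sketch of the visible steps above, the ticket content could be assembled like this. The base URL comes from the example above; the function names and the dictionary keys are hypothetical and do not correspond to any Jira API.

```python
from urllib.parse import urlencode

# Base URL taken from the example in the list above.
CMSPRODMON_URL = "https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php"


def workflow_url(prep_id: str) -> str:
    """Build the full workflow URL that goes into the ticket description."""
    return f"{CMSPRODMON_URL}?{urlencode({'prep_id': prep_id})}"


def ticket_fields(prep_id: str, components: list, labels: list) -> dict:
    """Collect the fields to fill in by hand (key names are hypothetical)."""
    return {
        "description": workflow_url(prep_id),
        "components": list(components),  # e.g. pdmv-conveners
        "labels": list(labels),          # e.g. T2_IN_TIFR, 8028-FallbackFileOpenError
    }
```

For example, `workflow_url("task_TOP-RunIISummer20UL18NanoAODv2-00233")` reproduces the example URL above.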
|
19 | 20 |
|
20 |
| -## dealing with irrecoverable errors |
| 21 | +### dealing with irrecoverable errors |
21 | 22 |
|
22 | 23 | - if the workflow is not `ReReco`, aim for 90% stats,
|
23 | 24 | - if there are recoverable errors
|
|
27 | 28 | - else, **bypass**
|
28 | 29 | - else (the workflow is `ReReco`), aim for 100% stats, following the same procedure as the branch above
|
29 | 30 |
|
| 31 | +## Transient procedures |
| 32 | +### `50664` and `71304` workflows |
| 33 | +- Put (`50664-PerformanceKill` OR `71304-JobKilled`) AND (label for campaign e.g. ) as jira labels |
| 34 | +- wait for response |
| 35 | +- if no response for some time, reject |
| 36 | + |
| 37 | +### `8028` for empty files |
| 38 | +When input files were produced okay, but later became empty, one will see this error message: |
| 39 | +``` |
| 40 | +Fatal Exception (Exit code: 8028) |
| 41 | +An exception of category 'FallbackFileOpenError' occurred while |
| 42 | +[0] Constructing the EventProcessor |
| 43 | +[1] Constructing input source of type PoolSource |
| 44 | +[2] Calling RootFileSequenceBase::initTheFile() |
| 45 | +Additional Info: |
| 46 | +[a] XrdCl::File::Open(name='root://cmsxrootd.hep.wisc.edu//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] Too many attempts to gain dfs read access to the file |
| 47 | +' (errno=3011, code=400). No additional data servers were found. |
| 48 | +[b] Last URL tried: root://cmsxrootd.hep.wisc.edu:1094//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root?tried= |
| 49 | +[c] Problematic data server: cmsxrootd.hep.wisc.edu:1094 |
| 50 | +[d] Disabled source: cmsxrootd.hep.wisc.edu:1094 |
| 51 | +[e] Input file root://cmsxrootd.hep.wisc.edu//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root could not be opened. |
| 52 | +Fallback Input file root://cmsxrootd.fnal.gov//store/mc/RunIISummer20UL18MiniAOD/HToSSTo2Mu2Hadrons_MH125_MS1p1_ctauS100_TuneCP2_13TeV-powheg-pythia8/MINIAODSIM/106X_upgrade2018_realistic_v11_L1v1-v1/2520000/A639C8D7-B843-F34E-98F7-8C077C24B6B2.root?source=glow also could not be opened. |
| 53 | +Original exception info is above; fallback exception info is below. |
| 54 | +[f] Fatal Root Error: @SUB=TStorageFactoryFile::ReadBuffer |
| 55 | +read from Storage::xread returned 0. Asked to read n bytes: 300 from offset: 0 with file size: 0 |
| 56 | +``` |
| 57 | +- don't ACDC, instead apply `EmptyInputFile` as the jira label |
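The transient procedures above could be sketched as a small helper that suggests a Jira label from an error message. It assumes that every error report contains an `Exit code: NNNN` string like the one shown above, and it treats `with file size: 0` as the sign of an empty input file; `suggest_label` is a hypothetical name, and the campaign label still has to be added by hand.

```python
import re
from typing import Optional

EXIT_CODE_RE = re.compile(r"Exit code:\s*(\d+)")
FILE_SIZE_RE = re.compile(r"with file size:\s*(\d+)")

# Labels quoted from the procedures above.
KILL_LABELS = {"50664": "50664-PerformanceKill", "71304": "71304-JobKilled"}


def suggest_label(message: str) -> Optional[str]:
    """Suggest a Jira label for an error message, or None if no rule applies."""
    code_match = EXIT_CODE_RE.search(message)
    if code_match is None:
        return None
    code = code_match.group(1)
    if code == "8028":
        # Only the empty-file flavour of 8028 is handled here: do not ACDC.
        size_match = FILE_SIZE_RE.search(message)
        if size_match and int(size_match.group(1)) == 0:
            return "EmptyInputFile"
        return None
    return KILL_LABELS.get(code)
```

A `None` result means the message does not match any of these transient cases and should be handled through the permanent procedures.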
| 58 | + |
30 | 59 | # Sites
|
31 | 60 |
|
32 | 61 | ## site peculiarities
|
@@ -78,12 +107,32 @@ Neutrino and MinBias are the major categories, and MinBias is particularly large
|
78 | 107 |
|
79 | 108 | To check if a workflow has secondary inputs, look for `minbias` in the error report, or check `MCPileup` on ReqMgr.
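The check described above can be sketched as follows. It assumes that the error report is available as plain text and that the ReqMgr request has been loaded into a dictionary with `MCPileup` as a top-level key; for multi-task requests the key may sit inside the per-task settings instead, which this sketch does not handle. The case-insensitive match is also a choice made here, not something the procedure prescribes.

```python
def has_secondary_input(error_report: str, request: dict) -> bool:
    """Return True if the workflow appears to read a secondary (pileup) input."""
    # Rule 1: the error report mentions minbias (matched case-insensitively here).
    if "minbias" in error_report.lower():
        return True
    # Rule 2: the ReqMgr request defines a non-empty MCPileup entry.
    return bool(request.get("MCPileup"))
```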
|
80 | 109 |
|
81 |
| -# workflow composition |
| 110 | +WMCore-agent will only look for input blocks that are protected by their account. |
| 111 | + |
| 112 | +# workflow information |
| 113 | + |
| 114 | +## workflow composition |
| 115 | + |
82 | 116 | workflows are made of **tasks**.
|
83 | 117 | Each task represents one step in the workflow production chain (e.g. GEN, SIM, DIGI...)
|
84 | 118 |
|
85 | 119 | Tasks are divided into **jobs** to be run on a batching system (e.g. condor).
|
86 | 120 |
|
| 121 | +## checking workflow information |
| 122 | + |
| 123 | +### prepid and request ID (workflow ID) |
| 124 | + |
| 125 | +For MC workflows, the prepid is contained in the request ID. |
| 126 | +For `ReReco` workflows, enter the request ID on `ReqMgr` to get the prepid from the JSON file. |
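A minimal sketch of both lookups is given below. The regular expression is an assumption about what a prepid looks like (an optional `task_` prefix, a group name, a campaign name and a five-digit number, as in `task_TOP-RunIISummer20UL18NanoAODv2-00233`), and the `PrepID` key is assumed to be the field name in the ReqMgr JSON; check both against real requests before relying on them.

```python
import re
from typing import Optional

# Assumed prepid shape: [task_]<GROUP>-<Campaign>-<5 digits>
PREPID_RE = re.compile(r"(?:task_)?[A-Z0-9]+-[A-Za-z0-9_]+-\d{5}")


def prepid_from_request_id(request_id: str) -> Optional[str]:
    """MC workflows: pull the prepid out of the request ID, if present."""
    match = PREPID_RE.search(request_id)
    return match.group(0) if match else None


def prepid_from_reqmgr_json(request: dict) -> Optional[str]:
    """ReReco workflows: read the prepid from the ReqMgr JSON (key name assumed)."""
    return request.get("PrepID")
```

If the first function returns `None` for an MC workflow, the request ID does not follow the assumed pattern and the ReqMgr lookup can be used instead.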
| 127 | + |
| 128 | +### worker node |
| 129 | + |
| 130 | +To check the specific node (specific machine within a site) at which the workflow was run: |
| 131 | +- enter the workflow ID on [wmstats](https://cmsweb.cern.ch/wmstats/index.html), |
| 132 | +- click on the box under the "L" column |
| 133 | +- In the list of tasks under the workflow, click the "L" column box again for the task you want to check |
| 134 | +- In the information shown at the bottom of the page, search for "worker node" |
| 135 | + |
87 | 136 | # Computing Resources
|
88 | 137 | Each workflow can only take up a limited amount of computing resources before hitting PerformanceKill walls (exit codes `50664` and `50660`).
|
89 | 138 |
|
|