[RFC] XrdAdaptor: Use inactive sources first before trying to open a new source when handling a failed request #47593

Draft
wants to merge 2 commits into
base: master

Conversation

makortel
Contributor

PR description:

In the context of #43162 and #46086 (comment), this PR changes the XrdAdaptor behavior for request failures. Previously, after disabling the source that caused the request failure, the code passed the request on to another active source and, if there were no active sources, tried to open a new source. This PR changes the logic to:

  • Pass request to another active source
  • If there are no active sources, pass the request to the best inactive source
  • If there are no active or inactive sources, try to open a new source
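
The three-step order above can be sketched as follows. This is a minimal illustration and not the code in this PR: `Source`, `Decision`, `Fallback`, and `chooseFallback()` are hypothetical names, and the only property taken from the existing code is that a lower quality value counts as better.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an XrdAdaptor source; only what the sketch needs.
struct Source {
  std::string id;
  unsigned int quality;  // assumption: lower value means better source
};

using SourcePtr = std::shared_ptr<Source>;

enum class Fallback { ActiveSource, InactiveSource, NewSource };

struct Decision {
  Fallback kind;
  SourcePtr source;  // null when the caller has to open a new source
};

// Hypothetical helper: disable `failed`, then decide where the request goes next.
Decision chooseFallback(std::vector<SourcePtr>& active,
                        std::vector<SourcePtr>& inactive,
                        const SourcePtr& failed) {
  // The failed source is dropped from the active container first.
  active.erase(std::remove(active.begin(), active.end(), failed), active.end());

  // 1. Pass the request to another active source.
  if (!active.empty()) {
    return {Fallback::ActiveSource, active.front()};
  }

  // 2. No active source left: promote the best (lowest-quality-value) inactive one.
  if (!inactive.empty()) {
    auto best = std::min_element(
        inactive.begin(), inactive.end(),
        [](const SourcePtr& a, const SourcePtr& b) { return a->quality < b->quality; });
    SourcePtr promoted = *best;
    inactive.erase(best);
    active.push_back(promoted);
    return {Fallback::InactiveSource, promoted};
  }

  // 3. Neither active nor inactive sources: the caller must try to open a new one.
  return {Fallback::NewSource, nullptr};
}
```

Calling `chooseFallback()` repeatedly with the source it last returned walks through all three stages until only the "open a new source" option remains.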

The first commit generalizes the code that removes the failed source from the active sources container. I found it confusing that the code checked only the first 2 active sources (even if there are only ever at most 2 active sources). Admittedly, if there were more than 2 active sources, the commit would change the behavior, but I would argue for the better.
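
As an illustration of the kind of generalization meant here, the sketch below removes a failed source from a container of any size instead of inspecting only the first two elements. `ActiveEntry` and `removeFailedSource()` are hypothetical names for this example and are not taken from the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a source object.
struct ActiveEntry {
  std::string id;
};

// Remove `failed` wherever it sits in the container.
// Returns true if the source was found and removed.
bool removeFailedSource(std::vector<std::shared_ptr<ActiveEntry>>& active,
                        const std::shared_ptr<ActiveEntry>& failed) {
  auto it = std::find(active.begin(), active.end(), failed);
  if (it == active.end()) {
    return false;  // not an active source; nothing to do
  }
  active.erase(it);
  return true;
}
```

With at most 2 active sources this behaves like a check of the first two elements; with more, it also finds a failed source at a later position.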

Resolves cms-sw/framework-team#1306

PR validation:

Code compiles.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Possibly to be backported to 15_0_X.

@cmsbuild
Contributor

cmsbuild commented Mar 13, 2025

cms-bot internal usage

@cmsbuild
Contributor

// similar logic as in checkSourcesImpl()
// assume the "short open delay" doesn't matter in case of a request failure
auto bestInactiveSource =
std::min_element(m_inactiveSources.begin(),


Is the best quality really the lowest number?

Contributor Author


As far as I can tell, yes. See e.g. existing code

auto bestInactiveSource =
    std::min_element(eligibleInactiveSources.begin(),
                     eligibleInactiveSources.end(),
                     [](const std::shared_ptr<Source> &s1, const std::shared_ptr<Source> &s2) {
                       return s1->getQuality() < s2->getQuality();
                     });

There is some discussion about the quality metric in https://github.com/cms-sw/cmssw/blob/master/Utilities/XrdAdaptor/doc/multisource_algorithm_design.txt
and maybe it is computed in https://github.com/cms-sw/cmssw/blob/master/Utilities/XrdAdaptor/src/QualityMetric.cc, but I haven't tried to understand what is done there.


OK, thanks, sounds good! 👍

@makortel
Contributor Author

FYI @osschar

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0f2820/44973/summary.html
COMMIT: 27252f2
CMSSW: CMSSW_15_1_X_2025-03-13-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47593/44973/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3927484
  • DQMHistoTests: Total failures: 10
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3927454
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

…r::requestFailure()

According to Brian Bockelman this was the intention, but apparently
was never coded.

@cmsbuild
Contributor

Pull request #47593 was updated.

@makortel
Contributor Author

With @bockjoo's test #43162 (comment), the present state of this PR results in the following actions on the sources. The verbosity of the XrdAdaptor printouts was increased with

process.MessageLogger.cerr.threshold = "INFO"
process.MessageLogger.cerr.XrdAdaptorInternal = dict()
process.MessageLogger.cerr.XrdAdaptorLvl1 = dict()
process.MessageLogger.cerr.XrdAdaptorLvl3 = dict()

%MSG-i XrdAdaptorInternal:  file_open 14-Mar-2025 15:28:27 CET pre-events
Reading from new server ceph-gw10.gridpp.rl.ac.uk:1094 at site T1_UK_RAL
%MSG
%MSG-i XrdAdaptorLvl1:  file_open 14-Mar-2025 15:28:27 CET pre-events
Serving data from:    [1] ceph-gw10.gridpp.rl.ac.uk
%MSG

%MSG-i XrdAdaptorLvl3:  AfterFile 14-Mar-2025 15:28:46 CET pre-events
Looking for an additional source because the number of active sources is smaller than 2
%MSG
Trying to open URL: root://xrootd-cms.infn.it//store/data/Run2024D/ParkingDoubleMuonLowMass3/RAW/v1/000/380/564/00000/d2acc894-3939-4155-93e6-403117e74b6c.root?tried=ceph-gw10.gridpp.rl.ac.uk&triedrc=resel

%MSG-i XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 15:28:47 CET pre-events
Reading from new server ceph-svc20.gridpp.rl.ac.uk:1094 at site T1_UK_RAL
%MSG
Successfully opened new source: ceph-svc20.gridpp.rl.ac.uk:1094 (site T1_UK_RAL)
%MSG-i XrdAdaptorLvl1:  (NoModuleName) 14-Mar-2025 15:28:47 CET pre-events
Serving data from:    [1] ceph-gw10.gridpp.rl.ac.uk   [2] ceph-svc20.gridpp.rl.ac.uk
%MSG

%MSG-w XrdAdaptorLvl3:  AfterFile 14-Mar-2025 15:28:48 CET pre-events
Deactivating ceph-gw10.gridpp.rl.ac.uk from active sources because its quality (1869) is higher than 260 and 4 times larger than the other active server ceph-svc20.gridpp.rl.ac.uk (260)
%MSG
Removing ceph-gw10.gridpp.rl.ac.uk from active sources due to poor quality (1869 vs 260)
%MSG-i XrdAdaptorLvl1:  AfterFile 14-Mar-2025 15:28:48 CET pre-events
Serving data from:    [1] ceph-svc20.gridpp.rl.ac.uk
%MSG

%MSG-i XrdAdaptorLvl3:  AfterFile 14-Mar-2025 15:30:28 CET pre-events
Looking for an additional source because the number of active sources is smaller than 2
%MSG
Trying to open URL: root://xrootd-cms.infn.it//store/data/Run2024D/ParkingDoubleMuonLowMass3/RAW/v1/000/380/564/00000/d2acc894-3939-4155-93e6-403117e74b6c.root?tried=ceph-svc20.gridpp.rl.ac.uk,ceph-gw10.gridpp.rl.ac.uk&triedrc=resel

%MSG-i XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 15:30:30 CET pre-events
Reading from new server ceph-gw11.gridpp.rl.ac.uk:1094 at site T1_UK_RAL
%MSG
Successfully opened new source: ceph-gw11.gridpp.rl.ac.uk:1094 (site T1_UK_RAL)
%MSG-i XrdAdaptorLvl1:  (NoModuleName) 14-Mar-2025 15:30:30 CET pre-events
Serving data from:    [1] ceph-svc20.gridpp.rl.ac.uk   [2] ceph-gw11.gridpp.rl.ac.uk
%MSG

%MSG-w XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 15:33:43 CET pre-events
XrdRequestManager::handle(name='root://xrootd-cms.infn.it//store/data/Run2024D/ParkingDoubleMuonLowMass3/RAW/v1/000/380/564/00000/d2acc894-3939-4155-93e6-403117e74b6c.root) failure when reading from ceph-svc20.gridpp.rl.ac.uk:1094 (site T1_UK_RAL); failed with error '[ERROR] Operation expired' (errno=0, code=206).
%MSG
%MSG-w XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 15:33:43 CET pre-events
Request failure when reading from ceph-svc20.gridpp.rl.ac.uk:1094 (site T1_UK_RAL)
%MSG
%MSG-i XrdAdaptorLvl1:  (NoModuleName) 14-Mar-2025 15:33:43 CET pre-events
Serving data from:    [1] ceph-gw11.gridpp.rl.ac.uk
%MSG

%MSG-w XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 15:57:58 CET pre-events
XrdRequestManager::handle(name='root://xrootd-cms.infn.it//store/data/Run2024D/ParkingDoubleMuonLowMass3/RAW/v1/000/380/564/00000/d2acc894-3939-4155-93e6-403117e74b6c.root) failure when reading from ceph-gw11.gridpp.rl.ac.uk:1094 (site T1_UK_RAL); failed with error '[ERROR] Operation expired' (errno=0, code=206).
%MSG
%MSG-w XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 15:57:58 CET pre-events
Request failure when reading from ceph-gw11.gridpp.rl.ac.uk:1094 (site T1_UK_RAL)
%MSG
%MSG-i XrdAdaptorLvl1:  (NoModuleName) 14-Mar-2025 15:57:58 CET pre-events
Serving data from:
%MSG

%MSG-i XrdAdaptorLvl1:  (NoModuleName) 14-Mar-2025 15:57:58 CET pre-events
Serving data from:    [1] ceph-gw10.gridpp.rl.ac.uk
%MSG

%MSG-w XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 16:06:40 CET pre-events
XrdRequestManager::handle(name='root://xrootd-cms.infn.it//store/data/Run2024D/ParkingDoubleMuonLowMass3/RAW/v1/000/380/564/00000/d2acc894-3939-4155-93e6-403117e74b6c.root) failure when reading from ceph-gw10.gridpp.rl.ac.uk:1094 (site T1_UK_RAL); failed with error '[ERROR] Operation expired' (errno=0, code=206).
%MSG
%MSG-w XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 16:06:40 CET pre-events
Request failure when reading from ceph-gw10.gridpp.rl.ac.uk:1094 (site T1_UK_RAL)
%MSG
%MSG-i XrdAdaptorLvl1:  (NoModuleName) 14-Mar-2025 16:06:40 CET pre-events
Serving data from:
%MSG

Trying to open URL: root://xrootd-cms.infn.it//store/data/Run2024D/ParkingDoubleMuonLowMass3/RAW/v1/000/380/564/00000/d2acc894-3939-4155-93e6-403117e74b6c.root?tried=ceph-svc20.gridpp.rl.ac.uk,ceph-gw11.gridpp.rl.ac.uk,ceph-gw10.gridpp.rl.ac.uk
[2025-03-14 16:09:36.079737 +0100][Error  ][XRootD            ] [cms-xrd-global.cern.ch:1094] Unable to get the response to request kXR_open (file: /store/data/Run2024D/ParkingDoubleMuonLowMass3/RAW/v1/000/380/564/00000/d2acc894-3939-4155-93e6-403117e74b6c.root?tried=+1213llrxrd-redir.in2p3.fr,+1213xrootd-redic.pi.infn.it,+1213xrootd011213xrootd-cms-uk.gridpp.r>
Got failure when trying to open a new source
%MSG-w XrdAdaptorInternal:  (NoModuleName) 14-Mar-2025 16:09:36 CET pre-events
Caught a CMSSW exception when running connection recovery.
%MSG
----- Begin Fatal Exception 14-Mar-2025 16:09:36 CET-----------------------
An exception of category 'FileReadError' occurred while
   [0] Calling InputSource::getNextItemType
   [1] Reading branch EventAuxiliary
   [2] Calling XrdFile::readv()
   [3] XrdAdaptor::ClientRequest::HandleResponse() failure while running connection recovery
   [4] Handling XrdAdaptor::RequestManager::requestFailure()
   [5] In XrdAdaptor::RequestManager::OpenHandler::HandleResponseWithHosts()
Exception Message:
XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/data/Run2024D/ParkingDoubleMuonLowMass3/RAW/v1/000/380/564/00000/d2acc894-3939-4155-93e6-403117e74b6c.root', flags=0x10, permissions=0660) => error '[ERROR] Operation expired' (errno=0, code=206)
   Additional Info:
      [a] Original error: '[ERROR] Operation expired' (errno=0, code=206, source=ceph-gw10.gridpp.rl.ac.uk:1094 (site T1_UK_RAL)).
      [b] Original failed source is ceph-gw10.gridpp.rl.ac.uk:1094 (site T1_UK_RAL)
      [c] Disabled source: ceph-gw10.gridpp.rl.ac.uk:1094
      [d] Disabled source: ceph-gw11.gridpp.rl.ac.uk:1094
      [e] Disabled source: ceph-svc20.gridpp.rl.ac.uk:1094
----- End Fatal Exception -------------------------------------------------

@makortel
Contributor Author

To summarize #47593 (comment):

  • Open initial source ceph-gw10.gridpp.rl.ac.uk
  • Multi-source scanning opens a second source ceph-svc20.gridpp.rl.ac.uk
  • ceph-gw10.gridpp.rl.ac.uk is deactivated because of its quality
  • Multi-source scanning opens a third source ceph-gw11.gridpp.rl.ac.uk
  • Read request from ceph-svc20.gridpp.rl.ac.uk fails, and the source is disabled
  • Read request from ceph-gw11.gridpp.rl.ac.uk fails, and the source is disabled
  • Inactive source ceph-gw10.gridpp.rl.ac.uk is "promoted" to active (the Serving data from printout at 15:57:58)
  • Read request from ceph-gw10.gridpp.rl.ac.uk fails, and the source is disabled
  • Attempt to open a new source fails, and the job is terminated

At least this private test demonstrates the code in this PR has an impact.
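
For reference, the quality-based deactivation in the third bullet can be read off the log message ("quality (1869) is higher than 260 and 4 times larger than the other active server ... (260)"). The sketch below encodes that reading only. The constant names and `shouldDeactivate()` are hypothetical, and treating 260 and 4 as fixed thresholds is an assumption based on the log text, not on the source code.

```cpp
#include <cassert>

// Assumed thresholds, taken from the wording of the log message.
constexpr unsigned int kQualityFloor = 260;
constexpr unsigned int kQualityRatio = 4;

// True if a source with quality `q` would be deactivated in favour of a
// source with quality `other` (higher value = worse source).
constexpr bool shouldDeactivate(unsigned int q, unsigned int other) {
  return q > kQualityFloor && q > kQualityRatio * other;
}
```

With the values from the log, `shouldDeactivate(1869, 260)` holds, since 1869 exceeds both 260 and 4 × 260 = 1040.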

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0f2820/44982/summary.html
COMMIT: 688b9dc
CMSSW: CMSSW_15_1_X_2025-03-14-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/47593/44982/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 2 lines from the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3927607
  • DQMHistoTests: Total failures: 75
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3927512
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found
