Skip to content

Improve Windows ETW callback registration and fix issues #24877

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

yuslepukhin
Copy link
Member

@yuslepukhin yuslepukhin commented May 27, 2025

Description

  • EtwRegistrationManager. Make sure all fields initialized by a constructor
  • Register a callback object instead of a pointer to it. Store it in the map with a session unique key.
  • Register ML_Ort_Provider_Etw_Callback once for all the sessions. The first session registers, the last one to go away removes the callback to Log all sessions. For this we make callbacks ref-counted inside the map they are stored in. This is done to prevent a deadlock where active_sessions_mutex_ and callback_mutex_ are acquired from different threads in a different order.
  • Create a registration guard to remove callbacks in case InferenceSession constructor does not finish.

Motivation and Context

This PR is inspired by #24773.

Current code exhibits multiple issues.

  • EtwRegistrationManager constructor does not initialize all of the fields including the InitializationStatus.
  • Global callback object is registered and re-created by every session. Customers sometimes run thousands of models in the same sessions which results in a quadratic ETW costs. The callback object is destroyed and recreated every time a session is created.
  • There is a chance that InferenceSession constructor does not finish, and the callback would remain registered. This may result in intermittent hard to diagnose bugs.
  • active_sessions_lock_ and callback lock are not acquired/released in the same order by different threads which is a classic deadlock scenario.

  Avoid multiple global callback multiple subscription that logs all the sessions.
  Only the first session to come up registeres it, the last session to go deregisters.
  Remove LazyInitializaton() as useless. The intent seems to be to retry
  on failure, but it never retries since it checks for NotInitialized, and
  even on failure it sets status to Initialized.
  Re-work callback registration. Right now it stores pointers to callbacks,
  but a global one repeatedly getting re-created and that creates a race condition.
  Store the callbacks in the map by a unique key.
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors ETW callback registration on Windows to ensure proper initialization, single global callback registration, per-session callback keys, and safe deregistration on constructor failure.

  • Initialize all EtwRegistrationManager fields and replace raw pointers with keyed callbacks stored in maps
  • Register the global ETW provider callback once and per-session callbacks with unique string keys
  • Add an RAII guard (AutoETWDegistrar) to clean up registrations if session construction fails

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
core/session/provider_bridge_ort.cc Updated provider interface to accept a callback key and moved callback ownership
core/session/inference_session.h / inference_session.cc Added string keys, auto-deregistration guard, and global vs per-session callback logic
core/providers/qnn/qnn_telemetry.h / qnn_telemetry.cc Refactored QnnTelemetry to use unordered_map<string, callback> and keyed registration
core/providers/qnn/qnn_execution_provider.h / qnn_execution_provider.cc Switched QNN EP to generate and use a callback key for ETW registration
core/platform/windows/telemetry.h / telemetry.cc Refactored WindowsTelemetry to use keyed callbacks in an unordered_map
core/platform/windows/logging/etw_sink.h / etw_sink.cc Initialized new fields, removed old vector, and replaced callbacks with map
core/framework/execution_providers.h / execution_providers.cc Introduced string-keyed ETW callback registration in ExecutionProviders
Comments suppressed due to low confidence (2)

onnxruntime/core/session/inference_session.h:143

  • The static member callback_ML_ORT_provider_ is no longer used after moving to keyed callbacks; consider removing it to reduce dead code.
static onnxruntime::WindowsTelemetry::EtwInternalCallback callback_ML_ORT_provider_;

onnxruntime/core/session/inference_session.h:144

  • [nitpick] Naming is inconsistent with other components that use callback_etw_key_; consider renaming this member to callback_ETWSink_key_ for consistency.
std::string callback_ETWSink_provider_key_;  // Session Start Stop

Copy link
Contributor

@ivberg ivberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Dmitri!

I know you already have done this, but please also detail what testing was done and how.

Please include comments about looking at the resulting .etl file and verifying the rundown, dynamic LOGS(), and QNN logs (in the case of QNN EP) are in the .etl file

Just good for future reference for anyone modifying this code.

@yuslepukhin
Copy link
Member Author

Testing was performed using a usual build test for CPU and QNN providers.
I also used a special drop in package provided by @ivberg to run several concurrent sessions along with profiling. The .ETL file collected has the following events:

image

@snnn
Copy link
Member

snnn commented May 29, 2025

callbacks_.clear();

The WindowsTelemetry obj usually is a global var.
This callbacks_ var is also a global var(it is not a member var of the WindowsTelemetry class)
Then, how do you know which one will get destroyed first?


Refers to: onnxruntime/core/platform/windows/telemetry.cc:93 in 6ab0185. [](commit_id = 6ab0185, deletion_comment = False)

- Fixes a possibility of a deadlock between active_sessions_mutex_ and callbacks_mutex_
  due to reversed order of acquisition.
- Removes access to a global mutex from instance methods, change to atomics.
- Moves ETW members to the end of InferenceSession class.
manager.level_ = Level;
manager.keyword_ = MatchAnyKeyword;
}
std::lock_guard<std::mutex> lock(manager.provider_change_mutex_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to make sure that the following is intentional:

In etw_sink.cc, the provider_change_mutex_ is no longer locked inside an inner scope that is separate from InvokeCallbacks. However, this PR also makes the opposite change for qnn_telemetry.cc (i.e., puts the mutex inside an inner scope).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this matters. The code is littered with mutexes that are not always applied properly. Supposedly, that mutex would allow atomic change of all the 3 properties, but getters are piece meals.

@skottmckay
Copy link
Contributor

Are there any tests we can add to prevent regression from the new behavior?

}
std::lock_guard<std::mutex> lock(manager.provider_change_mutex_);
// Clarify: It is not clear at all why we need to assign the below 3 to the members
// had we passed them along on the stack, no mutex would be necessary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the mutex ensure all 3 values are updated in sync (vs. 2 concurrent requests and some get updated by the first and some the second)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants