-
Notifications
You must be signed in to change notification settings - Fork 3.2k
Improve Windows ETW callback registration and fix issues #24877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Avoid multiple global callback multiple subscription that logs all the sessions. Only the first session to come up registeres it, the last session to go deregisters. Remove LazyInitializaton() as useless. The intent seems to be to retry on failure, but it never retries since it checks for NotInitialized, and even on failure it sets status to Initialized. Re-work callback registration. Right now it stores pointers to callbacks, but a global one repeatedly getting re-created and that creates a race condition. Store the callbacks in the map by a unique key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors ETW callback registration on Windows to ensure proper initialization, single global callback registration, per-session callback keys, and safe deregistration on constructor failure.
- Initialize all
EtwRegistrationManager
fields and replace raw pointers with keyed callbacks stored in maps - Register the global ETW provider callback once and per-session callbacks with unique string keys
- Add an RAII guard (
AutoETWDegistrar
) to clean up registrations if session construction fails
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
core/session/provider_bridge_ort.cc | Updated provider interface to accept a callback key and moved callback ownership |
core/session/inference_session.h / inference_session.cc | Added string keys, auto-deregistration guard, and global vs per-session callback logic |
core/providers/qnn/qnn_telemetry.h / qnn_telemetry.cc | Refactored QnnTelemetry to use unordered_map<string, callback> and keyed registration |
core/providers/qnn/qnn_execution_provider.h / qnn_execution_provider.cc | Switched QNN EP to generate and use a callback key for ETW registration |
core/platform/windows/telemetry.h / telemetry.cc | Refactored WindowsTelemetry to use keyed callbacks in an unordered_map |
core/platform/windows/logging/etw_sink.h / etw_sink.cc | Initialized new fields, removed old vector, and replaced callbacks with map |
core/framework/execution_providers.h / execution_providers.cc | Introduced string-keyed ETW callback registration in ExecutionProviders |
Comments suppressed due to low confidence (2)
onnxruntime/core/session/inference_session.h:143
- The static member
callback_ML_ORT_provider_
is no longer used after moving to keyed callbacks; consider removing it to reduce dead code.
static onnxruntime::WindowsTelemetry::EtwInternalCallback callback_ML_ORT_provider_;
onnxruntime/core/session/inference_session.h:144
- [nitpick] Naming is inconsistent with other components that use
callback_etw_key_
; consider renaming this member tocallback_ETWSink_key_
for consistency.
std::string callback_ETWSink_provider_key_; // Session Start Stop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Dmitri!
I know you already have done this, but please also detail what testing was done and how.
Please include comments about looking at the resulting .etl file and verifying the rundown, dynamic LOGS(), and QNN logs (in the case of QNN EP) are in the .etl file
Just good for future reference for anyone modifying this code.
Testing was performed using a usual build test for CPU and QNN providers. |
The WindowsTelemetry obj usually is a global var. Refers to: onnxruntime/core/platform/windows/telemetry.cc:93 in 6ab0185. [](commit_id = 6ab0185, deletion_comment = False) |
- Fixes a possibility of a deadlock between active_sessions_mutex_ and callbacks_mutex_ due to reversed order of acquisition. - Removes access to a global mutex from instance methods, change to atomics. - Moves ETW members to the end of InferenceSession class.
manager.level_ = Level; | ||
manager.keyword_ = MatchAnyKeyword; | ||
} | ||
std::lock_guard<std::mutex> lock(manager.provider_change_mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to make sure that the following is intentional:
In etw_sink.cc, the provider_change_mutex_
is no longer locked inside an inner scope that is separate from InvokeCallbacks
. However, this PR also makes the opposite change for qnn_telemetry.cc (i.e., puts the mutex inside an inner scope).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this matters. The code is littered with mutexes that are not always applied properly. Supposedly, that mutex would allow atomic change of all the 3 properties, but getters are piece meals.
Are there any tests we can add to prevent regression from the new behavior? |
} | ||
std::lock_guard<std::mutex> lock(manager.provider_change_mutex_); | ||
// Clarify: It is not clear at all why we need to assign the below 3 to the members | ||
// had we passed them along on the stack, no mutex would be necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the mutex ensure all 3 values are updated in sync (vs. 2 concurrent requests and some get updated by the first and some the second)?
Description
EtwRegistrationManager
. Make sure all fields initialized by a constructorML_Ort_Provider_Etw_Callback
once for all the sessions. The first session registers, the last one to go away removes the callback to Log all sessions. For this we make callbacks ref-counted inside the map they are stored in. This is done to prevent a deadlock whereactive_sessions_mutex_
andcallback_mutex_
are acquired from different threads in a different order.InferenceSession
constructor does not finish.Motivation and Context
This PR is inspired by #24773.
Current code exhibits multiple issues.
EtwRegistrationManager
constructor does not initialize all of the fields including theInitializationStatus
.active_sessions_lock_
andcallback
lock are not acquired/released in the same order by different threads which is a classic deadlock scenario.