-
Notifications
You must be signed in to change notification settings - Fork 439
chore: leak uploader config to fix crashes on uwsgi tests #13520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Bootstrap import analysisComparison of import times between this PR and base. SummaryThe average import time from this PR is: 240 ± 3 ms. The average import time from base is: 241 ± 3 ms. The import time difference between this PR and base is: -1.4 ± 0.1 ms. Import time breakdownThe following import paths have shrunk:
|
BenchmarksBenchmark execution time: 2025-05-30 22:04:44 Comparing candidate commit 209e80a in PR branch Found 0 performance improvements and 2 performance regressions! Performance is the same for 501 metrics, 5 unstable metrics. scenario:iastaspectssplit-splitlines_aspect
scenario:telemetryaddmetric-1-gauge-metric-1-times
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Generally looks good to me, assuming CI is green. Just left a few non-blocking comments/remarks.
ddtrace/internal/datadog/profiling/dd_wrapper/include/uploader.hpp
Outdated
Show resolved
Hide resolved
ddtrace/internal/datadog/profiling/dd_wrapper/src/uploader_builder.cpp
Outdated
Show resolved
Hide resolved
|
||
static UploaderConfig* get_instance() | ||
{ | ||
if (instance == nullptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine as is, but could be trouble if we're accessing the uploader config concurrently. I don't think that's the case today, but just something to be aware of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I thought so too. I think we're ok as we're calling uploader_builder from a single thread, but then I also wondered why we would need a mutex here https://github.com/DataDog/dd-trace-py/blame/58bd69c5e0273349ebecc09c0d382a4e742ef2a5/ddtrace/internal/datadog/profiling/dd_wrapper/include/uploader_builder.hpp#L16
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked to closely, but I guess we need that lock for now if we're updating tags concurrently. But that should, I think, only happen after we've already accessed the config at least once when starting the profiler. WDYT?
350aad9
to
f3b81dd
Compare
**tl;dr** 1. Use `uwsgi>=2.0.30` which has unbit/uwsgi#2726 2. When using `uwsgi<2.0.30` with `--lazy-apps`, also set `--skip-atexit`, if you see crashes when workers exit. ----------------------------------------------------- The following explanation happens on `uwsgi<2.0.30` When `uwsgi` is configured with `--lazy-apps`, master doesn't load the application, and simply `fork()` to create workers. Then, the workers load the application. This is different from when it's not set. Without `--lazy-apps`, master process first load the application then `fork()`. dd-trace-py profiler uses Python stdlib `atexit` module to export the profile when the process exits. The code run to export references global static `std::string` variables. However, when `--lazy-apps` option is set, they are destructed before when the profiler needs them to export the profile, and this leads to an undefined behavior. When `--lazy-apps` is not set the following sequence of events happens 1. uwsgi master process, initialize Python plugin, loads the application. 3. The application calls `import ddtrace.profiling.auto` and this leads all native extensions imported to the Python app. During the process, the constructors for global static variables are called and their destructors would be registered to be called by `atexit()` when the program exits. 4. The application also registers `Profiler.stop()` function to `uwsgi` python module's `atexit` attribute. 5. Then `uwsgi` calls `atexit(uwsgi_plugins_atexit)` to register each plugins atexit callbacks, for Python `uwsgi_python_atexit()` function. And the function will call `Py_Finalize()` 6. Master `fork()`s workers 7. When SIGTERM is received at the master, it sends SIGINT to its workers 8. In each worker, `uwsgi_python_atexit()` is called, invoking `Profiler.stop()` and `Py_Finalize()` is called, then destructors for global static variables are called. When `--lazy-apps` is set 1. uwsgi matser process still intializes Python interpreter, but skips loading the application 3. Then `uwsgi` calls `atexit(uwsgi_plugins_atexit)` to register each plugins atexit callbacks, for Python `uwsgi_python_atexit()` function. And the function will call `Py_Finalize()` 4. Master `fork()`s workers 5. Workers load the application. The application calls `import ddtrace.profiling.auto` and this leads all native extensions imported to the Python app. During the process, the constructors for global static variables are called and their destructors would be registered to be called by C `atexit()` when the program exits. 6. When `--lazy-apps` is set, ddtrace simply uses Python's `atexit` module to call `Profiler.stop()` when the program exits. 7. When SIGTERM is received at the master, it sends SIGINT to its workers 8. In each worker, since global destructors are registered later than `uwsgi_python_atexit()` , they're called first. 9. Then `Py_Finalize()` is called in part of `uwsgi_python_atexit()` calling `Profiler.stop()`. Then it will access those destructed global static strings leading to undefined behavior. To workaround this I have tried three different approaches. 1. #11125: Simply pass the needed `std::string` from Python side to native when exporting pprof file for test 2. #13520: Leak the strings needed for exporting profile to prevent them from destructed 3. #13535: Bundle all needed strings needed for exporting profile, and create them whenever we try to upload. The first approach only addresses issues from tests. The second approach mimics what's suggested in pytorch/pytorch#42125, but since there are a few other global static variables we have throughout the profiler, this still showed crashes from test. And we got similar results from the third approach, as we did from the second. Another way to completely work around this problem is setting `--skip-atexit` as suggested in ansible/ansible-ai-connect-service#248. When `--skip-atexit` is set, `uwsgi` uses `_exit()` instead of `exit()`, and the former doesn't call any functions registered with `atexit()` or `on_exit()`. So, the static variable destructors are never called, and the app no longer crashes. The profiler also doesn't flush the profile via `atexit()` when `--skip-atexit` is set, but I'd argue losing a single profile at the end is better than crashing the application even if the application is exiting. So we'd need to require our customers to set `--skip-atexit` when `--lazy-apps` is set. ## Checklist - [x] PR author has checked that all the criteria below are met - The PR description includes an overview of the change - The PR description articulates the motivation for the change - The change includes tests OR the PR description describes a testing strategy - The PR description notes risks associated with the change, if any - Newly-added code is easy to change - The change follows the [library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) - The change includes or references documentation updates if necessary - Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)) ## Reviewer Checklist - [x] Reviewer has checked that all the criteria below are met - Title is accurate - All changes are related to the pull request's stated goal - Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes - Testing strategy adequately addresses listed risks - Newly-added code is easy to change - Release note makes sense to a user of the library - If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment - Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
Closing per #13550 |
While testing stack v1 with the libdatadog exporter in #13464, notice the following. When
uwsgi
is invoked with--lazy-apps
, tests failed to export pprof file as they're getting corruptedfilename
here inuploader.cpp
.--lazy-apps
configuresuwsgi
to load the application in each worker instead of loading once in the primary process and callingfork()
. This also somehow changes how those workers are terminated and seemed like thefilename
was copied from astd::string
that is already destructed. It comes fromoutput_filename
field inuploader_builder.hpp
from this line.To address this problem, this PR intentionally leaks all the strings used/needed to build the uploader and export the profile, by using static singleton pattern, and explicitly creating the object with
new
. The initial attempt was in #11125, but then realized this PR would have less diff, and also addresses other strings.This might address some of the crashes that come from uwsgi with libdatadog exporter.
Checklist
Reviewer Checklist