Skip to content

chore: leak uploader config to fix crashes on uwsgi tests #13520

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 67 commits into from

Conversation

taegyunkim
Copy link
Contributor

@taegyunkim taegyunkim commented May 28, 2025

While testing stack v1 with the libdatadog exporter in #13464, notice the following. When uwsgi is invoked with --lazy-apps, tests failed to export pprof file as they're getting corrupted filename here in uploader.cpp. --lazy-apps configures uwsgi to load the application in each worker instead of loading once in the primary process and calling fork(). This also somehow changes how those workers are terminated and seemed like the filename was copied from a std::string that is already destructed. It comes from output_filename field in uploader_builder.hpp from this line.

To address this problem, this PR intentionally leaks all the strings used/needed to build the uploader and export the profile, by using static singleton pattern, and explicitly creating the object with new. The initial attempt was in #11125, but then realized this PR would have less diff, and also addresses other strings.

This might address some of the crashes that come from uwsgi with libdatadog exporter.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

@taegyunkim taegyunkim requested a review from a team as a code owner May 28, 2025 14:07
Copy link
Contributor

github-actions bot commented May 28, 2025

CODEOWNERS have been resolved as:

tests/profiling_v2/test_uwsgi.py                                        @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/include/uploader.hpp      @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/include/uploader_builder.hpp  @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/src/ddup_interface.cpp    @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/src/uploader.cpp          @DataDog/profiling-python
ddtrace/internal/datadog/profiling/dd_wrapper/src/uploader_builder.cpp  @DataDog/profiling-python

Copy link
Contributor

github-actions bot commented May 28, 2025

Bootstrap import analysis

Comparison of import times between this PR and base.

Summary

The average import time from this PR is: 240 ± 3 ms.

The average import time from base is: 241 ± 3 ms.

The import time difference between this PR and base is: -1.4 ± 0.1 ms.

Import time breakdown

The following import paths have shrunk:

ddtrace.auto 1.930 ms (0.81%)
ddtrace.bootstrap.sitecustomize 1.256 ms (0.52%)
ddtrace.bootstrap.preload 1.256 ms (0.52%)
ddtrace.internal.remoteconfig.client 0.640 ms (0.27%)
ddtrace 0.674 ms (0.28%)
ddtrace.internal._unpatched 0.022 ms (0.01%)

@pr-commenter
Copy link

pr-commenter bot commented May 28, 2025

Benchmarks

Benchmark execution time: 2025-05-30 22:04:44

Comparing candidate commit 209e80a in PR branch taegyunkim/uwsgi-test with baseline commit 7425abd in branch main.

Found 0 performance improvements and 2 performance regressions! Performance is the same for 501 metrics, 5 unstable metrics.

scenario:iastaspectssplit-splitlines_aspect

  • 🟥 execution_time [+133.000ns; +160.642ns] or [+9.099%; +10.990%]

scenario:telemetryaddmetric-1-gauge-metric-1-times

  • 🟥 execution_time [+187.623ns; +247.063ns] or [+8.856%; +11.662%]

@taegyunkim taegyunkim changed the title do not merge: main with uwsgi test chore: static singleton upload config to address issues with uwsgi May 28, 2025
@taegyunkim taegyunkim changed the title chore: static singleton upload config to address issues with uwsgi fix: static singleton upload config to address issues with uwsgi May 28, 2025
@taegyunkim taegyunkim changed the title fix: static singleton upload config to address issues with uwsgi chore: leak uploader config to address crashes on uwsgi tests May 28, 2025
@taegyunkim taegyunkim added Profiling Continous Profling changelog/no-changelog A changelog entry is not required for this PR. labels May 28, 2025
@taegyunkim taegyunkim changed the title chore: leak uploader config to address crashes on uwsgi tests chore: leak uploader config to fix crashes on uwsgi tests May 28, 2025
Copy link
Contributor

@nsrip-dd nsrip-dd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Generally looks good to me, assuming CI is green. Just left a few non-blocking comments/remarks.


static UploaderConfig* get_instance()
{
if (instance == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine as is, but could be trouble if we're accessing the uploader config concurrently. I don't think that's the case today, but just something to be aware of.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought so too. I think we're ok as we're calling uploader_builder from a single thread, but then I also wondered why we would need a mutex here https://github.com/DataDog/dd-trace-py/blame/58bd69c5e0273349ebecc09c0d382a4e742ef2a5/ddtrace/internal/datadog/profiling/dd_wrapper/include/uploader_builder.hpp#L16

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked to closely, but I guess we need that lock for now if we're updating tags concurrently. But that should, I think, only happen after we've already accessed the config at least once when starting the profiler. WDYT?

@taegyunkim taegyunkim force-pushed the taegyunkim/uwsgi-test branch from 350aad9 to f3b81dd Compare May 29, 2025 22:12
taegyunkim added a commit that referenced this pull request Jun 3, 2025
**tl;dr**
1. Use `uwsgi>=2.0.30` which has
unbit/uwsgi#2726
2. When using `uwsgi<2.0.30` with `--lazy-apps`, also set
`--skip-atexit`, if you see crashes when workers exit.

-----------------------------------------------------

The following explanation happens on `uwsgi<2.0.30`

When `uwsgi` is configured with `--lazy-apps`, master doesn't load the
application, and simply `fork()` to create workers. Then, the workers
load the application. This is different from when it's not set. Without
`--lazy-apps`, master process first load the application then `fork()`.

dd-trace-py profiler uses Python stdlib `atexit` module to export the
profile when the process exits. The code run to export references global
static `std::string` variables. However, when `--lazy-apps` option is
set, they are destructed before when the profiler needs them to export
the profile, and this leads to an undefined behavior.

When `--lazy-apps` is not set the following sequence of events happens
1. uwsgi master process, initialize Python plugin, loads the
application.
3. The application calls `import ddtrace.profiling.auto` and this leads
all native extensions imported to the Python app. During the process,
the constructors for global static variables are called and their
destructors would be registered to be called by `atexit()` when the
program exits.
4. The application also registers `Profiler.stop()` function to `uwsgi`
python module's `atexit` attribute.
5. Then `uwsgi` calls `atexit(uwsgi_plugins_atexit)` to register each
plugins atexit callbacks, for Python `uwsgi_python_atexit()` function.
And the function will call `Py_Finalize()`
6. Master `fork()`s workers
7. When SIGTERM is received at the master, it sends SIGINT to its
workers
8. In each worker, `uwsgi_python_atexit()` is called, invoking
`Profiler.stop()` and `Py_Finalize()` is called, then destructors for
global static variables are called.

When `--lazy-apps` is set
1. uwsgi matser process still intializes Python interpreter, but skips
loading the application
3. Then `uwsgi` calls `atexit(uwsgi_plugins_atexit)` to register each
plugins atexit callbacks, for Python `uwsgi_python_atexit()` function.
And the function will call `Py_Finalize()`
4. Master `fork()`s workers
5. Workers load the application. The application calls `import
ddtrace.profiling.auto` and this leads all native extensions imported to
the Python app. During the process, the constructors for global static
variables are called and their destructors would be registered to be
called by C `atexit()` when the program exits.
6. When `--lazy-apps` is set, ddtrace simply uses Python's `atexit`
module to call `Profiler.stop()` when the program exits.
7. When SIGTERM is received at the master, it sends SIGINT to its
workers
8. In each worker, since global destructors are registered later than
`uwsgi_python_atexit()` , they're called first.
9. Then `Py_Finalize()` is called in part of `uwsgi_python_atexit()`
calling `Profiler.stop()`. Then it will access those destructed global
static strings leading to undefined behavior.

To workaround this I have tried three different approaches. 

1. #11125: Simply pass the needed `std::string` from Python side to
native when exporting pprof file for test
2. #13520: Leak the strings needed for exporting profile to prevent them
from destructed
3. #13535: Bundle all needed strings needed for exporting profile, and
create them whenever we try to upload.

The first approach only addresses issues from tests.

The second approach mimics what's suggested in
pytorch/pytorch#42125, but since there are a
few other global static variables we have throughout the profiler, this
still showed crashes from test.

And we got similar results from the third approach, as we did from the
second.


Another way to completely work around this problem is setting
`--skip-atexit` as suggested in
ansible/ansible-ai-connect-service#248.

When `--skip-atexit` is set, `uwsgi` uses `_exit()` instead of `exit()`,
and the former doesn't call any functions registered with `atexit()` or
`on_exit()`. So, the static variable destructors are never called, and
the app no longer crashes.

The profiler also doesn't flush the profile via `atexit()` when
`--skip-atexit` is set, but I'd argue losing a single profile at the end
is better than crashing the application even if the application is
exiting. So we'd need to require our customers to set `--skip-atexit`
when `--lazy-apps` is set.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
@taegyunkim
Copy link
Contributor Author

Closing per #13550

@taegyunkim taegyunkim closed this Jun 3, 2025
@taegyunkim taegyunkim deleted the taegyunkim/uwsgi-test branch June 3, 2025 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog/no-changelog A changelog entry is not required for this PR. Profiling Continous Profling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants