feat(profiler): async-signal-safety sanitizer (PROF-14763) #540
jbachorik wants to merge 3 commits into
CI Test Results
Run: #26399621792 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled
Failed Tests
No detailed failure information is available for any of the jobs below; check the job logs.
- musl-amd64/debug / 17-librca
- musl-aarch64/debug / 8-librca
- musl-amd64/debug / 11-librca
- musl-aarch64/debug / 21-librca
- musl-amd64/debug / 21-librca
- musl-aarch64/debug / 11-librca
- musl-amd64/debug / 25-librca
- glibc-aarch64/debug / 17-graal
- musl-aarch64/debug / 17-librca
- glibc-aarch64/debug / 17-j9
- musl-amd64/debug / 8-librca
- glibc-aarch64/debug / 21
- glibc-aarch64/debug / 8-j9
- glibc-aarch64/debug / 11-j9
- glibc-aarch64/debug / 25
- musl-aarch64/debug / 25-librca
- glibc-amd64/debug / 8-ibm
- glibc-amd64/debug / 17
- glibc-aarch64/debug / 21-graal
- glibc-amd64/debug / 11-j9
- glibc-amd64/debug / 8-orcl
- glibc-amd64/debug / 11
- glibc-aarch64/debug / 17
- glibc-amd64/debug / 25
- glibc-amd64/debug / 21
- glibc-aarch64/debug / 25-graal
- glibc-amd64/debug / 17-j9
- glibc-amd64/debug / 21-graal
- glibc-amd64/debug / 25-graal
- glibc-amd64/debug / 8-j9
- glibc-amd64/debug / 17-graal
Summary: Total: 32 | Passed: 0 | Failed: 32
Updated: 2026-05-25 12:12:23 UTC
Force-pushed from 5e1f30d to 23b1e2d.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 23b1e2d147
Force-pushed from a3036cb to 00d3527.
Adds a runtime sanitizer for the profiler's signal handlers via a
thread-local depth counter, an RAII guard at each handler entry, and
DEBUG_ASSERT_NOT_IN_SIGNAL() at known async-signal-unsafe APIs. Aborts
with a file:line diagnostic and writes /tmp/signal-safety-violation.txt
for CI to upload as an artifact.
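The abort path described above could look roughly like the sketch below. It is not the code from this change: the helper names (append_str, append_uint, format_violation, report_violation_and_exit), the message wording, and the open() flags are all assumptions made for illustration. The point it shows is that the diagnostic is built without snprintf or malloc and emitted with open/write/close/_exit only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: append a NUL-terminated string, never overflowing buf.
static size_t append_str(char* buf, size_t pos, size_t cap, const char* s) {
    while (*s != '\0' && pos + 1 < cap) buf[pos++] = *s++;
    return pos;
}

// Hypothetical helper: append a decimal number without calling snprintf.
static size_t append_uint(char* buf, size_t pos, size_t cap, unsigned value) {
    char digits[16];
    size_t n = 0;
    do {
        digits[n++] = static_cast<char>('0' + value % 10);
        value /= 10;
    } while (value != 0);
    while (n > 0 && pos + 1 < cap) buf[pos++] = digits[--n];
    return pos;
}

// Builds "signal-safety violation at <file>:<line>\n" (wording is assumed).
// cap must be at least 1; returns the length excluding the terminating NUL.
size_t format_violation(char* buf, size_t cap, const char* file, unsigned line) {
    size_t pos = 0;
    pos = append_str(buf, pos, cap, "signal-safety violation at ");
    pos = append_str(buf, pos, cap, file);
    pos = append_str(buf, pos, cap, ":");
    pos = append_uint(buf, pos, cap, line);
    pos = append_str(buf, pos, cap, "\n");
    buf[pos] = '\0';
    return pos;
}

// Writes the diagnostic to stderr and to the artifact file, then exits.
// Only open/write/close/_exit are used on this path.
[[noreturn]] void report_violation_and_exit(const char* file, unsigned line) {
    char msg[256];
    size_t len = format_violation(msg, sizeof(msg), file, line);
    ssize_t ignored = write(STDERR_FILENO, msg, len);
    int fd = open("/tmp/signal-safety-violation.txt",
                  O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd >= 0) {
        ignored = write(fd, msg, len);
        close(fd);
    }
    (void)ignored;
    _exit(1);
}

// Checks the formatter only; the exit path would end the process.
void format_self_check() {
    char buf[64];
    size_t n = format_violation(buf, sizeof(buf), "dictionary.cpp", 42);
    assert(std::strcmp(buf, "signal-safety violation at dictionary.cpp:42\n") == 0);
    assert(n == std::strlen(buf));
}
```

Keeping formatting separate from the exit path makes the message testable without terminating the test process.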
Three macros encapsulate the depth counter — production code never
touches the counter directly:
  SIGNAL_HANDLER_GUARD(): RAII increment/decrement
  SIGNAL_HANDLER_GUARD_RELEASE(): manual early release before chaining to
      handlers that may longjmp through us
  SIGNAL_HANDLER_UNWIND_AFTER_LONGJMP(): decrement at setjmp landing
All sanitizer machinery is compiled in only for debug/ASAN builds. In
release builds the macros expand to no-ops and the TLS counter is not
defined — zero overhead on the stack-sampling hot path.
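A minimal sketch of how a thread-local depth counter, an RAII guard, and the macros can fit together is shown below. It is an illustration under assumptions, not the code from this change: the SKETCH_SIGNAL_SANITIZER switch, the g_signal_depth and g_violations names, and the guard's internals are hypothetical, and the assertion only counts violations here instead of aborting so that the example can be run.

```cpp
#include <cassert>

// Hypothetical switch; a real build would derive it from debug/ASAN flags.
#define SKETCH_SIGNAL_SANITIZER 1

#if SKETCH_SIGNAL_SANITIZER
static thread_local int g_signal_depth = 0;  // assumed counter name
static int g_violations = 0;                 // sketch-only stand-in for abort

class SignalHandlerScope {
public:
    SignalHandlerScope() { ++g_signal_depth; }
    ~SignalHandlerScope() { release(); }
    SignalHandlerScope(const SignalHandlerScope&) = delete;
    SignalHandlerScope& operator=(const SignalHandlerScope&) = delete;
    // Early release; the destructor then becomes a no-op.
    void release() {
        if (_active) {
            --g_signal_depth;
            _active = false;
        }
    }
private:
    bool _active = true;
};

#define SIGNAL_HANDLER_GUARD() SignalHandlerScope _signal_scope
#define SIGNAL_HANDLER_GUARD_RELEASE() _signal_scope.release()
#define SIGNAL_HANDLER_UNWIND_AFTER_LONGJMP() \
    do { if (g_signal_depth > 0) --g_signal_depth; } while (0)
#define DEBUG_ASSERT_NOT_IN_SIGNAL() \
    do { if (g_signal_depth > 0) ++g_violations; } while (0)
#else
// Release expansion: no counter, no guard object, no code generated.
#define SIGNAL_HANDLER_GUARD() ((void)0)
#define SIGNAL_HANDLER_GUARD_RELEASE() ((void)0)
#define SIGNAL_HANDLER_UNWIND_AFTER_LONGJMP() ((void)0)
#define DEBUG_ASSERT_NOT_IN_SIGNAL() ((void)0)
#endif

// Stand-in for an async-signal-unsafe API such as an inserting lookup.
void unsafe_api() { DEBUG_ASSERT_NOT_IN_SIGNAL(); }

// Stand-ins for handlers; the outer one nests the inner one.
void inner_handler() {
    SIGNAL_HANDLER_GUARD();
    unsafe_api();
}

void outer_handler() {
    SIGNAL_HANDLER_GUARD();
    inner_handler();
    SIGNAL_HANDLER_GUARD_RELEASE();
}

#if SKETCH_SIGNAL_SANITIZER
void sketch_self_check() {
    unsafe_api();                   // outside any handler: no violation
    assert(g_violations == 0);
    inner_handler();                // depth 1 while unsafe_api runs
    assert(g_violations == 1);
    assert(g_signal_depth == 0);    // symmetric after the guard is destroyed
    outer_handler();                // nested guards, then early release
    assert(g_violations == 2);
    assert(g_signal_depth == 0);
}
#endif
```

Because every access to the counter goes through the macros, flipping the switch to 0 removes the counter and the guard class from the translation unit entirely.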
Wired into all 10 installed signal handlers (ITimer, ITimerJvmti,
CTimer, CTimerJvmti, PerfEvents, WallClockASGCT, WallClockJvmti,
segvHandler, busHandler, wakeupHandler).
Assertions placed at:
Dictionary::lookup (inserting overloads), Dictionary::clear,
Recording::writeClasses, FlightRecorder::recordDatadogSetting,
FlightRecorder::recordHeapUsage, Mutex::lock.
Two longjmp-aware fixes:
- J9 SIGSEGV null-pointer-check handler: segv/busHandler use
GUARD_RELEASE() before chaining to orig_segvHandler/orig_busHandler,
since the chained handler may siglongjmp and skip our destructor.
- HotSpot checkFault longjmp: walkVM's setjmp landing uses
UNWIND_AFTER_LONGJMP() to undo the increment from the unwound frame.
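The two longjmp cases can be illustrated with the sketch below. It is a simplified model with assumed names (g_depth, g_landing, faulting_frame, chained_original), not the profiler's handlers. Plain increments stand in for the RAII guard so that the example itself never jumps over an object with a non-trivial destructor, which C++ leaves undefined.

```cpp
#include <cassert>
#include <csetjmp>

static thread_local int g_depth = 0;  // hypothetical stand-in for the TLS counter
static std::jmp_buf g_landing;

// Case 1: a frame that incremented the depth is abandoned by longjmp,
// so its matching decrement never runs.
[[noreturn]] void faulting_frame() {
    ++g_depth;
    std::longjmp(g_landing, 1);
}

// The setjmp landing undoes the increment, which is the role the commit
// message gives to SIGNAL_HANDLER_UNWIND_AFTER_LONGJMP().
int walk_with_recovery() {
    if (setjmp(g_landing) == 0) {
        faulting_frame();
    }
    if (g_depth > 0) --g_depth;
    return 1;
}

// Case 2: a chained handler that jumps away instead of returning.
void chained_original() { std::longjmp(g_landing, 2); }

// Our handler drops its own count before chaining, which is the role of
// SIGNAL_HANDLER_GUARD_RELEASE(); nothing is left to undo if orig jumps.
void our_handler(void (*orig)()) {
    ++g_depth;  // what the guard constructor would do
    --g_depth;  // early release before chaining
    orig();
}

int chain_with_release() {
    if (setjmp(g_landing) == 0) {
        our_handler(chained_original);
        return 0;  // chained handler returned normally
    }
    return 1;      // chained handler jumped back here
}

void longjmp_self_check() {
    assert(walk_with_recovery() == 1);
    assert(g_depth == 0);
    assert(chain_with_release() == 1);
    assert(g_depth == 0);
}
```

In both cases the counter is back at zero once control reaches the landing site, so a later assertion on the same thread does not fire spuriously.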
CI test_workflow.yml uploads /tmp/signal-safety-violation.txt as an
artifact when test jobs fail on glibc amd64/aarch64.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ntext

dlopen_hook was performing file I/O, malloc, and mutex acquisition
(parseLibraries, patch_sigaction, installHooks) potentially from
async-signal context. For example, when the JVM's DWARF unwinder lazily
loads libgcc_s during stack walking on J9/aarch64, our PLT-patched
dlopen hook fires from within a signal handler.

Force libgcc_s.so.1 to load during Profiler::start() via a plain dlopen
on the init thread (Profiler::prewarmUnwinder). Once libgcc_s is mapped,
the JVM's later resolve finds it already loaded and the lazy-load path
that would invoke our hook from signal context never runs.

The SONAME "libgcc_s.so.1" is hardcoded by necessity: the release build
links with -static-libgcc, so referencing _Unwind_Backtrace would not
materialize libgcc_s.so as a NEEDED dependency; only dlopen by SONAME
can map the shared object. libgcc_s.so.1 has been the stable SONAME
since 2002; a bump would constitute a C++ ABI break.

With the lazy-load path closed, dlopen_hook can call
Libraries::refresh() unconditionally, so it always synchronously updates
symbols and hooks for newly mapped libraries. Libraries::refresh()
encapsulates updateSymbols + patch_sigaction + installHooks +
(optional) updateBuildIds. remote_symbolication state moves into
Libraries via setRemoteSymbolication() so the refresh path is
self-contained and dlopen_hook doesn't need to reach into Profiler.

Also adds Mutex::tryLock() (wraps pthread_mutex_trylock, which never
blocks but is not on the POSIX async-signal-safe list) as a primitive
available for future deferred-path work.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
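A tryLock primitive of the kind mentioned above might be sketched as follows. The class layout and the try_deferred_work helper are assumptions for illustration and are not taken from this change; only pthread_mutex_trylock and its return convention (0 on success, EBUSY when the mutex is already held) are standard. The sketch shows the non-blocking behaviour only and makes no claim about signal safety.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical minimal wrapper; the profiler's Mutex class may differ.
class Mutex {
public:
    Mutex() { pthread_mutex_init(&_mutex, nullptr); }
    ~Mutex() { pthread_mutex_destroy(&_mutex); }
    Mutex(const Mutex&) = delete;
    Mutex& operator=(const Mutex&) = delete;

    void lock() { pthread_mutex_lock(&_mutex); }
    void unlock() { pthread_mutex_unlock(&_mutex); }

    // Returns true if the lock was acquired; never blocks.
    bool tryLock() { return pthread_mutex_trylock(&_mutex) == 0; }

private:
    pthread_mutex_t _mutex;
};

// Hypothetical deferred-path pattern: do the work only if the lock is
// free right now, otherwise report that it has to be retried later.
bool try_deferred_work(Mutex& mutex, int& work_done) {
    if (!mutex.tryLock()) {
        return false;
    }
    ++work_done;
    mutex.unlock();
    return true;
}

void trylock_self_check() {
    Mutex mutex;
    int work_done = 0;
    assert(try_deferred_work(mutex, work_done));   // lock is free
    assert(work_done == 1);
    mutex.lock();
    assert(!try_deferred_work(mutex, work_done));  // already held: skipped
    assert(work_done == 1);
    mutex.unlock();
}
```

The caller decides what "retry later" means; the primitive only guarantees that a contended lock never stalls the calling thread.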
JUnit 5.10+ dropped Java 8 support; 5.12+ dropped Java 11; 6.x requires
Java 17. The Java 8 and 11 CI targets (both HotSpot and J9) run the
Gradle test worker on the test JDK itself (the profiler attaches to its
own process), so JUnit Platform classes must be loadable on Java 8/11.

Revert libs.versions.toml to junit 5.9.2 / junit-platform 1.9.2 (the
last known-working pair). Add Dependabot ignore rules to prevent
automated bumps past 5.9.x / 1.9.x.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 00d3527 to 9365886.
What does this PR do?:
Adds a runtime async-signal-safety sanitizer for the profiler's signal handlers. All 10 installed signal handlers are wrapped with a SignalHandlerScope RAII guard that tracks nesting depth in a thread-local counter. Eleven known-unsafe APIs (Dictionary inserting lookups, FlightRecorder lifecycle methods, Mutex::lock) get DEBUG_ASSERT_NOT_IN_SIGNAL() calls that abort via write(2) + _exit(1) in debug/ASAN builds if reached from signal context. Release builds are unaffected (the macro compiles to ((void)0)).
Motivation:
A number of profiler crashes in recent weeks all share the same root cause: async-signal-unsafe code reachable from signal handlers, caught only in production. Today the AS-safety rules exist only in contributor memory. This sanitizer enforces them in debug/ASAN builds so violations are caught at development time, not in production.
Additional Notes:
Two commits:
- e9d70357 (PROF-14764): SignalHandlerScope RAII + DEBUG_ASSERT_NOT_IN_SIGNAL() macro + 3 unit tests
- 16baae4a (PROF-14765): assertions wired into 11 unsafe call sites
Recording::addThread was audited and confirmed signal-safe: it uses ThreadIdTable::insert (atomic CAS only).
The restoreSignalHandler fallbacks in os_linux.cpp/os_macos.cpp (trivial SIGSEGV/SIGBUS fallback handlers) have no SignalHandlerScope. Their bodies are single-statement and call none of the asserted APIs, so the missing guard cannot hide a violation today. Tracked as a follow-up for completeness.
How to test the change?:
- ./gradlew :ddprof-lib:gtestDebug (full debug suite must pass with no assertion fires)
- ./gradlew :ddprof-lib:gtestDebug_signalSafety_ut (3 unit tests covering depth symmetry and nesting)
- ./gradlew :ddprof-lib:gtestDebug_dictionary_concurrent_ut (concurrent signal-context bounded_lookup + dump-thread clear; assertions in inserting overloads must not fire on the signal side)
For Datadog employees:
@DataDog/security-design-and-guidance.