Skip to content

feat(profiler): async-signal-safety sanitizer (PROF-14763)#540

Open
jbachorik wants to merge 3 commits into
mainfrom
worktree-prof-14763-as-safety-sanitizer
Open

feat(profiler): async-signal-safety sanitizer (PROF-14763)#540
jbachorik wants to merge 3 commits into
mainfrom
worktree-prof-14763-as-safety-sanitizer

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented May 24, 2026

What does this PR do?:
Adds a runtime async-signal-safety sanitizer for the profiler's signal handlers. All 10 installed signal handlers are wrapped with a SignalHandlerScope RAII guard that tracks nesting depth in a thread-local counter. 11 known-unsafe APIs (Dictionary inserting lookups, FlightRecorder lifecycle methods, Mutex::lock) get DEBUG_ASSERT_NOT_IN_SIGNAL() calls that abort via write(2)+_exit(1) in debug/ASAN builds if reached from signal context. Release builds are unaffected (macro compiles to ((void)0)).

Motivation:
A number of profiler crashes in recent weeks all share the same root cause: async-signal-unsafe code reachable from signal handlers, caught only in production. Today the AS-safety rules exist only in contributor memory. This sanitizer enforces them in debug/ASAN builds so violations are caught at development time, not in production.

Additional Notes:
Two commits:

  • e9d70357PROF-14764: SignalHandlerScope RAII + DEBUG_ASSERT_NOT_IN_SIGNAL() macro + 3 unit tests
  • 16baae4aPROF-14765: assertions wired into 11 unsafe call sites

Recording::addThread was audited and confirmed signal-safe — it uses ThreadIdTable::insert (atomic CAS only).

The restoreSignalHandler in os_linux.cpp / os_macos.cpp (trivial SIGSEGV/SIGBUS fallback handlers) has no SignalHandlerScope — their bodies are single-statement and call none of the asserted APIs, so no false-positive risk today. Tracked as a follow-up for completeness.

How to test the change?:

  • ./gradlew :ddprof-lib:gtestDebug — full debug suite must pass with no assertion fires
  • ./gradlew :ddprof-lib:gtestDebug_signalSafety_ut — 3 unit tests covering depth symmetry and nesting
  • ./gradlew :ddprof-lib:gtestDebug_dictionary_concurrent_ut — concurrent signal-context bounded_lookup + dump-thread clear; assertions in inserting overloads must not fire on the signal side

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-14763

@jbachorik jbachorik added the AI label May 24, 2026
@datadog-official

This comment has been minimized.

@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 25, 2026

CI Test Results

Run: #26399621792 | Commit: ced3abf | Duration: 3m 40s (longest job)

32 of 32 test jobs failed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Failed Tests

musl-amd64/debug / 17-librca

Job: View logs

No detailed failure information available. Check the job logs.

musl-aarch64/debug / 8-librca

Job: View logs

No detailed failure information available. Check the job logs.

musl-amd64/debug / 11-librca

Job: View logs

No detailed failure information available. Check the job logs.

musl-aarch64/debug / 21-librca

Job: View logs

No detailed failure information available. Check the job logs.

musl-amd64/debug / 21-librca

Job: View logs

No detailed failure information available. Check the job logs.

musl-aarch64/debug / 11-librca

Job: View logs

No detailed failure information available. Check the job logs.

musl-amd64/debug / 25-librca

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 17-graal

Job: View logs

No detailed failure information available. Check the job logs.

musl-aarch64/debug / 17-librca

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 17-j9

Job: View logs

No detailed failure information available. Check the job logs.

musl-amd64/debug / 8-librca

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 21

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 8-j9

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 11-j9

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 25

Job: View logs

No detailed failure information available. Check the job logs.

musl-aarch64/debug / 25-librca

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 8-ibm

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 17

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 21-graal

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 11-j9

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 8-orcl

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 11

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 17

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 25

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 21

Job: View logs

No detailed failure information available. Check the job logs.

glibc-aarch64/debug / 25-graal

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 8

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 17-j9

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 21-graal

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 25-graal

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 8-j9

Job: View logs

No detailed failure information available. Check the job logs.

glibc-amd64/debug / 17-graal

Job: View logs

No detailed failure information available. Check the job logs.

Summary: Total: 32 | Passed: 0 | Failed: 32


Updated: 2026-05-25 12:12:23 UTC

@jbachorik jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch 2 times, most recently from 5e1f30d to 23b1e2d Compare May 25, 2026 11:13
@jbachorik jbachorik marked this pull request as ready for review May 25, 2026 11:38
@jbachorik jbachorik requested a review from a team as a code owner May 25, 2026 11:38
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 23b1e2d147

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/libraries.h Outdated
@jbachorik jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch 2 times, most recently from a3036cb to 00d3527 Compare May 25, 2026 12:14
jbachorik and others added 3 commits May 25, 2026 15:10
Adds a runtime sanitizer for the profiler's signal handlers via a
thread-local depth counter, an RAII guard at each handler entry, and
DEBUG_ASSERT_NOT_IN_SIGNAL() at known async-signal-unsafe APIs.  Aborts
with a file:line diagnostic and writes /tmp/signal-safety-violation.txt
for CI to upload as an artifact.

Three macros encapsulate the depth counter — production code never
touches the counter directly:

  SIGNAL_HANDLER_GUARD()                — RAII increment/decrement
  SIGNAL_HANDLER_GUARD_RELEASE()        — manual early release before
                                          chaining to handlers that may
                                          longjmp through us
  SIGNAL_HANDLER_UNWIND_AFTER_LONGJMP() — decrement at setjmp landing

All sanitizer machinery is compiled in only for debug/ASAN builds.  In
release builds the macros expand to no-ops and the TLS counter is not
defined — zero overhead on the stack-sampling hot path.

Wired into all 10 installed signal handlers (ITimer, ITimerJvmti,
CTimer, CTimerJvmti, PerfEvents, WallClockASGCT, WallClockJvmti,
segvHandler, busHandler, wakeupHandler).

Assertions placed at:
  Dictionary::lookup (inserting overloads), Dictionary::clear,
  Recording::writeClasses, FlightRecorder::recordDatadogSetting,
  FlightRecorder::recordHeapUsage, Mutex::lock.

Two longjmp-aware fixes:
  - J9 SIGSEGV null-pointer-check handler: segv/busHandler use
    GUARD_RELEASE() before chaining to orig_segvHandler/orig_busHandler,
    since the chained handler may siglongjmp and skip our destructor.
  - HotSpot checkFault longjmp: walkVM's setjmp landing uses
    UNWIND_AFTER_LONGJMP() to undo the increment from the unwound frame.

CI test_workflow.yml uploads /tmp/signal-safety-violation.txt as an
artifact when test jobs fail on glibc amd64/aarch64.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ntext

dlopen_hook was performing file I/O, malloc, and mutex acquisition
(parseLibraries, patch_sigaction, installHooks) potentially from
async-signal context — e.g. when the JVM's DWARF unwinder lazily loads
libgcc_s during stack walking on J9/aarch64, our PLT-patched dlopen
hook fires from within a signal handler.

Force libgcc_s.so.1 to load during Profiler::start() via a plain dlopen
on the init thread (Profiler::prewarmUnwinder).  Once libgcc_s is mapped,
the JVM's later resolve finds it already loaded and the lazy-load path
that would invoke our hook from signal context never runs.

The SONAME "libgcc_s.so.1" is hardcoded by necessity: the release build
links with -static-libgcc, so referencing _Unwind_Backtrace would not
materialize libgcc_s.so as a NEEDED dependency — only dlopen by SONAME
can map the shared object.  libgcc_s.so.1 has been the stable SONAME
since 2002; a bump would constitute a C++ ABI break.

With the lazy-load path closed, dlopen_hook can call Libraries::refresh()
unconditionally, so it always synchronously updates symbols and hooks
for newly mapped libraries.

Libraries::refresh() encapsulates updateSymbols + patch_sigaction +
installHooks + (optional) updateBuildIds.  remote_symbolication state
moves into Libraries via setRemoteSymbolication() so the refresh path
is self-contained and dlopen_hook doesn't need to reach into Profiler.

Also adds Mutex::tryLock() (wraps pthread_mutex_trylock, which is on
the POSIX async-signal-safe list) as a primitive available for future
deferred-path work.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
JUnit 5.10+ dropped Java 8 support; 5.12+ dropped Java 11; 6.x requires
Java 17.  The Java 8 and 11 CI targets (both HotSpot and J9) run the
Gradle test worker on the test JDK itself — the profiler attaches to its
own process — so JUnit Platform classes must be loadable on Java 8/11.

Revert libs.versions.toml to junit 5.9.2 / junit-platform 1.9.2 (the
last known-working pair).  Add Dependabot ignore rules to prevent
automated bumps past 5.9.x / 1.9.x.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the worktree-prof-14763-as-safety-sanitizer branch from 00d3527 to 9365886 Compare May 25, 2026 13:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant