fix(profiler): lock-free class/endpoint/context maps via StringDictionary #524
jbachorik wants to merge 71 commits into
CI Test Results
Run: #26356468123 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled
Summary: Total: 54 | Passed: 54 | Failed: 0 | Updated: 2026-05-24 18:56:48 UTC
2f23bab to 132b472
Production crash (SIGSEGV) in Recording::cleanupUnreferencedMethods, first seen in dd-trace-java 1.56.1 after PR #327 introduced method-map cleanup.

cleanupUnreferencedMethods() was called after finishChunk() released the GetLoadedClasses pins, so jvmti->Deallocate(_ptr) inside ~SharedLineNumberTable could access freed line number table memory on JVMs that reclaim JVMTI allocations on class unload.

Fix: detach SharedLineNumberTable from JVMTI lifetime by copying the table into a malloc'd buffer in Lookup::fillJavaMethodInfo() and freeing the JVMTI allocation immediately. The SharedLineNumberTable destructor now calls free().

As defence-in-depth, finishChunk() gains a do_cleanup parameter so cleanup runs inside the GetLoadedClasses pin window.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
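A minimal sketch of the ownership change, assuming a simplified entry type: the table is copied into a malloc'd buffer and the JVMTI allocation is released right away, so the destructor only ever calls free(). LineEntry, fake_jvmti_deallocate(), SharedLineNumberTableSketch and adoptLineTable() are hypothetical stand-ins (the real code goes through jvmti->Deallocate() and jvmtiLineNumberEntry), chosen so the example runs without a JVM.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for jvmtiLineNumberEntry (same two fields in jvmti.h).
struct LineEntry {
    long long start_location;
    int line_number;
};

// Hypothetical stand-in for jvmti->Deallocate(); counts calls so the sketch is testable.
static int g_jvmti_deallocs = 0;
static void fake_jvmti_deallocate(void* p) {
    ++g_jvmti_deallocs;
    std::free(p);
}

// Owns a malloc'd copy, so its lifetime no longer depends on the JVMTI allocation.
class SharedLineNumberTableSketch {
public:
    SharedLineNumberTableSketch(const LineEntry* jvmti_table, int count)
        : _size(count), _ptr(nullptr) {
        if (count > 0) {
            _ptr = static_cast<LineEntry*>(std::malloc(sizeof(LineEntry) * count));
            if (_ptr != nullptr) {
                std::memcpy(_ptr, jvmti_table, sizeof(LineEntry) * count);
            } else {
                _size = 0;  // allocation failed: behave as an empty table
            }
        }
    }
    ~SharedLineNumberTableSketch() { std::free(_ptr); }  // free(), never Deallocate()
    SharedLineNumberTableSketch(const SharedLineNumberTableSketch&) = delete;
    SharedLineNumberTableSketch& operator=(const SharedLineNumberTableSketch&) = delete;

    int size() const { return _size; }
    const LineEntry* entries() const { return _ptr; }

private:
    int _size;
    LineEntry* _ptr;
};

// Models the fillJavaMethodInfo() step: copy first, then release the JVMTI buffer immediately.
static SharedLineNumberTableSketch* adoptLineTable(LineEntry* jvmti_table, int count) {
    SharedLineNumberTableSketch* table = new SharedLineNumberTableSketch(jvmti_table, count);
    fake_jvmti_deallocate(jvmti_table);
    return table;
}
```

Because the copy no longer aliases JVMTI memory, the order of destruction relative to finishChunk() and class unload stops mattering for this object.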
On musl/aarch64/JDK11, HotSpot's deoptimisation blob
(generate_deopt_blob in sharedRuntime_aarch64.cpp) rebuilds interpreter
frames near the compiled frame's stack boundary, corrupting the top
~224 bytes of the thread stack where start_routine_wrapper_spec's frame
lives. Two crashes follow:
(a) -fstack-protector-strong inserts a canary into any frame with a
non-trivially-destructed local (e.g. struct Cleanup); the canary
lands in the corruption zone and fires __stack_chk_fail.
(b) Even without a canary, 'return' loads the corrupted saved LR and
jumps to a garbage address.
Fix: no_stack_protector removes the canary; pthread_exit() replaces
'return' so LR is never used; cleanup is performed explicitly with
the tid read from TLS (ProfiledThread::currentTid()), which survives
frame corruption.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
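A rough sketch of the shape of this fix, assuming a simplified wrapper: no locals with non-trivial destructors, the tid re-read from TLS for cleanup, and pthread_exit() instead of return so the saved LR is never loaded. WrapperArgs, t_tid, release_thread() and start_routine_wrapper_sketch() are hypothetical names. The NO_STACK_PROTECTOR macro assumes a compiler that understands __attribute__((no_stack_protector)) (recent GCC and clang) and expands to nothing otherwise. The sketch shows the control flow only; it does not reproduce the deoptimisation corruption.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>

#if defined(__has_attribute)
#  if __has_attribute(no_stack_protector)
#    define NO_STACK_PROTECTOR __attribute__((no_stack_protector))
#  endif
#endif
#ifndef NO_STACK_PROTECTOR
#  define NO_STACK_PROTECTOR
#endif

// Hypothetical per-thread id kept in TLS, standing in for ProfiledThread::currentTid().
static thread_local int t_tid = -1;
static std::atomic<int> g_released_tid{-1};

struct WrapperArgs {  // hypothetical
    void* (*routine)(void*);
    void* arg;
    int tid;
};

static void release_thread(int tid) { g_released_tid.store(tid); }

// No locals with destructors, so no canary-bearing cleanup object lives in this frame.
// The function never returns: the result is handed to pthread_exit().
NO_STACK_PROTECTOR static void* start_routine_wrapper_sketch(void* p) {
    WrapperArgs* a = static_cast<WrapperArgs*>(p);
    t_tid = a->tid;
    void* result = a->routine(a->arg);
    release_thread(t_tid);  // tid re-read from TLS, not from the (possibly corrupted) frame
    pthread_exit(result);
    __builtin_unreachable();
}
```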
7844134 to b90761e
JavaThread::~JavaThread / OSThread::~OSThread crashed on JDK 25 when SIGVTALRM arrived on a thread started through the ddprof pthread_create hook between Profiler::unregisterThread() returning and ProfiledThread::release() acquiring its internal guard. The signal handler called currentSignalSafe() and dereferenced the now-freed ProfiledThread.

Fix: extract unregister_and_release(tid), a noinline helper that holds a SignalBlocker for the entire unregister+release sequence. Both start_routine_wrapper and start_routine_wrapper_spec invoke it, so the race window is eliminated without duplicating signal-masking logic. The same SignalBlocker pattern is applied to the pthread_setspecific_hook teardown path in perfEvents_linux.cpp.

thread.h guards clearCurrentThreadTLS() with #ifdef UNIT_TEST so it is absent from production builds; GtestTaskBuilder.kt adds -DUNIT_TEST to the gtest compiler flags so the guarded method compiles in tests.

thread_teardown_safety_ut.cpp adds an acceptance-test suite (ThreadTeardownSafetyTest T-01..T-10) covering the full teardown lifecycle under signal load.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
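The SignalBlocker idea can be sketched as an RAII guard over pthread_sigmask(). Here it masks only SIGVTALRM for the duration of a hypothetical unregister-then-release pair, which is the window described above. SignalBlockerSketch, unregister_thread_sketch(), release_profiled_thread_sketch() and unregister_and_release_sketch() are illustrative names, not the profiler's actual API, and the real blocker may mask a different signal set.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// RAII guard: blocks one signal for the calling thread, restores the previous mask on exit.
class SignalBlockerSketch {
public:
    explicit SignalBlockerSketch(int sig) {
        sigset_t block;
        sigemptyset(&block);
        sigaddset(&block, sig);
        pthread_sigmask(SIG_BLOCK, &block, &_old);
    }
    ~SignalBlockerSketch() { pthread_sigmask(SIG_SETMASK, &_old, nullptr); }
    SignalBlockerSketch(const SignalBlockerSketch&) = delete;
    SignalBlockerSketch& operator=(const SignalBlockerSketch&) = delete;

private:
    sigset_t _old;
};

// Queries the current thread's mask without changing it (set == nullptr).
static bool is_blocked(int sig) {
    sigset_t cur;
    pthread_sigmask(SIG_BLOCK, nullptr, &cur);
    return sigismember(&cur, sig) == 1;
}

// Hypothetical teardown steps; each records whether SIGVTALRM was masked while it ran.
static int g_unregistered = 0;
static int g_released = 0;
static bool g_masked_throughout = true;

static void unregister_thread_sketch(int) {
    ++g_unregistered;
    g_masked_throughout = g_masked_throughout && is_blocked(SIGVTALRM);
}

static void release_profiled_thread_sketch(int) {
    ++g_released;
    g_masked_throughout = g_masked_throughout && is_blocked(SIGVTALRM);
}

// SIGVTALRM cannot be delivered between the two steps.
__attribute__((noinline)) static void unregister_and_release_sketch(int tid) {
    SignalBlockerSketch guard(SIGVTALRM);
    unregister_thread_sketch(tid);
    release_profiled_thread_sketch(tid);
}
```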
b90761e to 76d919d
ASSERT_NE expands to a bare 'return;' on failure, which is a compile error in a function whose return type is void*. Use ADD_FAILURE + explicit 'return nullptr;' instead. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ufferedDictionary

Replaces the SpinLock-guarded Dictionary instances for _class_map, _string_label_map, and _context_value_map with a new TripleBufferedDictionary that eliminates all locking from the read/write fast paths.

TripleBufferedDictionary holds three Dictionary buffers cycling through three roles via a generic TripleBufferRotator<T> template:
- active: receives new writes (signal handlers + JNI threads), lock-free via CAS
- dump: snapshot being read by the dump thread; promoted from the old active on rotate()
- scratch: two rotations behind active; ready to be cleared lock-free

The scratch role exists for safe lock-free reclamation: when a buffer enters that role, at least one full dump cycle has elapsed since it last served as the active buffer, and the dump that read it has finished. That grace period is far longer than any signal handler or JNI thread can plausibly keep using a stale active pointer, so the buffer can be freed without any explicit drain.

bounded_lookup(size_limit=0) is signal-safe (no malloc) and checks only the active buffer, with no fallback to older snapshots.

Dead code removed:
- _class_map_lock (SpinLock)
- classMapSharedGuard() / classMapTrySharedGuard() on Profiler
- tryLockSharedBounded() / BoundedOptionalSharedLockGuard on SpinLock
- spinlock_bounded_ut.cpp / dictionary_concurrent_ut.cpp (subsumed by dictionary_ut.cpp)

Motivation: three production crashes (fingerprint v10.DAECC680F0728EAB44F26DB0B91B703F) showed SIGSEGV in std::_Rb_tree_increment via writeCpool → writeClasses → Dictionary::collect, caused by a race between writeClasses and a concurrent Dictionary::clear(). PR #516 patched it with a shared lock whose bounded CAS retries were exhausted under heavy 100 µs wall-clock load on aarch64, causing class lookups to return -1 and corrupting JFR recordings. This change eliminates the lock entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
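The role arithmetic of a three-buffer rotator fits in a few lines. This hypothetical TripleBufferRotatorSketch<T> derives all three roles from one atomic index i (active = i, scratch = i + 1, dump = i + 2, all mod 3), so a single rotate() performs old active → dump, scratch → active, old dump → scratch at once. It illustrates the rotation scheme only; it is not the PR's TripleBufferRotator<T>, and it does nothing about readers that loaded active() before a rotation.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical three-role rotator. Roles derive from one atomic index i in [0, 3):
//   active = i, scratch = i + 1, dump = i + 2 (mod 3)
template <typename T>
class TripleBufferRotatorSketch {
public:
    TripleBufferRotatorSketch(T* a, T* b, T* c) : _buf{a, b, c}, _active(0) {}

    T* active() const { return _buf[idx(0)]; }   // receives new writes
    T* scratch() const { return _buf[idx(1)]; }  // was active two rotations ago; safe to clear
    T* dump() const { return _buf[idx(2)]; }     // previous active; read by the dump thread

    // old active -> dump, scratch -> active, old dump -> scratch
    void rotate() {
        int cur = _active.load(std::memory_order_acquire);
        while (!_active.compare_exchange_weak(cur, (cur + 1) % 3,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
            // a failed CAS reloads cur; retry with the fresh value
        }
    }

private:
    T* _buf[3];
    std::atomic<int> _active;

    int idx(int offset) const {
        return (_active.load(std::memory_order_acquire) + offset) % 3;
    }
};
```

Each accessor reads one consistent role assignment instead of three separately updated pointers. With a single rotating thread a plain store would suffice; the CAS loop only matters if rotate() could race with itself.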
76d919d to 2105b61
- libraryPatcher_linux.cpp:151: add __builtin_unreachable() after pthread_exit()
- flightRecorder.cpp:607: update the pin-window comment to reflect malloc'd ownership
- CleanupAfterClassUnloadTest.java:51: fix the Javadoc to describe both fix mechanisms
- thread_teardown_safety_ut.cpp:43: add SigGuard RAII to restore signal dispositions
- thread_teardown_safety_ut.cpp:236: gate T-07 on __GLIBC__; add a musl cleanup path

Co-Authored-By: muse <muse@noreply>
Agent-Logs-Url: https://github.com/DataDog/java-profiler/sessions/127448f3-9624-49bf-92f2-850b4f413a92
Co-authored-by: jbachorik <738413+jbachorik@users.noreply.github.com>
- Add pthread_cleanup_push/pop + a noinline cleanup_unregister() to start_routine_wrapper_spec so ProfiledThread is released when the wrapped routine calls pthread_exit() or the thread is canceled
- Extend CleanupAfterClassUnloadTest from AbstractDynamicClassTest to reuse the generateClassBytecode/IsolatedClassLoader/tempFile helpers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
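A minimal sketch of hosting the cleanup pair in its own noinline frame, so the cleanup record is not stored in the wrapper's frame and the handler fires on normal return, on pthread_exit(), and on cancellation. cleanup_unregister_sketch(), run_with_cleanup_sketch(), RunArgs and thread_entry_sketch() are hypothetical names. Note that pthread_cleanup_push() and pthread_cleanup_pop() are macros that must appear as a pair in the same lexical scope.

```cpp
#include <atomic>
#include <cassert>
#include <pthread.h>

static std::atomic<int> g_cleanup_calls{0};

// Hypothetical stand-in for cleanup_unregister(): releases per-thread state.
__attribute__((noinline)) static void cleanup_unregister_sketch(void* /*arg*/) {
    g_cleanup_calls.fetch_add(1);
}

// Hosts the push/pop pair in its own frame.
__attribute__((noinline)) static void* run_with_cleanup_sketch(void* (*routine)(void*), void* arg) {
    void* result = nullptr;
    pthread_cleanup_push(cleanup_unregister_sketch, nullptr);
    result = routine(arg);   // may call pthread_exit() or reach a cancellation point
    pthread_cleanup_pop(1);  // 1: also run the handler when routine() returns normally
    return result;
}

struct RunArgs {  // hypothetical
    void* (*routine)(void*);
    void* arg;
};

static void* thread_entry_sketch(void* p) {
    RunArgs* a = static_cast<RunArgs*>(p);
    return run_with_cleanup_sketch(a->routine, a->arg);
}
```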
30a5959 to 6645120
- thread_teardown_safety_ut.cpp: guard the T-01 sentinel with ASSERT_NE(kNotYetRun) to short-circuit if the handler never ran; replace T-06 absolute mask assertions with relative before/after comparisons
- AbstractDynamicClassTest.java: split compound Label statements onto separate lines for Spotless/google-java-format compliance

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66451207c3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Pull request overview
Replaces the SpinLock-guarded Dictionary instances (_class_map, _string_label_map, _context_value_map) with a new TripleBufferedDictionary that rotates three Dictionary buffers (active / dump / scratch) under a generic TripleBufferRotator<T> template. Writes go to the active buffer lock-free; the dump thread reads standby() after rotate()/rotatePersistent() drains in-flight readers via RefCountGuard. The previous bounded shared-lock primitives in spinLock.h and the related classMap*Guard() factory methods are removed, and RefCountGuard is generalised (extracted to refCountGuard.h) to operate on void* so it can protect dictionaries as well as call-trace tables.
Changes:
- New TripleBufferRotator<T> + TripleBufferedDictionary (with rotate, rotatePersistent, clearStandby, clearAll) and a new Dictionary::mergeFrom used by rotatePersistent.
- Generalised RefCountGuard to void* and extracted it from callTraceStorage.h into refCountGuard.h; removed tryLockSharedBounded / BoundedOptionalSharedLockGuard and the _class_map_lock member.
- Wired Profiler::dump/stop/start to rotate / clearStandby / clearAll, switched flightRecorder.cpp and hotspotSupport.cpp to read standby() / call bounded_lookup() directly without locking, replaced old gtest suites with a new dictionary_ut.cpp, added a Java DictionaryRotationTest, and disabled ContendedCallTraceStorageTest on musl/aarch64.
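To illustrate how a guard keyed on void* lets one mechanism protect both call-trace tables and dictionaries, here is a hypothetical slot-based sketch: a reader publishes the buffer pointer it is using, and the rotating thread spins until no slot holds the old pointer. RefCountGuardSketch, kSlots and the slot-claiming scheme are assumptions rather than the PR's RefCountSlot layout. A production version also has to close the window between a reader loading the active pointer and publishing it (for example by re-checking the active pointer after publishing), which this sketch leaves out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical global slot table; static storage zero-initializes every slot to nullptr.
constexpr int kSlots = 64;
static std::atomic<void*> g_slots[kSlots];

class RefCountGuardSketch {
public:
    // Publishes 'resource' in the first free slot.
    explicit RefCountGuardSketch(void* resource) : _slot(-1) {
        for (int i = 0; i < kSlots; ++i) {
            void* expected = nullptr;
            if (g_slots[i].compare_exchange_strong(expected, resource, std::memory_order_acq_rel)) {
                _slot = i;
                break;
            }
        }
    }
    ~RefCountGuardSketch() {
        if (_slot >= 0) g_slots[_slot].store(nullptr, std::memory_order_release);
    }
    RefCountGuardSketch(const RefCountGuardSketch&) = delete;
    RefCountGuardSketch& operator=(const RefCountGuardSketch&) = delete;

    // false: no slot was free, so the caller must not touch the resource.
    bool ok() const { return _slot >= 0; }

    static bool inUse(void* resource) {
        for (int i = 0; i < kSlots; ++i) {
            if (g_slots[i].load(std::memory_order_acquire) == resource) return true;
        }
        return false;
    }

    // Writer side: spin until no reader still references 'resource'.
    static void waitForRefCountToClear(void* resource) {
        while (inUse(resource)) std::this_thread::yield();
    }

private:
    int _slot;
};
```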
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| ddprof-lib/src/main/cpp/tripleBuffer.h | New generic 3-buffer rotator with atomic CAS rotate. |
| ddprof-lib/src/main/cpp/refCountGuard.h | New header extracting RefCountGuard/RefCountSlot and generalising active_ptr to void*. |
| ddprof-lib/src/main/cpp/dictionary.h | Adds TripleBufferedDictionary, Dictionary::mergeFrom, counterId(), size() accessors. |
| ddprof-lib/src/main/cpp/dictionary.cpp | Implements mergeFrom (recursive re-insert via lookup). |
| ddprof-lib/src/main/cpp/spinLock.h | Removes tryLockSharedBounded and BoundedOptionalSharedLockGuard. |
| ddprof-lib/src/main/cpp/profiler.h | Replaces Dictionary members with TripleBufferedDictionary; drops _class_map_lock + guard factories. |
| ddprof-lib/src/main/cpp/profiler.cpp | Switches start/stop/dump/lookupClass to triple-buffer ops; removes shared-lock dance. |
| ddprof-lib/src/main/cpp/flightRecorder.cpp | writeCpool/writeClasses now read standby() snapshot without lock. |
| ddprof-lib/src/main/cpp/hotspot/hotspotSupport.cpp | Removes classMapTrySharedGuard; uses bounded_lookup directly in vtable-stub path. |
| ddprof-lib/src/main/cpp/callTraceStorage.h/.cpp | Removes the inline RefCountGuard definitions and updates the impl to void* resource. |
| ddprof-lib/src/test/cpp/dictionary_ut.cpp | New gtest covering rotate / clearStandby / merge / counter id / concurrent writes. |
| ddprof-lib/src/test/cpp/dictionary_concurrent_ut.cpp | Removed (covered the old shared-lock contract). |
| ddprof-lib/src/test/cpp/spinlock_bounded_ut.cpp | Removed alongside the deleted bounded shared-lock primitive. |
| ddprof-test/.../DictionaryRotationTest.java | New Java test asserting pre/post-dump separation and counter recalibration. |
| ddprof-test/.../BoundMethodHandleMetadataSizeTest.java | Drops the (now unused) counter-size assertion. |
| ddprof-test/.../ContendedCallTraceStorageTest.java | Skips on musl/aarch64 due to a separately tracked native bug. |
| AGENTS.md | Adds musl/aarch64/JDK11 frame-corruption rule for start_routine_wrapper_spec. |
- AGENTS.md: remove the musl/aarch64/JDK11 start_routine_wrapper_spec section
- callTraceStorage.cpp: revert active_ptr/standby_ptr → active_table/standby_table
- flightRecorder.cpp: add a standby() explanation comment in writeClasses
- javaApi.cpp: clarify the 0-sentinel for StringDictionary::bounded_lookup
- profiler.cpp: fix the lockAll() comment (it blocks calltrace paths, not dictionary access)
- profiler.cpp: add a lookupClass thread-safety explanation

Co-Authored-By: muse <muse@noreply>
@copilot resolve the merge conflicts in this pull request
Resolved. I merged
@codex review
When clang++ is the compiler, -fsanitize=address links libclang_rt.asan-x86_64.so while the explicit -lasan links GCC's libasan.so.X from the gcc fallback path. Both end up in the binary's NEEDED entries, causing "incompatible ASan runtimes" at startup. -fsanitize=address/-fsanitize=undefined handle runtime linking correctly for both GCC and clang without the explicit flags. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
locateLibasan now returns libclang_rt.asan-<arch>.so when the compiler is clang, falling back to GCC's libasan only for GCC builds. configureAsan derives the -l flag from the located library filename:
- clang: -lclang_rt.asan-<arch> satisfies -z defs for both __asan_* and __ubsan_* (clang's runtime includes both) and matches the runtime that -fsanitize=address links into executables: one runtime, no conflict.
- gcc: -lasan -lubsan as before.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
clang -fsanitize=address on an executable statically embeds the full ASan runtime via --whole-archive libclang_rt.asan*.a. Adding an explicit -lclang_rt.asan / -lasan on top creates a second dynamic NEEDED entry, causing two __asan_init calls and 'incompatible ASan runtimes' at startup. Shared library builds legitimately need the explicit -l to satisfy -z defs. Executables do not — clang handles the runtime automatically.
StringDictionaryBuffer now maintains DICTIONARY_PAGES and DICTIONARY_BYTES counters alongside the existing DICTIONARY_KEYS/KEYS_BYTES metrics:
- initCounters(offset): called by the StringDictionary ctor for each buffer; counts the root SBTable that was already allocated at construction time.
- insert_with_id: increments on a CAS-winning overflow SBTable allocation.
- clear(): decrements by the number of overflow nodes freed (the root stays).

This restores the memory-footprint observability that Dictionary provided via Counters::increment(DICTIONARY_PAGES/BYTES) without changing StringDictionary's correctness or signal-safety properties.
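A hypothetical sketch of those accounting rules: the root table is counted once at construction, only the thread whose CAS links a new overflow table counts it (the loser deletes its allocation uncounted), and clear() subtracts exactly the overflow tables it frees. BufferSketch, TableSketch and the two global counters are illustrative stand-ins for StringDictionaryBuffer, SBTable and Counters. Plain new/delete is used for brevity even though it is not async-signal-safe.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical counters standing in for DICTIONARY_PAGES / DICTIONARY_BYTES.
static std::atomic<long> g_dict_pages{0};
static std::atomic<long> g_dict_bytes{0};

struct TableSketch {  // hypothetical stand-in for an SBTable node
    std::atomic<TableSketch*> next{nullptr};
    char payload[256];
};

class BufferSketch {
public:
    BufferSketch() { count(+1); }  // initCounters(): the root table already exists
    ~BufferSketch() {
        clear();
        count(-1);
    }

    TableSketch* root() { return &_root; }

    // Returns node's overflow table, allocating it if needed; only the CAS winner counts it.
    TableSketch* overflowOf(TableSketch* node) {
        TableSketch* next = node->next.load(std::memory_order_acquire);
        if (next != nullptr) return next;
        TableSketch* fresh = new TableSketch();
        if (node->next.compare_exchange_strong(next, fresh, std::memory_order_acq_rel)) {
            count(+1);
            return fresh;
        }
        delete fresh;  // lost the race: discard uncounted; 'next' now holds the winner
        return next;
    }

    // Frees the overflow chain (root stays); must only run when no writer uses this buffer.
    void clear() {
        TableSketch* n = _root.next.exchange(nullptr, std::memory_order_acq_rel);
        long freed = 0;
        while (n != nullptr) {
            TableSketch* following = n->next.load(std::memory_order_relaxed);
            delete n;
            ++freed;
            n = following;
        }
        count(-freed);
    }

private:
    TableSketch _root;

    static void count(long delta) {
        g_dict_pages.fetch_add(delta);
        g_dict_bytes.fetch_add(delta * static_cast<long>(sizeof(TableSketch)));
    }
};
```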
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0995390da6
What does this PR do?:
Replaces the SpinLock-guarded Dictionary instances for _class_map, _string_label_map, and _context_value_map with a new StringDictionary that eliminates all locking from the read/write fast paths.

StringDictionary holds three StringDictionaryBuffer instances (_a, _b, _c) cycling through three roles via TripleBufferRotator<StringDictionaryBuffer>:
- active: receives new writes
- dump: snapshot read by the dump thread; promoted from the old active on rotate()
- scratch (clearTarget()): two rotations behind; safe to clear by clearStandby()

_next_id is a global monotonic counter starting at 1; it is never reset between rotations, so every key gets a globally stable id for the lifetime of a profiler session.

- rotate(): two-phase, ID-preserving rotation (called under SignalBlocker + lockAll via rotateDictsAndRun()):
  1. clearTarget()->copyFrom(*old_active): pre-populate the scratch buffer with the current active snapshot before the rotation.
  2. _rot.rotate(): advances the ring (old active → dump, scratch → active, old dump → scratch).
  3. RefCountGuard::waitForRefCountToClear(old_active): wait for any JNI thread still mid-lookup on the old active.
  4. new_active->copyFrom(*old_active): catch any entries inserted between Phase 1 and the drain.
- lookupDuringDump(key): called by writeClasses/writeCpool during jfr.dump() to resolve a key that may have arrived after rotate(). Probes the dump buffer first, then active; on a miss it inserts into both so the dump chunk contains the id.
- clearAll(): called by Profiler::start(reset=true). Sets _accepting=false (prevents new RefCountGuard creation), drains all in-flight guards via RefCountGuard::waitForAllRefCountsToClear(), clears all three buffers, resets _rot, and resets _next_id to 1.
- bounded_lookup(key, len) (no size_limit): signal-safe, read-only probe of active; returns 0 on a miss without inserting.
- rotateDictsAndRun(action): new Profiler helper that wraps dump/stop operations. It acquires SignalBlocker, calls lockAll(), rotates all three dictionaries, runs the action (e.g. jfr.dump()), calls unlockAll(), then calls clearStandby() on each dictionary outside the lock.
- RefCountGuard: extracted from CallTraceStorage into its own refCountGuard.h/cpp. It now also carries an outer_ptr field in RefCountSlot to keep the outer guard's buffer visible to waitForRefCountToClear during nested signal delivery. waitForAllRefCountsToClear() gains a timeout warning log when the drain exceeds the limit.
- libraryPatcher_linux.cpp: adds run_with_musl_cleanup() (__attribute__((noinline, no_stack_protector))) to host the pthread_cleanup_push/pop pair in its own frame, keeping struct __ptcb out of start_routine_wrapper_spec's DEOPT-corruption zone on musl/aarch64/JDK11.
- ASAN build fix (ConfigurationPresets.kt, PlatformUtils.kt): locateLibasan() now returns libclang_rt.asan-<arch>.so when the compiler is clang (falling back to GCC's libasan otherwise). configureAsan derives the -l flag and adds -Wl,-rpath from the located library. This fixes the "incompatible ASan runtimes" crash, caused by two different ASan runtimes (clang's from -fsanitize=address and GCC's from an explicit -lasan) landing in the binary's NEEDED entries simultaneously.

Removed:
- _class_map_lock (SpinLock) and classMapSharedGuard()/classMapTrySharedGuard() on Profiler
- tryLockSharedBounded() and BoundedOptionalSharedLockGuard on SpinLock
- dictionary_concurrent_ut.cpp and spinlock_bounded_ut.cpp (tests for removed locking primitives)
- Profiler::flushJfr() (inlined into rotateDictsAndRun)

Motivation:
Three production crashes (PROF-14583, fingerprint v10.DAECC680F0728EAB44F26DB0B91B703F, 2026-05-06 to 2026-05-08) showed SIGSEGV in std::_Rb_tree_increment via Recording::writeCpool → Recording::writeClasses → Dictionary::collect, caused by a race between writeClasses and a concurrent Dictionary::clear().

PR #516 patched this with a shared lock, but that introduced tryLockSharedBounded(5) in the signal-handler path (walkVM). Under heavy 100 µs wall-clock load on aarch64 the bounded CAS retries were consistently exhausted, causing class lookups to return -1 and corrupting JFR recordings.

Supersedes PR #522.
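To make the two-phase, ID-preserving rotate() and lookupDuringDump() from the description concrete, the following single-threaded model tracks only the id bookkeeping. StringDictionarySketch is hypothetical: it uses std::unordered_map buffers, which are not signal-safe, and it marks the RefCountGuard drain with a comment instead of performing it, so Phase 2 is a no-op here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical single-threaded model of the ID-preserving rotation; not signal-safe.
class StringDictionarySketch {
public:
    using Buffer = std::unordered_map<std::string, unsigned>;

    // Writer path: resolve or insert into the active buffer.
    unsigned lookup(const std::string& key) { return insert(active(), key); }

    void rotate() {
        Buffer* old_active = active();
        *scratch() = *old_active;  // Phase 1: pre-populate scratch with the active snapshot
        _idx = (_idx + 1) % 3;     // old active -> dump, scratch -> active, old dump -> scratch
        // The real code waits here for in-flight readers of old_active (RefCountGuard).
        for (const auto& kv : *old_active) {
            active()->insert(kv);  // Phase 2: catch keys added between Phase 1 and the drain
        }
    }

    // Dump-time path: dump buffer first, then active; on a miss, record the id in both.
    unsigned lookupDuringDump(const std::string& key) {
        auto it = dump()->find(key);
        if (it != dump()->end()) return it->second;
        unsigned id = insert(active(), key);
        (*dump())[key] = id;
        return id;
    }

    void clearStandby() { scratch()->clear(); }  // scratch is two rotations behind active
    std::size_t dumpSize() { return dump()->size(); }

private:
    Buffer _buf[3];
    int _idx = 0;
    unsigned _next_id = 1;  // global and monotonic; rotate() never resets it

    Buffer* active() { return &_buf[_idx]; }
    Buffer* scratch() { return &_buf[(_idx + 1) % 3]; }
    Buffer* dump() { return &_buf[(_idx + 2) % 3]; }

    unsigned insert(Buffer* b, const std::string& key) {
        auto it = b->find(key);
        if (it != b->end()) return it->second;
        unsigned id = _next_id++;
        b->emplace(key, id);
        return id;
    }
};
```

Ids stay stable because the new active buffer starts as a copy of the old one and _next_id is never reset by rotate(); a key first seen during the dump gets a single id shared by the dump snapshot and the active buffer.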
Root Cause:
- ebdcbc76 (Jan 20, 2026), structural bug: classMap()->lookup() called from walkVM() without _class_map_lock, racing with Dictionary::clear() in Profiler::dump() and Profiler::start().
- d6d85eb7 (Apr 23, 2026), trigger: MallocTracer routes every sampled malloc through walkVM(), raising the collision rate enough to reliably hit the race in production.

Additional Notes:
- clearStandby() is safe without any explicit drain because the scratch buffer (clearTarget()) is two full rotations behind active: its RefCountGuard drain was completed by the previous rotate() call, and _state_lock serializes all JFR operations, so no new cycle starts before clearStandby() runs.
- walkVM's vtable-stub class resolution remains best-effort; a proper fix via JVMTI ClassPrepare pre-population is left to a follow-up.

How to test the change?:
- :ddprof-lib:gtestDebug_stringDictionary_ut: rotation, RefCountGuard, concurrent writer safety
- :ddprof-lib:gtest (stress target stress_stringDictionary): concurrent insert/lookup/rotate stress
- DictionaryRotationTest (Java): counter reset after clearStandby; correct counts after fill-path inserts
- BoundMethodHandleProfilerTest (Java): profiling smoke test for bound method handles
- EndpointTest (Java): endpoint label dictionary correctness under stop/start cycles

For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.