feat(aiobs): add tracing hook for ray ml framework #14038
Bootstrap import analysis
Comparison of import times between this PR and base.
Summary
The average import time from this PR is: 283 ± 4 ms. The average import time from base is: 283 ± 3 ms. The import time difference between this PR and base is: 0.2 ± 0.1 ms. The difference is not statistically significant (z = 1.12).
Import time breakdown
The following import paths have shrunk:
This is excellent! Congratulations on getting your first PR in, @imran-at-datadog !
I would defer to the folks in the #apm-python channel on whether they would like tracer.py to live anywhere else in the repo. "Officially" distributed AI observability is not yet a separate team from LLM Observability, but we may be one day, and from that standpoint I agree with putting the file in ddtrace/aiobs instead of ddtrace/llmobs as you have done.
Would you be able to please also update CODEOWNERS, similar to #13770, to make us the owners of aiobs so that you and I can review each other's future changes to tracer.py without needing to reach out to the Python tracer team?
And also, can you add the syntax for how you were able to pip install a locally built version of ddtrace to the PR description? I am not asking to nitpick, I genuinely don't remember if I ever have done this myself before.
Benchmarks
Benchmark execution time: 2025-07-18 15:33:19
Comparing candidate commit 9a5e5d5 in PR branch
Found 0 performance improvements and 0 performance regressions! Performance is the same for 548 metrics, 2 unstable metrics.
is it possible to add a test suite for this integration?
can we update CODEOWNERS to set the appropriate team for ddtrace/aiobs?
does this have to be done via OTel? any reason we don't want to support ddtrace-run ray ...? we could probably support something like ray start --head --tracing-startup-hook=ddtrace.auto which would give you full instrumentation for all integrations, profiling, etc.
the way this is set up now you wouldn't get any of our other integrations enabled.
ead0ed0 to dc22e14
That's already in the instructions as |
Performance SLOs
Benchmark execution time: 2025-07-18 19:58:30
Comparing candidate commit abe623e in branch
coreapiscenario-context_with_data_listeners
coreapiscenario-context_with_data_no_listeners
coreapiscenario-context_with_data_only_all_listeners
coreapiscenario-get_item_exists
coreapiscenario-get_item_missing
coreapiscenario-set_item
djangosimple-appsec
djangosimple-exception-replay-enabled
djangosimple-iast
djangosimple-profiler
djangosimple-span-code-origin
djangosimple-tracer
djangosimple-tracer-and-profiler
djangosimple-tracer-no-caches
djangosimple-tracer-no-databases
djangosimple-tracer-no-middleware
djangosimple-tracer-no-templates
errortrackingdjangosimple-errortracking-enabled-all
errortrackingdjangosimple-errortracking-enabled-user
errortrackingdjangosimple-tracer-enabled
errortrackingflasksqli-errortracking-enabled-all
errortrackingflasksqli-errortracking-enabled-user
errortrackingflasksqli-tracer-enabled
flasksimple-appsec-get
flasksimple-appsec-post
flasksimple-appsec-telemetry
flasksimple-debugger
flasksimple-iast-get
flasksimple-profiler
flasksimple-tracer
flasksqli-appsec-enabled
flasksqli-iast-enabled
flasksqli-tracer-enabled
httppropagationextract-all_styles_all_headers
httppropagationextract-b3_headers
httppropagationextract-b3_single_headers
httppropagationextract-datadog_tracecontext_tracestate_not_propagated_on_trace_id_no_match
httppropagationextract-datadog_tracecontext_tracestate_propagated_on_trace_id_match
httppropagationextract-empty_headers
httppropagationextract-full_t_id_datadog_headers
httppropagationextract-invalid_priority_header
httppropagationextract-invalid_span_id_header
httppropagationextract-invalid_tags_header
httppropagationextract-invalid_trace_id_header
httppropagationextract-large_header_no_matches
httppropagationextract-large_valid_headers_all
httppropagationextract-medium_header_no_matches
httppropagationextract-medium_valid_headers_all
httppropagationextract-none_propagation_style
httppropagationextract-tracecontext_headers
httppropagationextract-valid_headers_all
httppropagationextract-valid_headers_basic
httppropagationextract-wsgi_empty_headers
httppropagationextract-wsgi_invalid_priority_header
httppropagationextract-wsgi_invalid_span_id_header
httppropagationextract-wsgi_invalid_tags_header
httppropagationextract-wsgi_invalid_trace_id_header
httppropagationextract-wsgi_large_header_no_matches
httppropagationextract-wsgi_large_valid_headers_all
httppropagationextract-wsgi_medium_header_no_matches
httppropagationextract-wsgi_medium_valid_headers_all
httppropagationextract-wsgi_valid_headers_all
httppropagationextract-wsgi_valid_headers_basic
httppropagationinject-ids_only
httppropagationinject-with_all
httppropagationinject-with_dd_origin
httppropagationinject-with_priority_and_origin
httppropagationinject-with_sampling_priority
httppropagationinject-with_tags
httppropagationinject-with_tags_invalid
httppropagationinject-with_tags_max_size
iast_aspects-re_expand_aspect
iast_aspects-re_expand_noaspect
iast_aspects-re_findall_aspect
iast_aspects-re_findall_noaspect
iast_aspects-re_finditer_aspect
iast_aspects-re_finditer_noaspect
iast_aspects-re_fullmatch_aspect
iast_aspects-re_fullmatch_noaspect
iast_aspects-re_group_aspect
iast_aspects-re_group_noaspect
iast_aspects-re_groups_aspect
iast_aspects-re_groups_noaspect
iast_aspects-re_match_aspect
iast_aspects-re_match_noaspect
iast_aspects-re_search_aspect
iast_aspects-re_search_noaspect
iast_aspects-re_sub_aspect
iast_aspects-re_sub_noaspect
iast_aspects-re_subn_aspect
iast_aspects-re_subn_noaspect
iastaspects-add_aspect
iastaspects-add_inplace_aspect
iastaspects-add_inplace_noaspect
iastaspects-add_noaspect
iastaspects-bytearray_aspect
iastaspects-bytearray_extend_aspect
iastaspects-bytearray_extend_noaspect
iastaspects-bytearray_noaspect
iastaspects-bytes_aspect
iastaspects-bytes_noaspect
iastaspects-bytesio_aspect
iastaspects-bytesio_noaspect
iastaspects-capitalize_aspect
iastaspects-capitalize_noaspect
iastaspects-casefold_aspect
iastaspects-casefold_noaspect
iastaspects-decode_aspect
iastaspects-decode_noaspect
iastaspects-encode_aspect
iastaspects-encode_noaspect
iastaspects-format_aspect
iastaspects-format_map_aspect
iastaspects-format_map_noaspect
iastaspects-format_noaspect
iastaspects-index_aspect
iastaspects-index_noaspect
iastaspects-join_aspect
iastaspects-join_noaspect
iastaspects-ljust_aspect
iastaspects-ljust_noaspect
iastaspects-lower_aspect
iastaspects-lower_noaspect
iastaspects-lstrip_aspect
iastaspects-lstrip_noaspect
iastaspects-modulo_aspect
iastaspects-modulo_aspect_for_bytearray_bytearray
iastaspects-modulo_aspect_for_bytes
iastaspects-modulo_aspect_for_bytes_bytearray
iastaspects-modulo_noaspect
iastaspects-replace_aspect
iastaspects-replace_noaspect
iastaspects-repr_aspect
iastaspects-repr_noaspect
iastaspects-rstrip_aspect
iastaspects-rstrip_noaspect
iastaspects-slice_aspect
iastaspects-slice_noaspect
iastaspects-stringio_aspect
iastaspects-stringio_noaspect
iastaspects-strip_aspect
iastaspects-strip_noaspect
iastaspects-swapcase_aspect
iastaspects-swapcase_noaspect
iastaspects-title_aspect
iastaspects-title_noaspect
iastaspects-translate_aspect
iastaspects-translate_noaspect
iastaspects-upper_aspect
iastaspects-upper_noaspect
iastaspectsospath-ospathbasename_aspect
iastaspectsospath-ospathbasename_noaspect
iastaspectsospath-ospathjoin_aspect
iastaspectsospath-ospathjoin_noaspect
iastaspectsospath-ospathnormcase_aspect
iastaspectsospath-ospathnormcase_noaspect
iastaspectsospath-ospathsplit_aspect
iastaspectsospath-ospathsplit_noaspect
iastaspectsospath-ospathsplitdrive_aspect
iastaspectsospath-ospathsplitdrive_noaspect
iastaspectsospath-ospathsplitext_aspect
iastaspectsospath-ospathsplitext_noaspect
iastaspectssplit-rsplit_aspect
iastaspectssplit-rsplit_noaspect
iastaspectssplit-split_aspect
iastaspectssplit-split_noaspect
iastaspectssplit-splitlines_aspect
iastaspectssplit-splitlines_noaspect
iastpropagation-no-propagation
iastpropagation-propagation_enabled
iastpropagation-propagation_enabled_100
iastpropagation-propagation_enabled_1000
otelsdkspan-add-event
otelsdkspan-add-link
otelsdkspan-add-metrics
otelsdkspan-add-tags
otelsdkspan-get-context
otelsdkspan-is-recording
otelsdkspan-record-exception
otelsdkspan-set-status
otelsdkspan-start
otelsdkspan-start-finish
otelsdkspan-start-finish-telemetry
otelsdkspan-update-name
otelspan-add-event
otelspan-add-metrics
otelspan-add-tags
otelspan-get-context
otelspan-is-recording
otelspan-record-exception
otelspan-set-status
otelspan-start
otelspan-start-finish
otelspan-start-finish-telemetry
otelspan-update-name
packagespackageforrootmodulemapping-cache_off
packagespackageforrootmodulemapping-cache_on
packagesupdateimporteddependencies-import_many
packagesupdateimporteddependencies-import_many_cached
packagesupdateimporteddependencies-import_many_stdlib
packagesupdateimporteddependencies-import_many_stdlib_cached
packagesupdateimporteddependencies-import_many_unknown
packagesupdateimporteddependencies-import_many_unknown_cached
packagesupdateimporteddependencies-import_one
packagesupdateimporteddependencies-import_one_cache
packagesupdateimporteddependencies-import_one_stdlib
packagesupdateimporteddependencies-import_one_stdlib_cache
packagesupdateimporteddependencies-import_one_unknown
packagesupdateimporteddependencies-import_one_unknown_cache
ratelimiter-defaults
ratelimiter-high_rate_limit
ratelimiter-long_window
ratelimiter-low_rate_limit
ratelimiter-no_rate_limit
ratelimiter-short_window
recursivecomputation-deep
recursivecomputation-deep-profiled
recursivecomputation-medium
recursivecomputation-shallow
samplingrules-average_match
samplingrules-high_match
samplingrules-low_match
samplingrules-very_low_match
sethttpmeta-all-disabled
sethttpmeta-all-enabled
sethttpmeta-collectipvariant_exists
sethttpmeta-no-collectipvariant
sethttpmeta-no-useragentvariant
sethttpmeta-obfuscation-no-query
sethttpmeta-obfuscation-regular-case-explicit-query
sethttpmeta-obfuscation-regular-case-implicit-query
sethttpmeta-obfuscation-send-querystring-disabled
sethttpmeta-obfuscation-worst-case-explicit-query
sethttpmeta-obfuscation-worst-case-implicit-query
sethttpmeta-useragentvariant_exists_1
sethttpmeta-useragentvariant_exists_2
sethttpmeta-useragentvariant_exists_3
sethttpmeta-useragentvariant_not_exists_1
sethttpmeta-useragentvariant_not_exists_2
span-add-event
span-add-metrics
span-add-tags
span-get-context
span-is-recording
span-record-exception
span-set-status
span-start
span-start-finish
span-start-finish-telemetry
span-start-finish-traceid128
span-start-traceid128
span-update-name
telemetryaddmetric-1-count-metric-1-times
telemetryaddmetric-1-count-metrics-100-times
telemetryaddmetric-1-distribution-metric-1-times
telemetryaddmetric-1-distribution-metrics-100-times
telemetryaddmetric-1-gauge-metric-1-times
telemetryaddmetric-1-gauge-metrics-100-times
telemetryaddmetric-1-rate-metric-1-times
telemetryaddmetric-1-rate-metrics-100-times
telemetryaddmetric-100-count-metrics-100-times
telemetryaddmetric-100-distribution-metrics-100-times
telemetryaddmetric-100-gauge-metrics-100-times
telemetryaddmetric-100-rate-metrics-100-times
telemetryaddmetric-flush-1-metric
telemetryaddmetric-flush-100-metrics
telemetryaddmetric-flush-1000-metrics
tracer-large
tracer-medium
tracer-small
Note: All comparisons are against the mean unless a different statistic (e.g., p95) is explicitly shown. |
No, it doesn't have to be done this way. We have discussed various approaches, including instrumentation. This is not intended to be the final solution yet, but something we can explore and iterate on. The OpenTelemetry approach might get us more structured information, since Ray may already do some tagging and analysis, but other approaches have benefits too. We are hoping to settle on the best method in the next month or so.
Can you give me a bit more info on what type of tests would be appropriate? Do you have an example of similar tests? Since this is preliminary/exploratory I was hoping the manual tests I provided would be enough. Do you think this is reasonable? |
Given we are adding something new to the public API, we should have the full suite of testing for this: integration, end-to-end, unit tests, etc. Snapshot tests with the test agent are a great building block for writing tests for integrations. If you want to instead make it be Just an option.
Looks good. Needs tests and a release note.
I will most likely be closing this in favor of [WIP] add ray tracing hook with ddtrace.auto import |
Overview
This PR adds a tracing startup hook for the Ray ML Framework as described in MLOB-3238, which lays the groundwork for the Ray distributed AI training observability MVP.
The tracing hook can be passed to the ray start command via
PYTHONPATH=path/to/ddtrace ray start --head --tracing-startup-hook=ddtrace.aiobs.integrations.ray.tracer:setup_tracing
For example:
PYTHONPATH=/Users/imran.hendley/.pyenv/versions/3.12.8/envs/ray/lib/python3.12/site-packages ray start --head --tracing-startup-hook=ddtrace.aiobs.integrations.ray.tracer:setup_tracing
This instructs Ray to export OpenTelemetry traces to the Datadog Agent via the startup hook in dd-trace-py. Those traces are in turn streamed to the tracing API endpoint configured in the Datadog Agent settings.
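The hook value above has the form `module.path:function`. As a rough illustration of how such a string can be turned into a callable, here is a stdlib-only sketch. It is not Ray's actual loader, and `resolve_hook` is a hypothetical name introduced only for this example.

```python
import importlib
from typing import Callable


def resolve_hook(spec: str) -> Callable:
    """Resolve a 'module.path:attribute' string to a callable.

    Hypothetical helper for illustration; Ray's own loading code may differ.
    """
    module_name, sep, attr_name = spec.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    # The module must be importable, which is why the directory containing
    # the package has to be on the import path of the process loading it.
    module = importlib.import_module(module_name)
    hook = getattr(module, attr_name)
    if not callable(hook):
        raise TypeError(f"{spec!r} does not name a callable")
    return hook


if __name__ == "__main__":
    # A stdlib target is used so the sketch runs without ddtrace or Ray.
    dumps = resolve_hook("json:dumps")
    print(dumps({"ok": True}))
```

Under this reading, the `PYTHONPATH=path/to/ddtrace` prefix in the command is what makes `ddtrace.aiobs.integrations.ray.tracer` importable from the Ray process.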
Motivation
As described in Distributed AI Observability Proposal, we would like to provide comprehensive real-time monitoring and root cause analysis (RCA) for distributed training workloads to our customers. Specifically, for the 2025 Q3 MVP we would like to collect and report training jobs and the traces, logs, and metrics associated with each job, and to report the status of each job, including whether it succeeded or failed. This PR provides initial functionality for exporting traces via a Ray tracing startup hook. We will explore expanding this approach to cover the remaining scope for the MVP backend.
Test plan
Build and install ddtrace with these changes. Before this PR lands and rolls out in a new release of ddtrace this could be achieved via the following steps:
Modify your Datadog Agent settings to point to staging, adding your own Staging API keys in place of <staging_api_key>, but leave your existing run_path line and everything below at the bottom if you have it.
Download the hello.py script linked in MLOB-2922.
Set the time window to "Live Past 15 minutes" and search for "service:my-ray-job" in the Staging APM > Traces > Explorer search box at https://dd.datad0g.com/apm/traces and observe the list of traces captured from the toy training job kicked off in hello.py.
To test the fallback job name run:
Then set the time window to "Live Past 15 minutes" and search for service:unspecified-ray-job in the Staging traces explorer search box.
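The fallback behaviour exercised by this step can be pictured roughly as follows. This is a minimal sketch, assuming the service name is read from an environment variable: the variable name `RAY_JOB_SERVICE_NAME` and the function `resolve_service_name` are hypothetical, and only the fallback value `unspecified-ray-job` comes from this PR description.

```python
import os
from typing import Mapping, Optional

FALLBACK_SERVICE = "unspecified-ray-job"  # fallback value named in the test plan
SERVICE_ENV_VAR = "RAY_JOB_SERVICE_NAME"  # hypothetical variable name


def resolve_service_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured service name, or the fallback if none is set."""
    env = os.environ if env is None else env
    # Treat an unset, empty, or whitespace-only value as "not configured".
    name = env.get(SERVICE_ENV_VAR, "").strip()
    return name or FALLBACK_SERVICE


if __name__ == "__main__":
    print(resolve_service_name({SERVICE_ENV_VAR: "my-ray-job"}))
    print(resolve_service_name({}))
```

Passing the environment as a mapping keeps the function easy to unit test without mutating `os.environ`.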
Risk assessment
This is a low-risk change because we are adding a feature which must be invoked manually at this point and does not run by default.
Release Notes
This change lays the groundwork for an unreleased feature and does not affect public-facing APIs, so no release notes are needed.
Documentation
Distributed AI Observability Proposal
[RFC] Distributed AI Observability