Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to register ADP as a remote agent to the Datadog Agent #377

Merged
merged 32 commits into from
Jan 27, 2025

Conversation

rayz
Copy link
Contributor

@rayz rayz commented Dec 9, 2024

Context

We would like to be able to register ADP as a remote agent to the Core Agent so that status and flares can be requested.

Solution

This pr creates a new destination component that is intended for both status and flare information.

Notes

This pr registers ADP as a remote agent but a following pr is needed for actually sending data over.

Testing

Run ADP with an agent that has the remote agent registry enabled.
Run the status command on the agent. You should see ADP listed in the Remote Agents section.

@github-actions github-actions bot added area/config Configuration. area/components Sources, transforms, and destinations. labels Dec 9, 2024
@pr-commenter
Copy link

pr-commenter bot commented Dec 9, 2024

Regression Detector (DogStatsD)

Regression Detector Results

Run ID: 43bc1497-f241-457d-8e91-bbea594d0709

Baseline: 7.62.0-rc.2
Comparison: 7.62.0-rc.2

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
dsd_uds_100mb_3k_contexts_distributions_only memory utilization +2.37 [+2.18, +2.56] 1
dsd_uds_100mb_3k_contexts ingress throughput +0.00 [-0.04, +0.05] 1
dsd_uds_40mb_12k_contexts_40_senders ingress throughput +0.00 [-0.00, +0.00] 1
dsd_uds_500mb_3k_contexts ingress throughput +0.00 [-0.01, +0.01] 1
dsd_uds_1mb_3k_contexts_dualship ingress throughput +0.00 [-0.00, +0.00] 1
dsd_uds_512kb_3k_contexts ingress throughput +0.00 [-0.01, +0.01] 1
dsd_uds_1mb_50k_contexts_memlimit ingress throughput +0.00 [-0.00, +0.00] 1
dsd_uds_10mb_3k_contexts ingress throughput +0.00 [-0.00, +0.00] 1
dsd_uds_1mb_50k_contexts ingress throughput +0.00 [-0.00, +0.00] 1
dsd_uds_1mb_3k_contexts ingress throughput -0.00 [-0.00, +0.00] 1
dsd_uds_100mb_250k_contexts ingress throughput -0.00 [-0.00, +0.00] 1
quality_gates_idle_rss memory utilization -0.57 [-0.66, -0.49] 1

Bounds Checks: ❌ Failed

perf experiment bounds_check_name replicates_passed links
quality_gates_idle_rss memory_usage 0/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@github-actions github-actions bot added the area/io General I/O and networking. label Dec 11, 2024
@pr-commenter
Copy link

pr-commenter bot commented Dec 11, 2024

Regression Detector Links

Experiment Result Links

experiment link(s)
dsd_uds_100mb_250k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_100mb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_100mb_3k_contexts_distributions_only [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_10mb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_1mb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_1mb_3k_contexts_dualship [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_1mb_50k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_1mb_50k_contexts_memlimit [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_40mb_12k_contexts_40_senders [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_500mb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_512kb_3k_contexts [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
quality_gates_idle_rss [Profiling (ADP)] [Profiling (DSD)] [SMP Dashboard]
dsd_uds_50mb_10k_contexts_no_inlining (ADP only) [Profiling (ADP)] [SMP Dashboard]
dsd_uds_50mb_10k_contexts_no_inlining_no_allocs (ADP only) [Profiling (ADP)] [SMP Dashboard]

Copy link
Member

@tobz tobz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think the component-based approach is a reasonable one.

I started leaving some feedback to better shape the design even though I realize it's still very much a rough draft.

bin/agent-data-plane/src/main.rs Outdated Show resolved Hide resolved
Comment on lines 35 to 36
#[serde(default = "default_api_endpoint", rename = "remote_agent_api_endpoint")]
api_endpoint: String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make this be api_port, since we don't ever want to expose a port on anything other than localhost.

The default port value should also be 5102, which puts it sequentially with the other two ADP ports, 5100 (admin API) and 5101 (telemetry API).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question:
We would use localhost in the case that the Agent and ADP runs in the same machine. Would there be instances in which ADP runs on another machine? Maybe containers?

Mostly, asking to understand the possible deployment topologies 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For containerized environments, everything is currently on the same machine: standalone Docker, Kubernetes, etc.

If ADP ever became a mechanism for doing remote aggregation (pool of ADP workers running on a different node than the application send metrics/logs/traces/etc) then this might matter... but I can't really think of a situation where ADP will ever really be decoupled from a local Agent.

@rayz rayz marked this pull request as ready for review December 12, 2024 16:08
@rayz rayz requested review from a team as code owners December 12, 2024 16:08
@github-actions github-actions bot added the destination/datadog Common Datadog destination code. label Dec 13, 2024
Copy link
Member

@tobz tobz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are looking good.

The thing I remembered, though, is that we need a way to disable this for when we aren't actually running with the Agent, such as in CI. I think we'd basically want an if block in create_topology or something... but that gets tricky because we depend on this component now for being ale to always add the internal metrics source. Hmm....

bin/agent-data-plane/src/main.rs Outdated Show resolved Hide resolved

// Time to (re)register with the Core Agent.
_ = register_agent.tick() => {
match client.register_remote_agent_request(&id, &display_name, &api_endpoint, &auth_token).await {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Purely as a sidenote that we should add a TODO for, but isn't blocking for this PR: this call might take a while if the Agent is temporarily down or slow to respond... which would in turn slow down the accepting of events/responding to health check requests.

I don't know the exact code I would want to use to do it, but we would likely want to consider figuring out how we could spawn this call as a background task so that the component can keep polling, but limit ourselves to one in-flight request at a time.

Typing this all up also reminds me that we don't have any request timeout configuration in RemoteAgentClient. 🤔

bin/agent-data-plane/src/main.rs Outdated Show resolved Hide resolved
bin/agent-data-plane/src/main.rs Outdated Show resolved Hide resolved
@github-actions github-actions bot added the area/observability Internal observability of ADP and Saluki. label Dec 17, 2024
// Time to (re)register with the Core Agent.
//
// TODO: Consider spawning the registration as a task so that the component can keep polling and not slow down the accepting of events and responding of health checks.
_ = register_agent.tick() => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This component is only built if adp_standalone_mode == false, but this call will keep failing if

 remote_agent_registry:
   enabled: true

is not set on the datadog-agent side.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷🏻

I think that's just something we'll have to deal with until we make it the default to enable the remote agent registry. We don't get any visible errors, right, since it's just debug logging if the call fails? Or does it have any other user-visible impact in ADP when the remote agent registry is disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it will just spam the debug logs if it fails.

// Time to (re)register with the Core Agent.
//
// TODO: Consider spawning the registration as a task so that the component can keep polling and not slow down the accepting of events and responding of health checks.
_ = register_agent.tick() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷🏻

I think that's just something we'll have to deal with until we make it the default to enable the remote agent registry. We don't get any visible errors, right, since it's just debug logging if the call fails? Or does it have any other user-visible impact in ADP when the remote agent registry is disabled?

///
/// ## Missing
///
/// - grpc server to respond to Core Agent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say this isn't missing anymore. :)


// Build our administrative API server.
let primary_api_listen_address = configuration
.try_get_typed("api_listen_address")
.error_context("Failed to get API listen address.")?
.unwrap_or_else(|| ListenAddress::Tcp(([0, 0, 0, 0], 5100).into()));

let remote_agent_server = new_remote_agent_server()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this, but it also doesn't matter right now because it's basically a no-op anyways.

This will matter more, though, when we actually need to wire up the gRPC server implementation to the data collected by the component, at which point we'll need to do something more like grab this from the component configuration instead, so that we can build it pre-wired to the component.

Again, not a concern for this PR, but we'll have to do something else for sure when actually adding the logic.

lib/saluki-app/src/api.rs Outdated Show resolved Hide resolved
lib/saluki-app/src/api.rs Outdated Show resolved Hide resolved

let multiplexed_service = TowerToHyperService::new(MultiplexService::new(
http_service,
self.grpc_router.unwrap().into_router(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't unwrap here. Maybe just .unwrap_or_else(|| Server::builder())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I believe Server::builder() needs a .add_service(svc) call on it to actually make it return a Router which we would then use with the multiplexed service. Otherwise it just returns a Server

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, doh, that's true.

Looking at the docs, I think we might want to change grpc_router to be RoutesBuilder: this would let us initialize the field with RoutesBuilder::default(), add multiple gRPC services via RoutesBuilder::add_service, and then in serve, we could consume it and get a Routes by calling RoutesBuilder::routes, which we would then convert into the proper router type by calling Routes::into_axum_router.

rayz added 4 commits January 23, 2025 20:34
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
@rayz rayz merged commit d990462 into main Jan 27, 2025
16 checks passed
tobz pushed a commit that referenced this pull request Jan 28, 2025
)

# Context

We would like to be able to register ADP as a remote agent to the Core
Agent so that status and flares can be requested.

# Solution

This pr creates a new destination component that is intended for both
status and flare information.

# Notes

This pr registers ADP as a remote agent but a following pr is needed for
actually sending data over.

# Testing

Run ADP with an agent that has the remote agent registry enabled. 
Run the `status` command on the agent. You should see ADP listed in the
`Remote Agents` section.

---------

Signed-off-by: Raymond Zhao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/components Sources, transforms, and destinations. area/config Configuration. area/io General I/O and networking. area/observability Internal observability of ADP and Saluki. destination/datadog Common Datadog destination code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants