-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to register ADP as a remote agent to the Datadog Agent #377
Conversation
Regression Detector (DogStatsD)Regression Detector ResultsRun ID: 43bc1497-f241-457d-8e91-bbea594d0709 Baseline: 7.62.0-rc.2 Optimization Goals: ✅ No significant changes detected
|
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | dsd_uds_100mb_3k_contexts_distributions_only | memory utilization | +2.37 | [+2.18, +2.56] | 1 | |
➖ | dsd_uds_100mb_3k_contexts | ingress throughput | +0.00 | [-0.04, +0.05] | 1 | |
➖ | dsd_uds_40mb_12k_contexts_40_senders | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
➖ | dsd_uds_500mb_3k_contexts | ingress throughput | +0.00 | [-0.01, +0.01] | 1 | |
➖ | dsd_uds_1mb_3k_contexts_dualship | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
➖ | dsd_uds_512kb_3k_contexts | ingress throughput | +0.00 | [-0.01, +0.01] | 1 | |
➖ | dsd_uds_1mb_50k_contexts_memlimit | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
➖ | dsd_uds_10mb_3k_contexts | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
➖ | dsd_uds_1mb_50k_contexts | ingress throughput | +0.00 | [-0.00, +0.00] | 1 | |
➖ | dsd_uds_1mb_3k_contexts | ingress throughput | -0.00 | [-0.00, +0.00] | 1 | |
➖ | dsd_uds_100mb_250k_contexts | ingress throughput | -0.00 | [-0.00, +0.00] | 1 | |
➖ | quality_gates_idle_rss | memory utilization | -0.57 | [-0.66, -0.49] | 1 |
Bounds Checks: ❌ Failed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
❌ | quality_gates_idle_rss | memory_usage | 0/10 |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
Regression Detector LinksExperiment Result Links
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think the component-based approach is a reasonable one.
I started leaving some feedback to better shape the design even though I realize it's still very much a rough draft.
lib/saluki-components/src/destinations/datadog_status_flare/mod.rs
Outdated
Show resolved
Hide resolved
lib/saluki-components/src/destinations/datadog_status_flare/mod.rs
Outdated
Show resolved
Hide resolved
lib/saluki-components/src/destinations/datadog_status_flare/mod.rs
Outdated
Show resolved
Hide resolved
lib/saluki-components/src/destinations/datadog_status_flare/mod.rs
Outdated
Show resolved
Hide resolved
#[serde(default = "default_api_endpoint", rename = "remote_agent_api_endpoint")] | ||
api_endpoint: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make this be api_port
, since we don't ever want to expose a port on anything other than localhost
.
The default port value should also be 5102, which puts it sequentially with the other two ADP ports, 5100 (admin API) and 5101 (telemetry API).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question:
We would use localhost
in the case that the Agent and ADP runs in the same machine. Would there be instances in which ADP runs on another machine? Maybe containers?
Mostly, asking to understand the possible deployment topologies 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For containerized environments, everything is currently on the same machine: standalone Docker, Kubernetes, etc.
If ADP ever became a mechanism for doing remote aggregation (pool of ADP workers running on a different node than the application send metrics/logs/traces/etc) then this might matter... but I can't really think of a situation where ADP will ever really be decoupled from a local Agent.
lib/saluki-components/src/destinations/datadog/status_flare/mod.rs
Outdated
Show resolved
Hide resolved
lib/saluki-components/src/destinations/datadog/status_flare/mod.rs
Outdated
Show resolved
Hide resolved
lib/saluki-components/src/destinations/datadog/status_flare/mod.rs
Outdated
Show resolved
Hide resolved
lib/saluki-components/src/destinations/datadog/status_flare/mod.rs
Outdated
Show resolved
Hide resolved
lib/saluki-components/src/destinations/datadog/status_flare/mod.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes are looking good.
The thing I remembered, though, is that we need a way to disable this for when we aren't actually running with the Agent, such as in CI. I think we'd basically want an if block in create_topology
or something... but that gets tricky because we depend on this component now for being ale to always add the internal metrics source. Hmm....
lib/saluki-components/src/destinations/datadog/status_flare/mod.rs
Outdated
Show resolved
Hide resolved
|
||
// Time to (re)register with the Core Agent. | ||
_ = register_agent.tick() => { | ||
match client.register_remote_agent_request(&id, &display_name, &api_endpoint, &auth_token).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Purely as a sidenote that we should add a TODO for, but isn't blocking for this PR: this call might take a while if the Agent is temporarily down or slow to respond... which would in turn slow down the accepting of events/responding to health check requests.
I don't know the exact code I would want to use to do it, but we would likely want to consider figuring out how we could spawn this call as a background task so that the component can keep polling, but limit ourselves to one in-flight request at a time.
Typing this all up also reminds me that we don't have any request timeout configuration in RemoteAgentClient
. 🤔
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
// Time to (re)register with the Core Agent. | ||
// | ||
// TODO: Consider spawning the registration as a task so that the component can keep polling and not slow down the accepting of events and responding of health checks. | ||
_ = register_agent.tick() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This component is only built if adp_standalone_mode == false
, but this call will keep failing if
remote_agent_registry:
enabled: true
is not set on the datadog-agent
side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷🏻
I think that's just something we'll have to deal with until we make it the default to enable the remote agent registry. We don't get any visible errors, right, since it's just debug logging if the call fails? Or does it have any other user-visible impact in ADP when the remote agent registry is disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, it will just spam the debug logs if it fails.
// Time to (re)register with the Core Agent. | ||
// | ||
// TODO: Consider spawning the registration as a task so that the component can keep polling and not slow down the accepting of events and responding of health checks. | ||
_ = register_agent.tick() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷🏻
I think that's just something we'll have to deal with until we make it the default to enable the remote agent registry. We don't get any visible errors, right, since it's just debug logging if the call fails? Or does it have any other user-visible impact in ADP when the remote agent registry is disabled?
/// | ||
/// ## Missing | ||
/// | ||
/// - grpc server to respond to Core Agent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say this isn't missing anymore. :)
lib/saluki-components/src/destinations/datadog/status_flare/mod.rs
Outdated
Show resolved
Hide resolved
lib/saluki-components/src/destinations/datadog/status_flare/mod.rs
Outdated
Show resolved
Hide resolved
bin/agent-data-plane/src/main.rs
Outdated
|
||
// Build our administrative API server. | ||
let primary_api_listen_address = configuration | ||
.try_get_typed("api_listen_address") | ||
.error_context("Failed to get API listen address.")? | ||
.unwrap_or_else(|| ListenAddress::Tcp(([0, 0, 0, 0], 5100).into())); | ||
|
||
let remote_agent_server = new_remote_agent_server()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this, but it also doesn't matter right now because it's basically a no-op anyways.
This will matter more, though, when we actually need to wire up the gRPC server implementation to the data collected by the component, at which point we'll need to do something more like grab this from the component configuration instead, so that we can build it pre-wired to the component.
Again, not a concern for this PR, but we'll have to do something else for sure when actually adding the logic.
lib/saluki-app/src/api.rs
Outdated
|
||
let multiplexed_service = TowerToHyperService::new(MultiplexService::new( | ||
http_service, | ||
self.grpc_router.unwrap().into_router(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't unwrap here. Maybe just .unwrap_or_else(|| Server::builder())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I believe Server::builder()
needs a .add_service(svc)
call on it to actually make it return a Router
which we would then use with the multiplexed service. Otherwise it just returns a Server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, doh, that's true.
Looking at the docs, I think we might want to change grpc_router
to be RoutesBuilder
: this would let us initialize the field with RoutesBuilder::default()
, add multiple gRPC services via RoutesBuilder::add_service
, and then in serve
, we could consume it and get a Routes
by calling RoutesBuilder::routes
, which we would then convert into the proper router type by calling Routes::into_axum_router
.
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
Signed-off-by: Raymond Zhao <[email protected]>
) # Context We would like to be able to register ADP as a remote agent to the Core Agent so that status and flares can be requested. # Solution This pr creates a new destination component that is intended for both status and flare information. # Notes This pr registers ADP as a remote agent but a following pr is needed for actually sending data over. # Testing Run ADP with an agent that has the remote agent registry enabled. Run the `status` command on the agent. You should see ADP listed in the `Remote Agents` section. --------- Signed-off-by: Raymond Zhao <[email protected]>
Context
We would like to be able to register ADP as a remote agent to the Core Agent so that status and flares can be requested.
Solution
This pr creates a new destination component that is intended for both status and flare information.
Notes
This pr registers ADP as a remote agent but a following pr is needed for actually sending data over.
Testing
Run ADP with an agent that has the remote agent registry enabled.
Run the
status
command on the agent. You should see ADP listed in theRemote Agents
section.