Skip to content

feat: Lazily resolve api key #717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

lym953
Copy link
Contributor

@lym953 lym953 commented Jun 24, 2025

Motivation

From @astuyve:

today we basically block/await on that decrypt call before we can call /next
so if we can instead make that async and then resolve the future only when we need to flush data, that can be a big win for many customers.

https://datadoghq.atlassian.net/browse/SVLS-6995

Previous work

DataDog/serverless-components#21, DataDog/serverless-components#24 created ApiKeyFactory, which is a util to enable lazy API key resolution.

This PR

Updates Bottlecap code to use ApiKeyFactory to lazily resolve API key, i.e. instead of resolving it by querying Secret Manager or KMS during init phase, do it at flushing time when api key is actually needed.

Note

This PR changes the behavior when key resolution fails, i.e. when resolve_secrets() returns None.

  • Before: run extension_loop_idle(), which does not stop the runtime
  • After: panic, which will stop the runtime (if I understand correctly). Of course it's not ideal. Any better idea?
    • It's harder now to run extension_loop_idle() because api key resolution code is not in the main loop anymore, but in various consumer code of api key
    • Is there a way to gracefully shut down the extension without affecting the runtime?

Update: Added a PR to address resolution failure: #732
These two PRs should be merged together. Keeping them separate PRs just to make review easier.

Testing

Setup

  • Runtime: Go1 on Amazon Linux 2
  • Architecture: arm64
  • An app with empty implementation code

Result

Below is the Datadog Next-Gen Extension ready in: time logged.

  • Before: (prod extension arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension-ARM:82)

    • 88.6 ± 1.8 (ms)
  • After: (test extension arn:aws:lambda:us-east-1:425362996713:layer:Datadog-Bottlecap-Beta-ARM-yiming:2)

    • 35.4 ± 5.1 (ms)
    • (-60.0%)
image

Both use 5 samples.

Notes

https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998

@lym953 lym953 force-pushed the yiming.luo/separate-aws-creds branch from f749307 to 11c1c77 Compare June 27, 2025 17:35
lym953 added a commit that referenced this pull request Jul 2, 2025
# Problem
Right now `AwsConfig` has a lot of fields, including the ones related to
credential:
```
    pub aws_access_key_id: String,
    pub aws_secret_access_key: String,
    pub aws_session_token: String,
    pub aws_container_credentials_full_uri: String,
    pub aws_container_authorization_token: String,
```

The next PR #717
wants to lazily load API key and the credentials. To do that, for the
resolver function `resolve_secrets()`, I need to change the param
`aws_config` from `&AwsConfig` to `Arc<RwLock<AwsConfig>>`. Because
`aws_config` is passed to many places, this change involves updating
lots of functions, which is formidable.

# This PR
Separates these credential-related fields out from `AwsConfig` and
creates a new struct `AwsCredentials`

Thus, the next PR will only need to change the param `aws_credentials`
from `&AwsCredentials` to `Arc<RwLock<AwsCredentials>>`. Because
`aws_credentials` is not used in lots of places, the next PR becomes
easier.

https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998
Base automatically changed from yiming.luo/separate-aws-creds to main July 2, 2025 21:11
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from fdd93a4 to 4ac16ad Compare July 2, 2025 21:31
@lym953 lym953 requested a review from Copilot July 3, 2025 15:52
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR integrates ApiKeyFactory across Bottlecap to defer DD-API-KEY resolution until flush/send time, reducing init latency. Key changes include:

  • Replace direct API key strings with Arc<ApiKeyFactory> in all flushers, agents, and tests
  • Refactor trace/stat flusher and log flusher to initialize endpoints and headers lazily via OnceCell
  • Update resolve_secrets to use an async RwLock for AWS credentials and adjust related helper signatures

Reviewed Changes

Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/metrics_integration_test.rs Switched FlusherConfig to use ApiKeyFactory
tests/logs_integration_test.rs Switched LogsFlusher instantiation to use factory
src/traces/trace_processor.rs Made process_traces async and use ApiKeyFactory
src/traces/trace_agent.rs Replaced stored key with factory; await per request
src/traces/stats_flusher.rs Swapped in factory and lazily build Endpoint
src/secrets/decrypt.rs Converted credentials to Arc<RwLock<_>> and updated calls
src/proxy/mod.rs Changed should_start_proxy to take Arc<AwsConfig>
src/proxy/interceptor.rs Updated interceptor to use Arc<AwsConfig>
src/otlp/agent.rs Updated OTLP agent to await process_traces
src/logs/flusher.rs Introduced factory plus lazy HeaderMap caching
src/lifecycle/invocation/span_inferrer.rs Updated tests to pass Arc<AwsConfig>
src/lifecycle/invocation/processor.rs Refactored processor to use Arc<AwsConfig>
Cargo.toml Bumped dogstatsd revision

Comment on lines +40 to 42
endpoint: OnceCell<Endpoint>,
}

Copy link
Preview

Copilot AI Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caching the Endpoint with an initial API key may lead to stale keys if ApiKeyFactory rotates or refreshes credentials. Consider regenerating or invalidating the cell when the underlying key changes.

Suggested change
endpoint: OnceCell<Endpoint>,
}
}
impl ServerlessStatsFlusher {
async fn construct_endpoint(&self) -> Endpoint {
let api_key = self.api_key_factory.get_api_key().await.to_string();
let stats_url = trace_stats_url(&self.config.site);
Endpoint {
url: hyper::Uri::from_str(&stats_url)
.expect("can't make URI from stats url, exiting"),
api_key: Some(api_key.clone().into()),
timeout_ms: self.config.flush_timeout * 1_000,
test_token: None,
}
}
}

Copilot uses AI. Check for mistakes.

@@ -26,42 +28,24 @@ pub struct Flusher {
client: reqwest::Client,
aggregator: Arc<Mutex<Aggregator>>,
config: Arc<config::Config>,
headers: HeaderMap,
api_key_factory: Arc<ApiKeyFactory>,
headers: OnceCell<HeaderMap>,
Copy link
Preview

Copilot AI Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing headers including DD-API-KEY in a OnceCell will cache the first resolved key indefinitely. If the key can change at runtime, you might emit outdated headers; consider refreshing per flush or using a time-to-live.

Suggested change
headers: OnceCell<HeaderMap>,
headers: Arc<Mutex<HeaderMap>>,

Copilot uses AI. Check for mistakes.

@@ -323,7 +329,7 @@ mod tests {
};

let received_payload =
if let TracerPayloadCollection::V07(payload) = tracer_payload.get_payloads() {
if let TracerPayloadCollection::V07(payload) = tracer_payload.await.get_payloads() {
Copy link
Preview

Copilot AI Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Awaiting the future inline in the match expression can reduce readability. It may help to .await process_traces into a local send_data variable first, then call get_payloads() on it.

Copilot uses AI. Check for mistakes.

Comment on lines 333 to 351
let aws_config = Arc::new(aws_config);
let aws_credentials = Arc::new(RwLock::new(aws_credentials));
let api_key_factory = {
let config = Arc::clone(&config);
let aws_config = Arc::clone(&aws_config);
let aws_credentials = Arc::clone(&aws_credentials);

Arc::new(ApiKeyFactory::new_from_resolver(Arc::new(move || {
let config = Arc::clone(&config);
let aws_config = Arc::clone(&aws_config);
let aws_credentials = Arc::clone(&aws_credentials);

Box::pin(async move {
resolve_secrets(config, aws_config, aws_credentials)
.await
.unwrap_or_else(|| {
error!("Failed to resolve API key");
String::new()
})
})
})))
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During INIT phase, instead of resolving API key, just initialize an API key factory.

@@ -415,11 +430,11 @@ async fn extension_loop_idle(client: &Client, r: &RegisterResponse) -> Result<()

#[allow(clippy::too_many_lines)]
async fn extension_loop_active(
aws_config: &AwsConfig,
aws_config: Arc<AwsConfig>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing it to Arc so it can be passed to ApiKeyFactory and shared across threads.

@@ -26,42 +28,24 @@ pub struct Flusher {
client: reqwest::Client,
aggregator: Arc<Mutex<Aggregator>>,
config: Arc<config::Config>,
headers: HeaderMap,
api_key_factory: Arc<ApiKeyFactory>,
headers: OnceCell<HeaderMap>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lazily initialize headers, which includes api key.

aws_config: &AwsConfig,
aws_credentials: &mut AwsCredentials,
aws_config: Arc<AwsConfig>,
aws_credentials: Arc<RwLock<AwsCredentials>>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A core change: added RwLock

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah and we need an RwLock here because the factory will lazily write/update this struct member?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The factory fills in aws_access_key_id, aws_secret_access_key and aws_session_token if we are in snap start.

@@ -34,36 +36,47 @@ pub struct ServerlessStatsFlusher {
// pub buffer: Arc<Mutex<Vec<pb::ClientStatsPayload>>>,
aggregator: Arc<Mutex<StatsAggregator>>,
config: Arc<config::Config>,
endpoint: Endpoint,
api_key_factory: Arc<ApiKeyFactory>,
endpoint: OnceCell<Endpoint>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lazily resolve endpoint, which contains api key.

Box::pin(async move {
resolve_secrets(config, aws_config, aws_credentials)
.await
.expect("Failed to resolve API key")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any better way to handle this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm ideally, if failing on resolve you'd enter the noop loop, what would happen if you added that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will panic at flush time and stop the runtime, so I wonder if there's a way for the extension to stop gracefully at that time without stopping the runtime.

@lym953 lym953 marked this pull request as ready for review July 3, 2025 17:09
@lym953 lym953 requested a review from a team as a code owner July 3, 2025 17:09
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from d623d67 to 3724235 Compare July 7, 2025 17:02
Copy link
Contributor

@astuyve astuyve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work Yiming! Let's get this onto self monitoring to test for a few days

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from cec3247 to 727d04f Compare July 9, 2025 15:47
Comment on lines 335 to 351
let api_key_factory = {
let config = Arc::clone(&config);
let aws_config = Arc::clone(&aws_config);
let aws_credentials = Arc::clone(&aws_credentials);

Arc::new(ApiKeyFactory::new_from_resolver(Arc::new(move || {
let config = Arc::clone(&config);
let aws_config = Arc::clone(&aws_config);
let aws_credentials = Arc::clone(&aws_credentials);

Box::pin(async move {
resolve_secrets(config, aws_config, aws_credentials)
.await
.expect("Failed to resolve API key")
})
})))
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we could make this a method here on in the api key resolver, really don't like the pattern of having the declaration of the same variable 3 times just because we're nesting it, given how main code has increasingly grown, it would be good to hide this in some way and document it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Extracted a function create_api_key_factory()

Copy link
Contributor

@duncanista duncanista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR @lym953 !

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 3be2928 to e820be0 Compare July 10, 2025 20:18
@lym953 lym953 changed the base branch from main to jordan.gonzalez/trace-agent/aggregation-for-proxy July 10, 2025 20:19
@duncanista duncanista force-pushed the jordan.gonzalez/trace-agent/aggregation-for-proxy branch from 4c1162d to 07aa341 Compare July 14, 2025 19:11
Base automatically changed from jordan.gonzalez/trace-agent/aggregation-for-proxy to main July 16, 2025 19:47
lym953 added a commit that referenced this pull request Jul 16, 2025
…745)

# Background
Right now `SendData` is passed around across channels.

# This PR

Instead of passing `SendData`, pass `SendDataBuilderInfo`, which bundles
`SendDataBuilder` and payload size. Just before flush, call
`SendDataBuilder.build()` to build `SendData`.

# Motivation
DataDog/libdatadog#1140 (comment)
It is suggested that the function `set_api_key()` shouldn't be added on
`SendData`, but should be added on `SendDataBuilder`. Because need to
call `set_api_key()` just before flush, we need to make sure the object
is `SendDataBuilder` instead of `SendData` until flush time.

And because we need payload size in Trace Aggregator, and
`SendDataBuilder` doesn't expose this field, we need to pass it
explicitly along with `SendDataBuilder`.

# Next steps
Update #717
#732 so that
`get_api_key()` is called just before flush.

# Dependency
DataDog/libdatadog#1140
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 2ef2a9a to 37caca4 Compare July 16, 2025 20:51
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 37caca4 to ef63759 Compare July 16, 2025 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants