-
Notifications
You must be signed in to change notification settings - Fork 13
feat: Lazily resolve api key #717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
f749307
to
11c1c77
Compare
# Problem Right now `AwsConfig` has a lot of fields, including the ones related to credential: ``` pub aws_access_key_id: String, pub aws_secret_access_key: String, pub aws_session_token: String, pub aws_container_credentials_full_uri: String, pub aws_container_authorization_token: String, ``` The next PR #717 wants to lazily load API key and the credentials. To do that, for the resolver function `resolve_secrets()`, I need to change the param `aws_config` from `&AwsConfig` to `Arc<RwLock<AwsConfig>>`. Because `aws_config` is passed to many places, this change involves updating lots of functions, which is formidable. # This PR Separates these credential-related fields out from `AwsConfig` and creates a new struct `AwsCredentials` Thus, the next PR will only need to change the param `aws_credentials` from `&AwsCredentials` to `Arc<RwLock<AwsCredentials>>`. Because `aws_credentials` is not used in lots of places, the next PR becomes easier. https://datadoghq.atlassian.net/issues/SVLS-6996 https://datadoghq.atlassian.net/issues/SVLS-6998
fdd93a4
to
4ac16ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR integrates ApiKeyFactory
across Bottlecap to defer DD-API-KEY resolution until flush/send time, reducing init latency. Key changes include:
- Replace direct API key strings with
Arc<ApiKeyFactory>
in all flushers, agents, and tests - Refactor trace/stat flusher and log flusher to initialize endpoints and headers lazily via
OnceCell
- Update
resolve_secrets
to use an asyncRwLock
for AWS credentials and adjust related helper signatures
Reviewed Changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
tests/metrics_integration_test.rs | Switched FlusherConfig to use ApiKeyFactory |
tests/logs_integration_test.rs | Switched LogsFlusher instantiation to use factory |
src/traces/trace_processor.rs | Made process_traces async and use ApiKeyFactory |
src/traces/trace_agent.rs | Replaced stored key with factory; await per request |
src/traces/stats_flusher.rs | Swapped in factory and lazily build Endpoint |
src/secrets/decrypt.rs | Converted credentials to Arc<RwLock<_>> and updated calls |
src/proxy/mod.rs | Changed should_start_proxy to take Arc<AwsConfig> |
src/proxy/interceptor.rs | Updated interceptor to use Arc<AwsConfig> |
src/otlp/agent.rs | Updated OTLP agent to await process_traces |
src/logs/flusher.rs | Introduced factory plus lazy HeaderMap caching |
src/lifecycle/invocation/span_inferrer.rs | Updated tests to pass Arc<AwsConfig> |
src/lifecycle/invocation/processor.rs | Refactored processor to use Arc<AwsConfig> |
Cargo.toml | Bumped dogstatsd revision |
endpoint: OnceCell<Endpoint>, | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Caching the Endpoint
with an initial API key may lead to stale keys if ApiKeyFactory
rotates or refreshes credentials. Consider regenerating or invalidating the cell when the underlying key changes.
endpoint: OnceCell<Endpoint>, | |
} | |
} | |
impl ServerlessStatsFlusher { | |
async fn construct_endpoint(&self) -> Endpoint { | |
let api_key = self.api_key_factory.get_api_key().await.to_string(); | |
let stats_url = trace_stats_url(&self.config.site); | |
Endpoint { | |
url: hyper::Uri::from_str(&stats_url) | |
.expect("can't make URI from stats url, exiting"), | |
api_key: Some(api_key.clone().into()), | |
timeout_ms: self.config.flush_timeout * 1_000, | |
test_token: None, | |
} | |
} | |
} |
Copilot uses AI. Check for mistakes.
@@ -26,42 +28,24 @@ pub struct Flusher { | |||
client: reqwest::Client, | |||
aggregator: Arc<Mutex<Aggregator>>, | |||
config: Arc<config::Config>, | |||
headers: HeaderMap, | |||
api_key_factory: Arc<ApiKeyFactory>, | |||
headers: OnceCell<HeaderMap>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing headers including DD-API-KEY
in a OnceCell
will cache the first resolved key indefinitely. If the key can change at runtime, you might emit outdated headers; consider refreshing per flush or using a time-to-live.
headers: OnceCell<HeaderMap>, | |
headers: Arc<Mutex<HeaderMap>>, |
Copilot uses AI. Check for mistakes.
@@ -323,7 +329,7 @@ mod tests { | |||
}; | |||
|
|||
let received_payload = | |||
if let TracerPayloadCollection::V07(payload) = tracer_payload.get_payloads() { | |||
if let TracerPayloadCollection::V07(payload) = tracer_payload.await.get_payloads() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Awaiting the future inline in the match expression can reduce readability. It may help to .await
process_traces
into a local send_data
variable first, then call get_payloads()
on it.
Copilot uses AI. Check for mistakes.
bottlecap/src/bin/bottlecap/main.rs
Outdated
let aws_config = Arc::new(aws_config); | ||
let aws_credentials = Arc::new(RwLock::new(aws_credentials)); | ||
let api_key_factory = { | ||
let config = Arc::clone(&config); | ||
let aws_config = Arc::clone(&aws_config); | ||
let aws_credentials = Arc::clone(&aws_credentials); | ||
|
||
Arc::new(ApiKeyFactory::new_from_resolver(Arc::new(move || { | ||
let config = Arc::clone(&config); | ||
let aws_config = Arc::clone(&aws_config); | ||
let aws_credentials = Arc::clone(&aws_credentials); | ||
|
||
Box::pin(async move { | ||
resolve_secrets(config, aws_config, aws_credentials) | ||
.await | ||
.unwrap_or_else(|| { | ||
error!("Failed to resolve API key"); | ||
String::new() | ||
}) | ||
}) | ||
}))) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During INIT phase, instead of resolving API key, just initialize an API key factory.
@@ -415,11 +430,11 @@ async fn extension_loop_idle(client: &Client, r: &RegisterResponse) -> Result<() | |||
|
|||
#[allow(clippy::too_many_lines)] | |||
async fn extension_loop_active( | |||
aws_config: &AwsConfig, | |||
aws_config: Arc<AwsConfig>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing it to Arc
so it can be passed to ApiKeyFactory
and shared across threads.
@@ -26,42 +28,24 @@ pub struct Flusher { | |||
client: reqwest::Client, | |||
aggregator: Arc<Mutex<Aggregator>>, | |||
config: Arc<config::Config>, | |||
headers: HeaderMap, | |||
api_key_factory: Arc<ApiKeyFactory>, | |||
headers: OnceCell<HeaderMap>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lazily initialize headers
, which includes api key.
aws_config: &AwsConfig, | ||
aws_credentials: &mut AwsCredentials, | ||
aws_config: Arc<AwsConfig>, | ||
aws_credentials: Arc<RwLock<AwsCredentials>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A core change: added RwLock
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah and we need an RwLock here because the factory will lazily write/update this struct member?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The factory fills in aws_access_key_id
, aws_secret_access_key
and aws_session_token
if we are in snap start.
@@ -34,36 +36,47 @@ pub struct ServerlessStatsFlusher { | |||
// pub buffer: Arc<Mutex<Vec<pb::ClientStatsPayload>>>, | |||
aggregator: Arc<Mutex<StatsAggregator>>, | |||
config: Arc<config::Config>, | |||
endpoint: Endpoint, | |||
api_key_factory: Arc<ApiKeyFactory>, | |||
endpoint: OnceCell<Endpoint>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lazily resolve endpoint
, which contains api key.
bottlecap/src/bin/bottlecap/main.rs
Outdated
Box::pin(async move { | ||
resolve_secrets(config, aws_config, aws_credentials) | ||
.await | ||
.expect("Failed to resolve API key") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any better way to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm ideally, if failing on resolve you'd enter the noop loop, what would happen if you added that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will panic at flush time and stop the runtime, so I wonder if there's a way for the extension to stop gracefully at that time without stopping the runtime.
d623d67
to
3724235
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work Yiming! Let's get this onto self monitoring to test for a few days
cec3247
to
727d04f
Compare
bottlecap/src/bin/bottlecap/main.rs
Outdated
let api_key_factory = { | ||
let config = Arc::clone(&config); | ||
let aws_config = Arc::clone(&aws_config); | ||
let aws_credentials = Arc::clone(&aws_credentials); | ||
|
||
Arc::new(ApiKeyFactory::new_from_resolver(Arc::new(move || { | ||
let config = Arc::clone(&config); | ||
let aws_config = Arc::clone(&aws_config); | ||
let aws_credentials = Arc::clone(&aws_credentials); | ||
|
||
Box::pin(async move { | ||
resolve_secrets(config, aws_config, aws_credentials) | ||
.await | ||
.expect("Failed to resolve API key") | ||
}) | ||
}))) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we could make this a method here on in the api key resolver, really don't like the pattern of having the declaration of the same variable 3 times just because we're nesting it, given how main code has increasingly grown, it would be good to hide this in some way and document it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! Extracted a function create_api_key_factory()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great PR @lym953 !
3be2928
to
e820be0
Compare
4c1162d
to
07aa341
Compare
…745) # Background Right now `SendData` is passed around across channels. # This PR Instead of passing `SendData`, pass `SendDataBuilderInfo`, which bundles `SendDataBuilder` and payload size. Just before flush, call `SendDataBuilder.build()` to build `SendData`. # Motivation DataDog/libdatadog#1140 (comment) It is suggested that the function `set_api_key()` shouldn't be added on `SendData`, but should be added on `SendDataBuilder`. Because need to call `set_api_key()` just before flush, we need to make sure the object is `SendDataBuilder` instead of `SendData` until flush time. And because we need payload size in Trace Aggregator, and `SendDataBuilder` doesn't expose this field, we need to pass it explicitly along with `SendDataBuilder`. # Next steps Update #717 #732 so that `get_api_key()` is called just before flush. # Dependency DataDog/libdatadog#1140
2ef2a9a
to
37caca4
Compare
Fix clippy
37caca4
to
ef63759
Compare
Motivation
From @astuyve:
https://datadoghq.atlassian.net/browse/SVLS-6995
Previous work
DataDog/serverless-components#21, DataDog/serverless-components#24 created
ApiKeyFactory
, which is a util to enable lazy API key resolution.This PR
Updates Bottlecap code to use
ApiKeyFactory
to lazily resolve API key, i.e. instead of resolving it by querying Secret Manager or KMS during init phase, do it at flushing time when api key is actually needed.Note
This PR changes the behavior when key resolution fails, i.e. when
resolve_secrets()
returnsNone
.extension_loop_idle()
, which does not stop the runtimeextension_loop_idle()
because api key resolution code is not in the main loop anymore, but in various consumer code of api keyUpdate: Added a PR to address resolution failure: #732
These two PRs should be merged together. Keeping them separate PRs just to make review easier.
Testing
Setup
Result
Below is the
Datadog Next-Gen Extension ready in:
time logged.Before: (prod extension
arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Extension-ARM:82
)After: (test extension
arn:aws:lambda:us-east-1:425362996713:layer:Datadog-Bottlecap-Beta-ARM-yiming:2
)Both use 5 samples.
Notes
https://datadoghq.atlassian.net/issues/SVLS-6996
https://datadoghq.atlassian.net/issues/SVLS-6998