
feat: Handle API key resolution failure #732


Open: wants to merge 2 commits into base: yiming.luo/lazy-api-key-3

Conversation

@lym953 (Contributor) commented Jul 7, 2025

Context

The previous PR #717 defers API key resolution from the extension init stage to flush time. However, that PR doesn't handle the failure case well.

  • Before that PR, if resolution failed at the init stage, the extension would run an idle loop.
  • After that PR, the extension crashes at flush time, which kills the runtime as well; this is not desired.

What does this PR do?

  1. For traces, defer key resolution from TraceProcessor.process_traces() to TraceFlusher.flush().
    • (This should ideally have been in the previous PR, but since that one is already approved, I'm adding the change in this new PR.)
  2. If resolution fails at flush time, make flush a no-op, so the extension can keep running and consuming events without crashing (see the sketch below).
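
A minimal sketch of the intended no-op behavior, assuming an `ApiKeyFactory` whose `get_api_key()` returns an `Option`; the type names, signatures, and log message below are illustrative rather than the exact code in this PR:

```rust
use std::sync::Arc;
use tracing::error;

// Illustrative stand-in for the real factory; resolution may fail.
struct ApiKeyFactory;
impl ApiKeyFactory {
    async fn get_api_key(&self) -> Option<String> {
        None // e.g. the configured secret ARN is invalid
    }
}

struct ServerlessTraceFlusher {
    api_key_factory: Arc<ApiKeyFactory>,
}

impl ServerlessTraceFlusher {
    async fn flush(&self) {
        // Resolve the API key lazily, at flush time. On failure, log and
        // return early: flush becomes a no-op, the extension keeps running,
        // and the runtime is not killed.
        let Some(api_key) = self.api_key_factory.get_api_key().await else {
            error!("Failed to resolve API key, skipping flush");
            return;
        };

        // ... attach `api_key` to the aggregated traces and send them ...
        let _ = api_key;
    }
}
```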

Dependencies

  1. feat: Make ApiKeyFactory return Option<String> serverless-components#25
  2. Add functions to SendDataBuilder libdatadog#1140

Manual Test

Steps

  1. Create a layer in sandbox
  2. Apply the layer to a Lambda function
  3. Set the env var DD_API_KEY_SECRET_ARN to an invalid value
  4. Run the Lambda
  5. Then set DD_API_KEY_SECRET_ARN to a valid value
  6. Run the Lambda

Result

  1. The function was successful. (screenshot)
  2. The extension printed some error logs. (screenshots)
  3. With a valid secret ARN, the Lambda runs successfully and reports to Datadog. (screenshots)

Automated Test

I didn't add any automated tests because, from what I see in the codebase, existing tests are usually unit tests for short functions, not for the long functions that this PR touches. Please let me know if you think I should add automated tests.

debug!("Failed to send context spans to agent: {e}");
}
} else {
error!("Failed to process traces, skipping send");
lym953 (Author):

Processor won't send spans to TraceAggregator

Contributor:

can't we just avoid the extra allocation by doing if let Some(send_data) = trace_processor... else {} ?

if let Some(req) = self.create_request(batch.clone()).await {
set.spawn(async move { Self::send(req).await });
} else {
error!("Failed to create request");
lym953 (Author):

Flusher won't create HTTP requests to send data to Datadog at /api/v2/logs

}
} else {
error!("Failed to process traces, skipping send");
lym953 (Author):

OTLP Agent won't send traces to TraceFlusher

}
};
} else {
error!("Failed to create endpoint");
lym953 (Author):

ServerlessStatsFlusher won't send stats to Datadog's endpoint.

),
}
} else {
error!("Failed to process traces, skipping send");
lym953 (Author):

TraceAgent won't send traces to TraceFlusher

),
}
} else {
error_response(
lym953 (Author):

TraceAgent proxy won't send data to Datadog

));
Some(send_data)
} else {
error!("Failed to resolve API key");
lym953 (Author):

TraceProcessor won't process traces

@lym953 changed the title from "feat: Properly handle API key resolution failure" to "feat: Handle API key resolution failure" on Jul 7, 2025
@lym953 lym953 requested a review from Copilot July 8, 2025 19:03
@Copilot (Copilot AI) left a comment:

Pull Request Overview

This PR enhances resilience by turning API key resolution failures into no-ops instead of crashing, allowing the extension to continue running. Key changes include:

  • Converting process_traces to return Option<SendData> and guarding all flush/send paths across multiple components.
  • Adding if let Some checks around API key resolution in trace, stats, logs, OTLP, and invocation processors.
  • Updating the dogstatsd dependency revision in Cargo.toml.

Reviewed Changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 2 comments.

Summary per file:

| File | Description |
| --- | --- |
| trace_processor.rs | Changed return type to `Option<SendData>` and added an `if let Some(api_key)` guard around endpoint construction. |
| trace_agent.rs | Added an initial `None` check for `send_data` and unified error responses when the API key or `send_data` is missing. |
| stats_flusher.rs | Changed the endpoint cell to `OnceCell<Option<Endpoint>>` and wrapped the stats send logic in `if let Some` for the API key and endpoint. |
| otlp/agent.rs | Wrapped the `process_traces` result in `if let Some(send_data)` to skip sending when API key resolution fails. |
| logs/flusher.rs | Changed the cached headers to `OnceCell<Option<HeaderMap>>` and made `create_request` return `Option<…>` to skip sends. |
| lifecycle/invocation/processor.rs | Updated the invocation processor to skip sending when `process_traces` returns `None`. |
| Cargo.toml | Bumped the dogstatsd revision to 0add16260cca1ec01729a3d99f5a40cf246a2c38. |
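
For reference, a minimal sketch of the `OnceCell<Option<…>>` caching pattern mentioned for stats_flusher.rs and logs/flusher.rs in the summary above; the `Endpoint` type, URL, and resolver below are stand-ins, not the exact code in the PR:

```rust
use tokio::sync::OnceCell;

// Stand-in types for illustration only.
struct Endpoint {
    url: String,
    api_key: String,
}

async fn resolve_api_key() -> Option<String> {
    None // e.g. the Secrets Manager lookup failed
}

struct ServerlessStatsFlusher {
    endpoint: OnceCell<Option<Endpoint>>,
}

impl ServerlessStatsFlusher {
    async fn endpoint(&self) -> &Option<Endpoint> {
        // The first result, including a failed resolution (None), is cached
        // for the lifetime of the cell; send paths then guard on `if let Some`.
        self.endpoint
            .get_or_init(|| async {
                resolve_api_key().await.map(|api_key| Endpoint {
                    url: "https://trace.agent.datadoghq.com".to_string(), // placeholder
                    api_key,
                })
            })
            .await
    }
}
```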
Comments suppressed due to low confidence (2)

bottlecap/src/traces/trace_processor.rs:170

  • The call to `to_string().clone()` is redundant; `to_string()` already returns a `String`. You can simplify to `api_key: Some(api_key.into())`.
                api_key: Some(api_key.to_string().into()),

bottlecap/src/traces/trace_processor.rs:130

  • Consider adding a unit test for the new None return path when API key resolution fails, to ensure that process_traces correctly returns None and skips sending.
    ) -> Option<SendData>;

Comment on lines 511 to 508
} else {
error!("Failed to process traces, skipping send");
error_response(
StatusCode::INTERNAL_SERVER_ERROR,
format!("Error sending traces to the trace flusher: {err}"),
),
"Failed to process traces, skipping send",
)
Copilot AI commented Jul 8, 2025:

This else branch is unreachable because you return above when send_data is None. Consider removing the second if let/else and unifying error handling for clarity.

Suggested change
} else {
error!("Failed to process traces, skipping send");
error_response(
StatusCode::INTERNAL_SERVER_ERROR,
format!("Error sending traces to the trace flusher: {err}"),
),
"Failed to process traces, skipping send",
)


Some(Endpoint {
url: hyper::Uri::from_str(&stats_url)
.expect("can't make URI from stats url, exiting"),
api_key: Some(api_key.to_string().clone().into()),
Copilot AI commented Jul 8, 2025:

Similar to the other location, the .clone() on the result of to_string() is redundant. Use api_key.into() to simplify.

Suggested change
api_key: Some(api_key.to_string().clone().into()),
api_key: Some(api_key.into()),


lym953 (Author):

Won't do. I would get an error after doing this: (screenshot)

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from 1b2e85f to b60bd54 Compare July 8, 2025 19:19
@lym953 lym953 marked this pull request as ready for review July 8, 2025 19:23
@lym953 lym953 requested a review from a team as a code owner July 8, 2025 19:23
Comment on lines 64 to 69
if let Some(req) = self.create_request(batch.clone()).await {
set.spawn(async move { Self::send(req).await });
} else {
error!("Failed to create request");
continue;
}


let-else is the normal way to avoid extra indentation

Suggested change
if let Some(req) = self.create_request(batch.clone()).await {
set.spawn(async move { Self::send(req).await });
} else {
error!("Failed to create request");
continue;
}
let Some(req) = self.create_request(batch.clone()).await else {
    error!("Failed to create request");
    continue;
};
set.spawn(async move { Self::send(req).await });

lym953 (Author):

That's the perfect answer!!

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from cec3247 to 727d04f Compare July 9, 2025 15:47
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from e64164e to 29a299d Compare July 9, 2025 15:49
let headers = self.get_headers().await;
self.client
let Some(headers) = self.get_headers().await else {
return None;
Contributor:

Mmm, although I understand the code, I find it a little confusing that get_headers is responsible for deciding whether or not we're creating a request.

Would it make more sense to rearchitect this so that whenever we definitely know we are about to flush (let's say in the flush() method), we try to get the API key?

@duncanista (Contributor) left a comment:

Left a comment which I would like to see if we can work around. The main idea is: could we rearchitect so that whenever we hit flush, we try to resolve the API key and then do the later work based on it? Instead, we're currently failing in get_headers when trying to get an API key, but these concerns look like they should be separated 🤔

LMK what you think

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch 2 times, most recently from 66676fe to d10fb07 Compare July 9, 2025 17:59
@@ -78,6 +79,8 @@ pub struct Processor {
///
/// These tags are used to capture runtime and initialization.
dynamic_tags: HashMap<String, String>,
/// Function to resolve Datadog API key.
api_key_factory: Arc<ApiKeyFactory>,
lym953 (Author):

Add it to the outer struct Processor

@@ -502,6 +507,11 @@ impl Processor {
trace_processor: &Arc<dyn TraceProcessor + Send + Sync>,
trace_agent_tx: &Sender<SendData>,
) {
let Some(api_key) = self.api_key_factory.get_api_key().await else {
lym953 (Author):

... so we can abort earlier here, without needing to touch many functions

@@ -55,13 +55,18 @@ impl Flusher {

pub async fn flush(&self, batches: Option<Arc<Vec<Vec<u8>>>>) -> Vec<reqwest::RequestBuilder> {
let mut set = JoinSet::new();
let api_key = self.api_key_factory.get_api_key().await;
let Some(api_key) = api_key else {
lym953 (Author):

Abort early at the beginning of Flusher.flush()

.process_traces(
self.config.clone(),
tags_provider.clone(),
header_tags,
vec![traces],
body_size,
self.inferrer.span_pointers.clone(),
api_key,
lym953 (Author):

Passing api_key to process_traces(), so process_traces() won't need to handle failure inside.

@@ -146,10 +151,9 @@ impl Flusher {
}
}

async fn get_headers(&self) -> &HeaderMap {
async fn get_headers(&self, api_key: &str) -> &HeaderMap {
lym953 (Author):

Passing in api_key, so get_headers() won't need to handle the failure
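
A rough sketch of the reshaped helper under that change, using reqwest's `HeaderMap` and tokio's `OnceCell`; the header name and caching details are assumptions and may not match the PR exactly:

```rust
use reqwest::header::{HeaderMap, HeaderValue};
use tokio::sync::OnceCell;

struct Flusher {
    headers: OnceCell<HeaderMap>,
}

impl Flusher {
    // With the key passed in by the caller, this helper only builds and
    // caches the headers; it no longer decides whether a request is created.
    async fn get_headers(&self, api_key: &str) -> &HeaderMap {
        self.headers
            .get_or_init(|| async {
                let mut headers = HeaderMap::new();
                if let Ok(value) = HeaderValue::from_str(api_key) {
                    headers.insert("DD-API-KEY", value);
                }
                headers
            })
            .await
    }
}
```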

@@ -30,6 +32,7 @@ type AgentState = (
OtlpProcessor,
Arc<dyn TraceProcessor + Send + Sync>,
Sender<SendData>,
Arc<ApiKeyFactory>,
lym953 (Author):

Adding it to AgentState and Agent

request: Request,
) -> Response {
let Some(api_key) = api_key_factory.get_api_key().await else {
lym953 (Author):

Abort at the beginning of v1_traces API handler

lym953 (Author):

Is it possible that the customer's code calls the /v1/traces API synchronously, and we slow down the customer's Lambda by doing the heavy operation of resolving the API key here?
If so, it might be better to further defer key resolution by moving it out of the API handler.

Contributor:

This gets called by an exporter, probably at the end of the function, so yeah, it would happen during the runtime's execution.

@@ -60,16 +60,20 @@ impl StatsFlusher for ServerlessStatsFlusher {
return;
}

let Some(api_key) = self.api_key_factory.get_api_key().await else {
lym953 (Author):

Abort at the beginning of StatsFlusher.send()

) -> Response {
let Some(api_key) = api_key_factory.get_api_key().await else {
lym953 (Author):

Abort at the beginning of v0.4 and v0.5 traces API handler

lym953 (Author):

ditto: Is it possible that the customer's code calls the traces API synchronously, and we slow down the customer's Lambda by doing the heavy operation of resolving the API key here?
If so, it might be better to further defer key resolution by moving it out of the API handler.

Contributor:

yeah we need to defer this to the flusher as this API is called synchronously

@@ -531,6 +546,14 @@ impl TraceAgent {
Err(e) => return error_response(StatusCode::INTERNAL_SERVER_ERROR, e),
};

let Some(api_key) = api_key_factory.get_api_key().await else {
lym953 (Author):

Abort at the beginning of handle_proxy()

@lym953 (Author) commented Jul 9, 2025:

> Left a comment which I would like to see if we can work around. The main idea is: could we rearchitect so that whenever we hit flush, we try to resolve the API key and then do the later work based on it? Instead, we're currently failing in get_headers when trying to get an API key, but these concerns look like they should be separated

@duncanista Good point! Made a lot of changes.

One concern: this PR (and the last one) only defers key resolution from init time to the trace API handler (if the trace API handler is called), not to flush time. Although this can shorten cold start time, it can make the invoke phase slower. Is that a problem? (Correct me if my understanding is wrong.)

@@ -22,6 +22,8 @@ use crate::{
traces::trace_processor::TraceProcessor,
};

use dogstatsd::api_key::ApiKeyFactory;
Contributor:

Question: do we want to move this to its own file? Wondering if all the other components should be relying on dogstatsd as a dependency just for ApiKeyFactory.

Contributor:

This doesn't need to be done now, but it would be good not to make them depend on a metrics module.

@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 3be2928 to e820be0 Compare July 10, 2025 20:18
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch 2 times, most recently from 4ff6353 to aaeefdd Compare July 10, 2025 22:40
@@ -197,7 +197,7 @@ impl ConfigBuilder {
// or in the `proxy_no_proxy` config field.
if self.config.proxy_https.is_some() {
let site_in_no_proxy = std::env::var("NO_PROXY")
.map_or(false, |no_proxy| no_proxy.contains(&self.config.site))
.is_ok_and(|no_proxy| no_proxy.contains(&self.config.site))
lym953 (Author):

fixes a new clippy error due to upgrade
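
A tiny standalone illustration of that lint fix; the env var and site value are just for demonstration:

```rust
fn main() {
    let no_proxy: Result<String, std::env::VarError> = std::env::var("NO_PROXY");
    let site = "datadoghq.com";

    // Old pattern, flagged by the newer clippy:
    //     no_proxy.map_or(false, |v| v.contains(site))
    // Equivalent replacement suggested by the lint:
    let site_in_no_proxy = no_proxy.is_ok_and(|v| v.contains(site));

    println!("site_in_no_proxy = {site_in_no_proxy}");
}
```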

@@ -543,7 +543,7 @@ where
{
struct KeyValueVisitor;

impl<'de> serde::de::Visitor<'de> for KeyValueVisitor {
impl serde::de::Visitor<'_> for KeyValueVisitor {
lym953 (Author):

fixes a new clippy error due to upgrade

@@ -1,6 +1,7 @@
use datadog_trace_protobuf::pb::ClientStatsPayload;
use std::collections::VecDeque;

#[allow(clippy::empty_line_after_doc_comments)]
lym953 (Author):

New clippy error

@@ -85,7 +84,6 @@ pub struct StatsState {
pub struct ProxyState {
pub config: Arc<config::Config>,
pub proxy_aggregator: Arc<Mutex<proxy_aggregator::Aggregator>>,
pub api_key_factory: Arc<ApiKeyFactory>,
lym953 (Author):

Moved to trace_flusher.

@@ -57,6 +76,10 @@ impl TraceFlusher for ServerlessTraceFlusher {
// Process new traces from the aggregator
let mut guard = self.aggregator.lock().await;
let mut traces = guard.get_batch();
// Lazily set the API key
for trace in &mut traces {
trace.get_target_mut().api_key = Some(api_key.to_string().into());
lym953 (Author):

I need to add get_target_mut() in DataDog/libdatadog#1140

lym953 added a commit that referenced this pull request Jul 16, 2025
…745)

# Background
Right now `SendData` is passed around across channels.

# This PR

Instead of passing `SendData`, pass `SendDataBuilderInfo`, which bundles
`SendDataBuilder` and payload size. Just before flush, call
`SendDataBuilder.build()` to build `SendData`.

# Motivation
DataDog/libdatadog#1140 (comment)
It is suggested that the function `set_api_key()` shouldn't be added on
`SendData`, but on `SendDataBuilder`. Because we need to call
`set_api_key()` just before flush, we need to make sure the object stays a
`SendDataBuilder` instead of a `SendData` until flush time.

And because we need the payload size in the Trace Aggregator, and
`SendDataBuilder` doesn't expose this field, we need to pass it
explicitly along with `SendDataBuilder` (a sketch follows below).

# Next steps
Update #717 and #732 so that `get_api_key()` is called just before flush.

# Dependency
DataDog/libdatadog#1140
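
A minimal sketch of the bundling described in the commit message above; `SendData` and `SendDataBuilder` are stand-in stubs here (the real ones live in libdatadog), and the `set_api_key` shape is only an approximation of what libdatadog#1140 discusses:

```rust
// Stand-in stubs for the libdatadog types, for illustration only.
pub struct SendData;

pub struct SendDataBuilder {
    api_key: Option<String>,
}

impl SendDataBuilder {
    pub fn set_api_key(mut self, api_key: &str) -> Self {
        self.api_key = Some(api_key.to_string());
        self
    }

    pub fn build(self) -> SendData {
        SendData
    }
}

// Bundle the builder with the payload size the Trace Aggregator needs,
// since the builder itself does not expose that field.
pub struct SendDataBuilderInfo {
    pub builder: SendDataBuilder,
    pub size: usize,
}

// At flush time, once the API key has been resolved, finish the build.
pub fn into_send_data(info: SendDataBuilderInfo, api_key: &str) -> SendData {
    info.builder.set_api_key(api_key).build()
}
```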
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 2ef2a9a to 37caca4 Compare July 16, 2025 20:51
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from 7397582 to 43030bf Compare July 16, 2025 21:39
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-3 branch from 37caca4 to ef63759 Compare July 16, 2025 21:46
Simplify logic in StatsFlusher

Move api_key_factory out of TraceProcessor

Move some code

Avoid resolving key in trace api and proxy

Apply to proxy flusher

Resolve conflicts

Make trace flusher resolve api key

Fix Clippy lint

Format

Use SendData.set_api_key()

Fix errors

Improve comments
@lym953 lym953 force-pushed the yiming.luo/lazy-api-key-error branch from 13449b8 to d10e087 Compare July 16, 2025 21:47