Alerting and Log Tracing
When developing and maintaining a serverless application like the ACM Core Platform, effective alerting and log tracing become essential for troubleshooting issues. This guide explains how to respond to alerts and how to trace requests through our system.
TODO: Write this section :)
In a distributed serverless architecture, a single user action may trigger multiple services (API Gateway, Lambda, SQS, etc.). Correlating logs across these services is crucial for understanding the complete request lifecycle and identifying where issues occur.
Our platform implements a consistent request tracing strategy that allows you to follow a request through all system components.
Every HTTP response from our API includes an X-Amzn-Request-ID header. This ID serves as the primary correlation key for tracing the request through our system.
To find this ID:
- In browser developer tools, check the Network tab and inspect the response headers
- If using Postman or similar tools, check the Response Headers section
- For browser errors, review the console logs which often contain the request ID
The X-Amzn-Request-ID header looks something like: 76e06578-9d37-4736-a1fa-e3c1bf3a421b
Once you have the request ID:
- Go to AWS CloudWatch Logs
- Navigate to the /aws/lambda/infra-core-api-lambda log group
  - We use the same CloudWatch log group for both API and SQS. SQS log streams are prefixed with infra-core-api-sqs-lambda[$LATEST].
- Use the search functionality to find logs containing your request ID
In our logs, the request ID appears in the reqId field:
{
"level": 30,
"time": 1742660845671,
"pid": 2,
"hostname": "169.254.98.121",
"reqId": "18c3880f-1ef3-4534-9ecc-fb287ab274d3",
"url": "/api/v1/healthz",
"statusCode": 200,
"durationMs": 1,
"msg": "request completed"
}
This will return all log entries related to that specific request, allowing you to trace it through the system.
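If you prefer to run the same search from a script rather than the console, the filter pattern works with the AWS SDK as well. Below is a minimal sketch using Python and boto3; it assumes you have AWS credentials with read access to the log group, and the request ID is a placeholder.

```python
import boto3

LOG_GROUP = "/aws/lambda/infra-core-api-lambda"
REQ_ID = "76e06578-9d37-4736-a1fa-e3c1bf3a421b"  # placeholder request ID

logs = boto3.client("logs")
paginator = logs.get_paginator("filter_log_events")

# Structured JSON filter, equivalent to searching { $.reqId = "..." } in the console
pattern = f'{{ $.reqId = "{REQ_ID}" }}'

for page in paginator.paginate(logGroupName=LOG_GROUP, filterPattern=pattern):
    for event in page["events"]:
        print(event["message"])
```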
Asynchronous operations that use SQS require a slightly different approach.
When an API request triggers an SQS message (e.g., sending an email, provisioning a membership):
- Find the X-Amzn-Request-ID as described above
- Search for this ID in the logs to find the SQS message ID that was generated
- Once you have the SQS message ID, search for that ID to trace the message processing
Some API routes return the queueId in the response body with HTTP status 202. If you have the queue ID, it might be easier to use that and query the SQS log streams.
Our logs include both the original request ID and the SQS message ID in the metadata, creating a correlation chain.
For SQS messages triggered by webhooks (e.g., Stripe events):
- Find the Stripe event ID from the webhook payload or Stripe dashboard (it starts with evt_).
- Search the logs for this event ID, which will be in the initiator field of our SQS metadata.
- Alternatively, if you know the approximate time, you can search for the type of webhook (e.g., checkout.session.completed).
Every log entry in our SQS consumer includes the SQS message ID in a structured field, making it easy to trace a specific message's processing:
{
"level": 30,
"time": 1742433189419,
"pid": 2,
"hostname": "169.254.7.197",
"context": "sqsHandler",
"sqsMessageId": "ab562a92-1d62-4820-94fa-8a67cd767e39",
"metadata": {
"reqId": "a9392702-d807-4248-b0dd-4f23f3a58cf5",
"initiator": "evt_1R4XfTDiGOXU9RuSLmMwOcQU"
},
"function": "provisionNewMember",
"msg": "Starting handler for provisionNewMember..."
}
Notice how the log contains:
- sqsMessageId: The unique ID of the SQS message
- metadata.reqId: The original request ID that triggered this operation
- metadata.initiator: The event ID (in this case a Stripe event) that initiated the process
To trace a specific SQS message:
- Search for the SQS message ID using:
{ $.sqsMessageId = "eb60c704-eae9-4d74-b7e1-c6c6cd3fddad" }
- Or search for the original request:
{ $.metadata.reqId = "d262f298-8bb4-4099-8798-53b06c674f09" }
- Or search for a Stripe event:
{ $.metadata.initiator = "evt_1R4XfTDiGOXU9RuSLmMwOcQU" }
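These three filters can also be chained in a small script to follow the whole correlation trail. The sketch below (Python/boto3, placeholder IDs, read access to the log group assumed) starts from a Stripe event ID, pulls the matching SQS consumer logs, and then fetches every log line for each correlated SQS message.

```python
import json
import boto3

LOG_GROUP = "/aws/lambda/infra-core-api-lambda"
logs = boto3.client("logs")

def search(filter_pattern):
    """Return parsed JSON log events matching a CloudWatch filter pattern."""
    paginator = logs.get_paginator("filter_log_events")
    entries = []
    for page in paginator.paginate(logGroupName=LOG_GROUP, filterPattern=filter_pattern):
        for event in page["events"]:
            try:
                entries.append(json.loads(event["message"]))
            except json.JSONDecodeError:
                pass  # skip non-JSON lines such as Lambda START/END/REPORT records
    return entries

# Step 1: find SQS consumer logs initiated by a Stripe event (placeholder ID).
stripe_event = "evt_1R4XfTDiGOXU9RuSLmMwOcQU"
initiated = search(f'{{ $.metadata.initiator = "{stripe_event}" }}')

# Step 2: collect the correlated IDs and pull every log line for each message.
for entry in initiated:
    msg_id = entry.get("sqsMessageId")
    req_id = entry.get("metadata", {}).get("reqId")
    print(f"sqsMessageId={msg_id} reqId={req_id}")
    if msg_id:
        for line in search(f'{{ $.sqsMessageId = "{msg_id}" }}'):
            print("  ", line.get("msg"))
```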
When troubleshooting SQS-related problems, examining queue statistics can provide valuable insights:
- Go to the AWS SQS console
- Select the relevant queue (e.g., infra-core-api-sqs-queue)
- Review these key metrics:
  - ApproximateNumberOfMessages: Current number of messages available for retrieval
  - ApproximateNumberOfMessagesDelayed: Messages in the queue that are delayed and not yet available for processing
  - ApproximateNumberOfMessagesNotVisible: Messages that are being processed but not yet deleted
  - ApproximateAgeOfOldestMessage: Age of the oldest message in the queue (exposed as a CloudWatch metric); helps identify processing delays
Unusually high values in these metrics often indicate processing bottlenecks:
- High ApproximateNumberOfMessages suggests messages are being produced faster than they can be consumed
- High ApproximateNumberOfMessagesNotVisible might indicate consumers are taking too long to process messages or are failing to delete them after processing
- High ApproximateAgeOfOldestMessage can reveal stuck messages or processing issues
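The message counts can also be pulled programmatically for quick checks. A sketch using boto3, assuming the queue name above and credentials with SQS read access (the age of the oldest message is not a queue attribute, only a CloudWatch metric, so it is not included here):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="infra-core-api-sqs-queue")["QueueUrl"]

attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=[
        "ApproximateNumberOfMessages",
        "ApproximateNumberOfMessagesDelayed",
        "ApproximateNumberOfMessagesNotVisible",
    ],
)["Attributes"]

for name, value in attrs.items():
    print(f"{name}: {value}")
```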
You can also examine the CloudWatch metrics for SQS queues to identify trends over time:
- Go to CloudWatch Metrics
- Select "SQS" from the metrics namespace
- Choose the queue name and relevant metrics
- Check for patterns such as:
- Sudden increases in message count
- Spikes in processing time
- Consistent growth in queue depth without corresponding processing activity
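If you want to script these trend checks instead of using the console, the same data is available from the CloudWatch API. A sketch with boto3, assuming the queue name above; ApproximateNumberOfMessagesVisible is CloudWatch's name for queue depth:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",  # queue depth over time
    Dimensions=[{"Name": "QueueName", "Value": "infra-core-api-sqs-queue"}],
    StartTime=start,
    EndTime=end,
    Period=300,             # 5-minute buckets
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].isoformat(), point["Maximum"])
```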
For messages that fail processing multiple times, check the Dead-Letter Queue (DLQ):
- Go to the AWS SQS console
- Select the dead-letter queue (e.g., infra-core-api-sqs-dlq)
- View messages in the queue
- Examine the message attributes and body to understand the failure cause
- Check how many times the message was received before being sent to the DLQ (using the ApproximateReceiveCount attribute)
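You can also peek at DLQ messages from a script. The sketch below (boto3, queue name assumed from above) uses VisibilityTimeout=0 so the messages remain visible to other consumers; note that each receive still increments ApproximateReceiveCount.

```python
import boto3

sqs = boto3.client("sqs")
dlq_url = sqs.get_queue_url(QueueName="infra-core-api-sqs-dlq")["QueueUrl"]

resp = sqs.receive_message(
    QueueUrl=dlq_url,
    MaxNumberOfMessages=10,
    AttributeNames=["ApproximateReceiveCount", "SentTimestamp"],
    VisibilityTimeout=0,   # peek only; leave the messages in the queue
    WaitTimeSeconds=1,
)

for msg in resp.get("Messages", []):
    attrs = msg["Attributes"]
    print(f"receives={attrs['ApproximateReceiveCount']} "
          f"sent={attrs['SentTimestamp']} body={msg['Body'][:200]}")
```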
Audit logs capture important operational events like permission changes, critical data modifications, and sensitive actions.
To find audit logs:
- Search CloudWatch logs with the filter pattern:
{ $.type = "audit" }
- This will return all structured log entries that have the audit type
For example, a typical audit log might look like:
{
"level": 30,
"time": 1742433190419,
"pid": 2,
"hostname": "169.254.7.197",
"type": "audit",
"module": "iam",
"actor": "[email protected]",
"target": "[email protected]",
"msg": "added target to group ID efd48828-16ec-4035-8445-e8efaafe50c9"
}
You can further refine your search by adding additional criteria:
- { $.type = "audit" && $.module = "iam" } - For IAM-related audit events
- { $.type = "audit" && $.actor = "[email protected]" } - For actions by a specific user
- { $.type = "audit" && $.target = "[email protected]" } - For actions affecting a specific user
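For a scripted audit review, the same filter patterns work with the SDK. A minimal sketch with boto3; the actor address is a hypothetical placeholder, and read access to the log group is assumed:

```python
import boto3

LOG_GROUP = "/aws/lambda/infra-core-api-lambda"
ACTOR = "[email protected]"  # hypothetical actor; substitute the real address

logs = boto3.client("logs")
paginator = logs.get_paginator("filter_log_events")

pattern = f'{{ $.type = "audit" && $.actor = "{ACTOR}" }}'

for page in paginator.paginate(logGroupName=LOG_GROUP, filterPattern=pattern):
    for event in page["events"]:
        print(event["message"])
```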
For a higher-level view of system health and issues, we maintain a CloudWatch dashboard in Prod only:
- Go to ACM Infrastructure Dashboard
- Review the metrics and log widgets for warnings and errors
- Drill down into specific issues by clicking on the relevant widgets
The dashboard provides:
- Error counts by service
- Lambda function performance metrics
- SQS queue statistics
- API Gateway metrics
- Recent warnings and errors from all services
If a user reports being unable to access a feature they should have permission for:
- Make sure that the user is in fact assigned to an access group that grants them access to the resource
- Have them try in Incognito
- If the request works in Incognito, it's likely a browser cache issue. Clear their cache/cookies and try again.
- Ask for the request ID from their browser console or network tab
- Search logs using:
{ $.reqId = "Request ID Here" }
- Look for authorization-related messages
- Check what roles were resolved for the user
- Compare against the required roles for the endpoint
If an asynchronous operation fails (e.g., email not sent, membership not provisioned):
- Find the initial request that triggered the SQS message using its request ID
- Locate the SQS message ID in the logs
- Search for that message ID using:
{ $.sqsMessageId = "Message ID Here" }
- Look for error messages or exceptions
- Check if the message was sent to the dead-letter queue
- Examine SQS queue metrics as described in the "Checking SQS Queue Statistics for Issues" section
For example, you might find an error like:
{
"level": 50,
"time": 1742433189419,
"pid": 2,
"hostname": "169.254.7.197",
"context": "sqsHandler",
"sqsMessageId": "cbef03c3-be75-4624-8250-a2f28167df75",
"function": "provisionNewMember",
"err": {
"name": "EntraGroupError",
"message": "Could not add user to group"
},
"msg": "Failed to process SQS message"
}
If a webhook (e.g., Stripe) isn't processing correctly:
- Find the webhook event ID in the external service (e.g., Stripe dashboard)
- Search logs for that event ID using:
{ $.metadata.initiator = "Stripe Event ID Here" }
- Check if the webhook was received by the API
- Verify if an SQS message was created
- Trace the message processing as described above
- Use JSON filters: Always use structured JSON queries like { $.reqId = "Request ID Here" } instead of simple text searches for more accurate results.
- Start with the request ID: Always begin tracing from the client-facing request ID when available.
- Look for error chains: When you find an error, check for earlier errors or warnings that might have led to it.
- Check surrounding context: Filter logs around the timestamp of an error to find related events that might not share the same request ID.
- Follow the trail: Pay attention to how a request flows through different components (API → SQS → Lambda) by following the correlation IDs.
- Use time window filtering: When dealing with high-volume logs, narrow down by time first, then apply more specific filters (a scripted example follows this list).
- Correlate SQS metrics with logs: When investigating SQS issues, always check both the queue metrics and the related log messages to get a complete picture of the problem.
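For the time-window tip in particular, here is a sketch of what that looks like with boto3: restrict the query to a window around the incident first, then apply a field filter. The incident time is a placeholder, and the numeric levels assume pino's defaults (40 = warning, 50 = error), which match the log samples above.

```python
import boto3
from datetime import datetime, timedelta, timezone

LOG_GROUP = "/aws/lambda/infra-core-api-lambda"
incident = datetime(2025, 3, 22, 3, 15, tzinfo=timezone.utc)  # placeholder incident time

# CloudWatch expects epoch milliseconds for startTime/endTime
start_ms = int((incident - timedelta(minutes=15)).timestamp() * 1000)
end_ms = int((incident + timedelta(minutes=15)).timestamp() * 1000)

logs = boto3.client("logs")
paginator = logs.get_paginator("filter_log_events")

for page in paginator.paginate(
    logGroupName=LOG_GROUP,
    startTime=start_ms,
    endTime=end_ms,
    filterPattern="{ $.level >= 40 }",  # warnings and errors only
):
    for event in page["events"]:
        print(event["message"])
```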