
Alerting and Log Tracing


When developing and maintaining a serverless application like the ACM Core Platform, effective alerting and log tracing are essential for troubleshooting issues. This guide explains how to respond to alerts and how to trace requests through our system.

Understanding Alerting

TODO: Write this section :)

Understanding Request Tracing

In a distributed serverless architecture, a single user action may trigger multiple services (API Gateway, Lambda, SQS, etc.). Correlating logs across these services is crucial for understanding the complete request lifecycle and identifying where issues occur.

Our platform implements a consistent request tracing strategy that allows you to follow a request through all system components.

Tracing API Requests

Step 1: Find the Request ID

Every HTTP response from our API includes an X-Amzn-Request-ID header. This ID serves as the primary correlation key for tracing the request through our system.

To find this ID:

  1. In browser developer tools, check the Network tab and inspect the response headers
  2. If using Postman or similar tools, check the Response Headers section
  3. For browser errors, review the console logs, which often contain the request ID

The X-Amzn-Request-ID header looks something like: 76e06578-9d37-4736-a1fa-e3c1bf3a421b
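If you want to grab the ID programmatically, here is a minimal sketch that works in a browser console or Node 18+ (the endpoint path is just an example; header lookups via fetch are case-insensitive):

// Fetch an API endpoint and read the correlation header off the response.
const response = await fetch("/api/v1/healthz");
console.log(response.headers.get("x-amzn-request-id"));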

Step 2: Search CloudWatch Logs

Once you have the request ID:

  1. Go to AWS CloudWatch Logs

  2. Navigate to the /aws/lambda/infra-core-api-lambda log group

    • We use the same CloudWatch log group for both API and SQS. SQS log streams are prefixed with infra-core-api-sqs-lambda[$LATEST].
  3. Use the search functionality to find logs containing your request ID

In our logs, the request ID appears in the reqId field:

{
  "level": 30,
  "time": 1742660845671,
  "pid": 2,
  "hostname": "169.254.98.121",
  "reqId": "18c3880f-1ef3-4534-9ecc-fb287ab274d3",
  "url": "/api/v1/healthz",
  "statusCode": 200,
  "durationMs": 1,
  "msg": "request completed"
}

This will return all log entries related to that specific request, allowing you to trace it through the system.
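For example, using the reqId from the log entry above, a structured filter pattern like { $.reqId = "18c3880f-1ef3-4534-9ecc-fb287ab274d3" } in the CloudWatch Logs search box will return every structured entry for that request.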

Tracing SQS Messages

Asynchronous operations that use SQS require a slightly different approach.

For Messages Initiated by API Requests

When an API request triggers an SQS message (e.g., sending an email, provisioning a membership):

  1. Find the X-Amzn-Request-ID as described above
  2. Search for this ID in the logs to find the SQS message ID that was generated
  3. Once you have the SQS message ID, search for that ID to trace the message processing

Some API routes return a queueId in the response body with HTTP status 202. If you have this queue ID, it may be easier to search the SQS log streams for it directly.

Our logs include both the original request ID and the SQS message ID in the metadata, creating a correlation chain.
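As a rough illustration of how that correlation chain can be produced (a sketch only; the actual producer code, helper names, and message shape in our codebase may differ), the request ID and initiator can be attached to the message body when the task is enqueued:

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Hypothetical producer: carry the originating request ID and initiator in the
// message body so the SQS consumer can log them as metadata.
async function enqueueTask(reqId: string, initiator: string, payload: unknown) {
  const result = await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.QUEUE_URL!, // e.g. the infra-core-api-sqs-queue URL
      MessageBody: JSON.stringify({ metadata: { reqId, initiator }, payload }),
    }),
  );
  // The returned MessageId is the sqsMessageId you will later see in the consumer logs.
  return result.MessageId;
}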

For Webhook-Triggered Messages

For SQS messages triggered by webhooks (e.g., Stripe events):

  1. Find the Stripe event ID from the webhook payload or Stripe dashboard (starting with evt_)
  2. Search the logs for this event ID, which will be in the initiator field of our SQS metadata
  3. Alternatively, if you know the approximate time, you can search for the type of webhook (e.g., checkout.session.completed)

Using Message IDs in the SQS Consumer Logs

Every log entry in our SQS consumer includes the SQS message ID in a structured field, making it easy to trace a specific message's processing:

{
  "level": 30,
  "time": 1742433189419,
  "pid": 2,
  "hostname": "169.254.7.197",
  "context": "sqsHandler",
  "sqsMessageId": "ab562a92-1d62-4820-94fa-8a67cd767e39",
  "metadata": {
    "reqId": "a9392702-d807-4248-b0dd-4f23f3a58cf5",
    "initiator": "evt_1R4XfTDiGOXU9RuSLmMwOcQU"
  },
  "function": "provisionNewMember",
  "msg": "Starting handler for provisionNewMember..."
}

Notice how the log contains:

  • sqsMessageId: The unique ID of the SQS message
  • metadata.reqId: The original request ID that triggered this operation
  • metadata.initiator: The event ID (in this case a Stripe event) that initiated the process

To trace a specific SQS message:

  1. Search for the SQS message ID using: { $.sqsMessageId = "eb60c704-eae9-4d74-b7e1-c6c6cd3fddad" }
  2. Or search for the original request: { $.metadata.reqId = "d262f298-8bb4-4099-8798-53b06c674f09" }
  3. Or search for a Stripe event: { $.metadata.initiator = "evt_1R4XfTDiGOXU9RuSLmMwOcQU" }

Checking SQS Queue Statistics for Issues

When troubleshooting SQS-related problems, examining queue statistics can provide valuable insights:

  1. Go to the AWS SQS console
  2. Select the relevant queue (e.g., infra-core-api-sqs-queue)
  3. Review these key metrics:
    • ApproximateNumberOfMessages: Current number of messages available for retrieval
    • ApproximateNumberOfMessagesDelayed: Messages in the queue that are delayed and not yet available for processing
    • ApproximateNumberOfMessagesNotVisible: Messages that are being processed but not yet deleted
    • ApproximateAgeOfOldestMessage: Age of the oldest message in the queue (shown on the queue's Monitoring tab); helps identify processing delays

Unusually high values in these metrics often indicate processing bottlenecks:

  • High ApproximateNumberOfMessages suggests messages are being produced faster than they can be consumed
  • High ApproximateNumberOfMessagesNotVisible might indicate consumers are taking too long to process messages or are failing to delete them after processing
  • High ApproximateAgeOfOldestMessage can reveal stuck messages or processing issues
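If you would rather pull these numbers from a script than the console, here is a minimal sketch using the AWS SDK for JavaScript v3 (getQueueDepth is a hypothetical helper; substitute the real queue URL):

import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Fetch the approximate depth counters for a queue in one call.
async function getQueueDepth(queueUrl: string) {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: [
        "ApproximateNumberOfMessages",
        "ApproximateNumberOfMessagesDelayed",
        "ApproximateNumberOfMessagesNotVisible",
      ],
    }),
  );
  return Attributes;
}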

You can also examine the CloudWatch metrics for SQS queues to identify trends over time:

  1. Go to CloudWatch Metrics
  2. Select "SQS" from the metrics namespace
  3. Choose the queue name and relevant metrics
  4. Check for patterns such as:
    • Sudden increases in message count
    • Spikes in processing time
    • Consistent growth in queue depth without corresponding processing activity

For messages that fail processing multiple times, check the Dead-Letter Queue (DLQ):

  1. Go to the AWS SQS console
  2. Select the dead-letter queue (e.g., infra-core-api-sqs-dlq)
  3. View messages in the queue
  4. Examine the message attributes and body to understand the failure cause
  5. Check how many times the message was received before being sent to the DLQ (using the ApproximateReceiveCount attribute)
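To inspect a DLQ from a script instead of the console, a minimal sketch along the same lines (again assuming the AWS SDK v3; inspectDlq is a hypothetical helper) is:

import { SQSClient, ReceiveMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Peek at up to 10 DLQ messages and report how often each was received before
// it was moved here. A VisibilityTimeout of 0 leaves the messages visible for redrive.
async function inspectDlq(dlqUrl: string) {
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
      AttributeNames: ["ApproximateReceiveCount"],
      VisibilityTimeout: 0,
    }),
  );
  for (const message of Messages ?? []) {
    console.log(message.MessageId, message.Attributes?.ApproximateReceiveCount, message.Body);
  }
}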

Finding Audit Logs

Audit logs capture important operational events like permission changes, critical data modifications, and sensitive actions.

To find audit logs:

  1. Search CloudWatch logs with the filter pattern: { $.type = "audit" }
  2. This will return all structured log entries that have the audit type

For example, a typical audit log might look like:

{
  "level": 30,
  "time": 1742433190419,
  "pid": 2,
  "hostname": "169.254.7.197",
  "type": "audit",
  "module": "iam",
  "actor": "[email protected]",
  "target": "[email protected]",
  "msg": "added target to group ID efd48828-16ec-4035-8445-e8efaafe50c9"
}

You can further refine your search by adding additional criteria:

  • { $.type = "audit" && $.module = "iam" } - For IAM-related audit events
  • { $.type = "audit" && $.actor = "[email protected]" } - For actions by a specific user
  • { $.type = "audit" && $.target = "[email protected]" } - For actions affecting a specific user
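If you are adding a new audit event, the field layout above looks like pino-style structured logging; a minimal sketch of emitting an entry that these filters would match (the logger setup in our codebase may differ, and the actor/target values here are placeholders) is:

import pino from "pino";

const logger = pino();

// Hypothetical example values; the "type" field is what the audit filters match on.
logger.info(
  {
    type: "audit",
    module: "iam",
    actor: "actor@example.com",
    target: "target@example.com",
  },
  "added target to group ID efd48828-16ec-4035-8445-e8efaafe50c9",
);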

Using the Infrastructure Dashboard

For a higher-level view of system health and issues, we maintain a CloudWatch dashboard (production only):

  1. Go to ACM Infrastructure Dashboard
  2. Review the metrics and log widgets for warnings and errors
  3. Drill down into specific issues by clicking on the relevant widgets

The dashboard provides:

  • Error counts by service
  • Lambda function performance metrics
  • SQS queue statistics
  • API Gateway metrics
  • Recent warnings and errors from all services

Troubleshooting Specific Scenarios

API Authorization Issues

If a user reports being unable to access a feature they should have permission for:

  1. Make sure that the user is in fact assigned to an access group that grants them access to the resource
  2. Have them try in Incognito
    • If the request works in Incognito, it's likely a browser cache issue. Clear their cache/cookies and try again.
  3. Ask for the request ID from their browser console or network tab
  4. Search logs using: { $.reqId = "Request ID Here" }
  5. Look for authorization-related messages
  6. Check what roles were resolved for the user
  7. Compare against the required roles for the endpoint

Failed SQS Message Processing

If an asynchronous operation fails (e.g., email not sent, membership not provisioned):

  1. Find the initial request that triggered the SQS message using its request ID
  2. Locate the SQS message ID in the logs
  3. Search for that message ID using: { $.sqsMessageId = "Message ID Here" }
  4. Look for error messages or exceptions
  5. Check if the message was sent to the dead-letter queue
  6. Examine SQS queue metrics as described in the "Checking SQS Queue Statistics for Issues" section

For example, you might find an error like:

{
  "level": 50,
  "time": 1742433189419,
  "pid": 2,
  "hostname": "169.254.7.197",
  "context": "sqsHandler",
  "sqsMessageId": "cbef03c3-be75-4624-8250-a2f28167df75",
  "function": "provisionNewMember",
  "err": {
    "name": "EntraGroupError",
    "message": "Could not add user to group"
  },
  "msg": "Failed to process SQS message"
}

Webhook Processing Issues

If a webhook (e.g., Stripe) isn't processing correctly:

  1. Find the webhook event ID in the external service (e.g., Stripe dashboard)
  2. Search logs for that event ID using: { $.metadata.initiator = "Stripe Event ID Here" }
  3. Check if the webhook was received by the API
  4. Verify if an SQS message was created
  5. Trace the message processing as described above
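For reference, the evt_ ID comes from the verified Stripe event itself. A minimal sketch of where that value would be captured (assuming the official stripe Node library; parseWebhook is a hypothetical helper and our actual handler may differ) is:

import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY ?? "");

// Verify the webhook signature, then use the Stripe event ID (starts with "evt_")
// as the "initiator" correlation value for any follow-up SQS work.
function parseWebhook(rawBody: string, signature: string, endpointSecret: string) {
  const event = stripe.webhooks.constructEvent(rawBody, signature, endpointSecret);
  return { initiator: event.id, type: event.type };
}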

Best Practices for Log Analysis

  1. Use JSON filters: Always use structured JSON queries like { $.reqId = "Request ID Here" } instead of simple text searches for more accurate results.

  2. Start with the request ID: Always begin tracing from the client-facing request ID when available.

  3. Look for error chains: When you find an error, check for earlier errors or warnings that might have led to it.

  4. Check surrounding context: Filter logs around the timestamp of an error to find related events that might not share the same request ID.

  5. Follow the trail: Pay attention to how a request flows through different components (API → SQS → Lambda) by following the correlation IDs.

  6. Use time window filtering: When dealing with high-volume logs, narrow down by time first, then apply more specific filters.

  7. Correlate SQS metrics with logs: When investigating SQS issues, always check both the queue metrics and the related log messages to get a complete picture of the problem.