
Alerting and Log Tracing


When developing and maintaining a serverless application like the ACM Core Platform, effective alerting and log tracing are essential for troubleshooting issues. This guide explains how to respond to alerts and how to trace requests through our system.

Understanding Alerting

TODO: Write this section :)

Understanding Request Tracing

In a distributed serverless architecture, a single user action may trigger multiple services (API Gateway, Lambda, SQS, etc.). Correlating logs across these services is crucial for understanding the complete request lifecycle and identifying where issues occur.

Our platform implements a consistent request tracing strategy that allows you to follow a request through all system components.

Tracing API Requests

Step 1: Find the Request ID

Every HTTP response from our API includes an X-Amzn-Request-ID header. This ID serves as the primary correlation key for tracing the request through our system.

To find this ID:

  1. In browser developer tools, check the Network tab and inspect the response headers
  2. If using Postman or similar tools, check the Response Headers section
  3. For browser errors, review the console logs, which often contain the request ID

The X-Amzn-Request-ID header looks something like: 76e06578-9d37-4736-a1fa-e3c1bf3a421b
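If you want to grab the ID programmatically, here is a minimal sketch that works in a browser console or Node 18+ (the endpoint path is just an example; header lookups via fetch are case-insensitive):

// Fetch an API endpoint and read the correlation header off the response.
const response = await fetch("/api/v1/healthz");
console.log(response.headers.get("x-amzn-request-id"));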

Step 2: Search CloudWatch Logs

Once you have the request ID:

  1. Go to AWS CloudWatch Logs

  2. Navigate to the /aws/lambda/infra-core-api-lambda log group

    • We use the same CloudWatch log group for both API and SQS. SQS log streams are prefixed with infra-core-api-sqs-lambda[$LATEST].
  3. Use the search functionality to find logs containing your request ID

In our logs, the request ID appears in the reqId field:

{
  "level": 30,
  "time": 1742660845671,
  "pid": 2,
  "hostname": "169.254.98.121",
  "reqId": "18c3880f-1ef3-4534-9ecc-fb287ab274d3",
  "url": "/api/v1/healthz",
  "statusCode": 200,
  "durationMs": 1,
  "msg": "request completed"
}

This will return all log entries related to that specific request, allowing you to trace it through the system.
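For example, using the reqId from the log entry above, a structured filter pattern like { $.reqId = "18c3880f-1ef3-4534-9ecc-fb287ab274d3" } in the CloudWatch Logs search box will return every structured entry for that request.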

Tracing SQS Messages

Asynchronous operations that use SQS require a slightly different approach.

For Messages Initiated by API Requests

When an API request triggers an SQS message (e.g., sending an email, provisioning a membership):

  1. Find the X-Amzn-Request-ID as described above
  2. Search for this ID in the logs to find the SQS message ID that was generated
  3. Once you have the SQS message ID, search for that ID to trace the message processing

Some API routes return a queueId in the response body with HTTP status 202. If you have this queue ID, it may be easier to search the SQS log streams for it directly.

Our logs include both the original request ID and the SQS message ID in the metadata, creating a correlation chain.
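As a rough illustration of how that correlation chain can be produced (a sketch only; the actual producer code, helper names, and message shape in our codebase may differ), the request ID and initiator can be attached to the message body when the task is enqueued:

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Hypothetical producer: carry the originating request ID and initiator in the
// message body so the SQS consumer can log them as metadata.
async function enqueueTask(reqId: string, initiator: string, payload: unknown) {
  const result = await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.QUEUE_URL!, // e.g. the infra-core-api-sqs-queue URL
      MessageBody: JSON.stringify({ metadata: { reqId, initiator }, payload }),
    }),
  );
  // The returned MessageId is the sqsMessageId you will later see in the consumer logs.
  return result.MessageId;
}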

For Webhook-Triggered Messages

For SQS messages triggered by webhooks (e.g., Stripe events):

  1. Find the Stripe event ID from the webhook payload or Stripe dashboard (starting with evt_)
  2. Search the logs for this event ID, which will be in the initiator field of our SQS metadata
  3. Alternatively, if you know the approximate time, you can search for the type of webhook (e.g., checkout.session.completed)

Using Message IDs in the SQS Consumer Logs

Every log entry in our SQS consumer includes the SQS message ID in a structured field, making it easy to trace a specific message's processing:

{
  "level": 30,
  "time": 1742433189419,
  "pid": 2,
  "hostname": "169.254.7.197",
  "context": "sqsHandler",
  "sqsMessageId": "ab562a92-1d62-4820-94fa-8a67cd767e39",
  "metadata": {
    "reqId": "a9392702-d807-4248-b0dd-4f23f3a58cf5",
    "initiator": "evt_1R4XfTDiGOXU9RuSLmMwOcQU"
  },
  "function": "provisionNewMember",
  "msg": "Starting handler for provisionNewMember..."
}

Notice how the log contains:

  • sqsMessageId: The unique ID of the SQS message
  • metadata.reqId: The original request ID that triggered this operation
  • metadata.initiator: The event ID (in this case a Stripe event) that initiated the process

To trace a specific SQS message:

  1. Search for the SQS message ID using: { $.sqsMessageId = "eb60c704-eae9-4d74-b7e1-c6c6cd3fddad" }
  2. Or search for the original request: { $.metadata.reqId = "d262f298-8bb4-4099-8798-53b06c674f09" }
  3. Or search for a Stripe event: { $.metadata.initiator = "evt_1R4XfTDiGOXU9RuSLmMwOcQU" }

Checking SQS Queue Statistics for Issues

When troubleshooting SQS-related problems, examining queue statistics can provide valuable insights:

  1. Go to the AWS SQS console
  2. Select the relevant queue (e.g., infra-core-api-sqs-queue)
  3. Review these key metrics:
    • ApproximateNumberOfMessages: Current number of messages available for retrieval
    • ApproximateNumberOfMessagesDelayed: Messages in the queue that are delayed and not yet available for processing
    • ApproximateNumberOfMessagesNotVisible: Messages that are being processed but not yet deleted
    • ApproximateAgeOfOldestMessage: Age of the oldest message in the queue (shown on the queue's Monitoring tab); helps identify processing delays

Unusually high values in these metrics often indicate processing bottlenecks:

  • High ApproximateNumberOfMessages suggests messages are being produced faster than they can be consumed
  • High ApproximateNumberOfMessagesNotVisible might indicate consumers are taking too long to process messages or are failing to delete them after processing
  • High ApproximateAgeOfOldestMessage can reveal stuck messages or processing issues
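If you would rather pull these numbers from a script than the console, here is a minimal sketch using the AWS SDK for JavaScript v3 (getQueueDepth is a hypothetical helper; substitute the real queue URL):

import { SQSClient, GetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Fetch the approximate depth counters for a queue in one call.
async function getQueueDepth(queueUrl: string) {
  const { Attributes } = await sqs.send(
    new GetQueueAttributesCommand({
      QueueUrl: queueUrl,
      AttributeNames: [
        "ApproximateNumberOfMessages",
        "ApproximateNumberOfMessagesDelayed",
        "ApproximateNumberOfMessagesNotVisible",
      ],
    }),
  );
  return Attributes;
}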

You can also examine the CloudWatch metrics for SQS queues to identify trends over time:

  1. Go to CloudWatch Metrics
  2. Select "SQS" from the metrics namespace
  3. Choose the queue name and relevant metrics
  4. Check for patterns such as:
    • Sudden increases in message count
    • Spikes in processing time
    • Consistent growth in queue depth without corresponding processing activity

For messages that fail processing multiple times, check the Dead-Letter Queue (DLQ):

  1. Go to the AWS SQS console
  2. Select the dead-letter queue (e.g., infra-core-api-sqs-dlq)
  3. View messages in the queue
  4. Examine the message attributes and body to understand the failure cause
  5. Check how many times the message was received before being sent to the DLQ (using the ApproximateReceiveCount attribute)
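To inspect a DLQ from a script instead of the console, a minimal sketch along the same lines (again assuming the AWS SDK v3; inspectDlq is a hypothetical helper) is:

import { SQSClient, ReceiveMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// Peek at up to 10 DLQ messages and report how often each was received before
// it was moved here. A VisibilityTimeout of 0 leaves the messages visible for redrive.
async function inspectDlq(dlqUrl: string) {
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: dlqUrl,
      MaxNumberOfMessages: 10,
      AttributeNames: ["ApproximateReceiveCount"],
      VisibilityTimeout: 0,
    }),
  );
  for (const message of Messages ?? []) {
    console.log(message.MessageId, message.Attributes?.ApproximateReceiveCount, message.Body);
  }
}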

Finding Audit Logs

Audit logs capture important operational events like permission changes, critical data modifications, and sensitive actions.

To find audit logs:

  1. Search CloudWatch logs with the filter pattern: { $.type = "audit" }
  2. This will return all structured log entries that have the audit type

For example, a typical audit log might look like:

{
  "level": 30,
  "time": 1742433190419,
  "pid": 2,
  "hostname": "169.254.7.197",
  "type": "audit",
  "module": "iam",
  "actor": "[email protected]",
  "target": "[email protected]",
  "msg": "added target to group ID efd48828-16ec-4035-8445-e8efaafe50c9"
}

You can further refine your search by adding additional criteria:

  • { $.type = "audit" && $.module = "iam" } - For IAM-related audit events
  • { $.type = "audit" && $.actor = "[email protected]" } - For actions by a specific user
  • { $.type = "audit" && $.target = "[email protected]" } - For actions affecting a specific user
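If you are adding a new audit event, the field layout above looks like pino-style structured logging; a minimal sketch of emitting an entry that these filters would match (the logger setup in our codebase may differ, and the actor/target values here are placeholders) is:

import pino from "pino";

const logger = pino();

// Hypothetical example values; the "type" field is what the audit filters match on.
logger.info(
  {
    type: "audit",
    module: "iam",
    actor: "actor@example.com",
    target: "target@example.com",
  },
  "added target to group ID efd48828-16ec-4035-8445-e8efaafe50c9",
);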

Using the Infrastructure Dashboard

For a higher-level view of system health and issues, we maintain a CloudWatch dashboard (production only):

  1. Go to ACM Infrastructure Dashboard
  2. Review the metrics and log widgets for warnings and errors
  3. Drill down into specific issues by clicking on the relevant widgets

The dashboard provides:

  • Error counts by service
  • Lambda function performance metrics
  • SQS queue statistics
  • API Gateway metrics
  • Recent warnings and errors from all services

Troubleshooting Specific Scenarios

API Authorization Issues

If a user reports being unable to access a feature they should have permission for:

  1. Make sure that the user is in fact assigned to an access group that grants them access to the resource
  2. Have them try in Incognito
    • If the request works in Incognito, it's likely a browser cache issue. Clear their cache/cookies and try again.
  3. Ask for the request ID from their browser console or network tab
  4. Search logs using: { $.reqId = "Request ID Here" }
  5. Look for authorization-related messages
  6. Check what roles were resolved for the user
  7. Compare against the required roles for the endpoint

Failed SQS Message Processing

If an asynchronous operation fails (e.g., email not sent, membership not provisioned):

  1. Find the initial request that triggered the SQS message using its request ID
  2. Locate the SQS message ID in the logs
  3. Search for that message ID using: { $.sqsMessageId = "Message ID Here" }
  4. Look for error messages or exceptions
  5. Check if the message was sent to the dead-letter queue
  6. Examine SQS queue metrics as described in the "Checking SQS Queue Statistics for Issues" section

For example, you might find an error like:

{
  "level": 50,
  "time": 1742433189419,
  "pid": 2,
  "hostname": "169.254.7.197",
  "context": "sqsHandler",
  "sqsMessageId": "cbef03c3-be75-4624-8250-a2f28167df75",
  "function": "provisionNewMember",
  "err": {
    "name": "EntraGroupError",
    "message": "Could not add user to group"
  },
  "msg": "Failed to process SQS message"
}

Webhook Processing Issues

If a webhook (e.g., Stripe) isn't processing correctly:

  1. Find the webhook event ID in the external service (e.g., Stripe dashboard)
  2. Search logs for that event ID using: { $.metadata.initiator = "Stripe Event ID Here" }
  3. Check if the webhook was received by the API
  4. Verify if an SQS message was created
  5. Trace the message processing as described above
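For reference, the evt_ ID comes from the verified Stripe event itself. A minimal sketch of where that value would be captured (assuming the official stripe Node library; parseWebhook is a hypothetical helper and our actual handler may differ) is:

import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY ?? "");

// Verify the webhook signature, then use the Stripe event ID (starts with "evt_")
// as the "initiator" correlation value for any follow-up SQS work.
function parseWebhook(rawBody: string, signature: string, endpointSecret: string) {
  const event = stripe.webhooks.constructEvent(rawBody, signature, endpointSecret);
  return { initiator: event.id, type: event.type };
}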

Best Practices for Log Analysis

  1. Use JSON filters: Always use structured JSON queries like { $.reqId = "Request ID Here" } instead of simple text searches for more accurate results.

  2. Start with the request ID: Always begin tracing from the client-facing request ID when available.

  3. Look for error chains: When you find an error, check for earlier errors or warnings that might have led to it.

  4. Check surrounding context: Filter logs around the timestamp of an error to find related events that might not share the same request ID.

  5. Follow the trail: Pay attention to how a request flows through different components (API → SQS → Lambda) by following the correlation IDs.

  6. Use time window filtering: When dealing with high-volume logs, narrow down by time first, then apply more specific filters.

  7. Correlate SQS metrics with logs: When investigating SQS issues, always check both the queue metrics and the related log messages to get a complete picture of the problem.