feat(control-plane): add support for handling multiple events in a single invocation
Currently we restrict the `scale-up` Lambda to handling a single event
at a time. In very busy environments this can become a bottleneck: calls
to GitHub and AWS APIs happen on every invocation, and they can take
long enough that we can't process job queued events as fast as they
arrive.
In our environment we are also using a pool, and we have typically
responded to the resulting alerts (SQS queue length growing) by
expanding the size of the pool. This helps because we more frequently
find that no scale-up is needed, which lets the Lambdas exit earlier, so
we get through the queue faster. But it makes the environment much less
responsive to changes in usage patterns.
At its core, this Lambda's task is to work out how many instances are
needed and construct an EC2 `CreateFleet` call to create them. This is a
job that can be batched: we can take any number of events, calculate the
diff between our current state and the number of jobs we have, cap it at
the configured maximum, and then issue a single call.
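The batched sizing calculation can be sketched as follows. This is a simplified illustration, not the actual implementation: `QueuedJobEvent`, `currentCount`, and `maxRunners` are hypothetical stand-ins for state the real Lambda derives from GitHub and AWS API calls.

```typescript
// Hypothetical shape for a queued-job event pulled from SQS.
interface QueuedJobEvent {
  id: string;
}

// Compute one CreateFleet target from a whole batch of events:
// one instance per queued job, capped at the configured maximum.
function instancesToCreate(
  events: QueuedJobEvent[],
  currentCount: number,
  maxRunners: number,
): number {
  const desired = currentCount + events.length;
  const capped = Math.min(desired, maxRunners);
  // Never ask for a negative number of instances.
  return Math.max(capped - currentCount, 0);
}
```

With 8 runners already up, a cap of 10, and 3 queued events, this yields a single `CreateFleet` call for 2 instances rather than three separate invocations.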
The thing to be careful about is handling partial failures, where EC2
creates some of the instances we wanted but not all of them. Lambda has
a configurable function response type which can be set to
`ReportBatchItemFailures`. In this mode, we return a list of failed
message IDs from our handler and those messages are retried. We can use
this to hand back exactly the events we failed to process.
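The partial-failure contract can be illustrated like this. It is a sketch, assuming the event source mapping is configured with the `ReportBatchItemFailures` function response type; the record and response shapes are simplified versions of the `SQSEvent`/`SQSBatchResponse` types from `aws-lambda`.

```typescript
// Simplified SQS record: only the fields this sketch needs.
interface SQSRecordLike {
  messageId: string;
  body: string;
}

// Shape Lambda expects when ReportBatchItemFailures is enabled.
interface SQSBatchResponse {
  batchItemFailures: { itemIdentifier: string }[];
}

// If EC2 created fewer instances than the batch asked for, hand the
// unprocessed tail of the batch back to SQS so those messages retry.
function buildBatchResponse(
  records: SQSRecordLike[],
  createdCount: number,
): SQSBatchResponse {
  const failed = records.slice(createdCount);
  return {
    batchItemFailures: failed.map((r) => ({ itemIdentifier: r.messageId })),
  };
}
```

Returning an empty `batchItemFailures` array tells SQS the whole batch succeeded; listing every ID fails the whole batch.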
Now that we're potentially processing multiple events in a single
invocation, one thing we should optimise for is not recreating GitHub
API clients. We need one client for the app itself, which we use to look
up installation IDs, and then one client for each installation relevant
to the batch of events we are processing. This is done by creating a new
client the first time we see an event for a given installation and
reusing it for the rest of the batch.
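The per-installation caching can be sketched as below. This is illustrative only: `Client` and `createInstallationClient` are hypothetical stand-ins for an authenticated Octokit instance and the factory that builds one from an installation token.

```typescript
// Stand-in for an authenticated per-installation GitHub client.
type Client = { installationId: number };

// One cache per invocation, keyed by installation ID.
const clientCache = new Map<number, Client>();

// Hypothetical factory; the real code would authenticate as the app
// and exchange that for an installation token.
function createInstallationClient(installationId: number): Client {
  return { installationId };
}

// Create a client only the first time we see this installation in the
// batch; every later event for it reuses the cached client.
function getClient(installationId: number): Client {
  let client = clientCache.get(installationId);
  if (client === undefined) {
    client = createInstallationClient(installationId);
    clientCache.set(installationId, client);
  }
  return client;
}
```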
We also remove the same `batch_size = 1` constraint from the `job-retry`
Lambda and make the batch size configurable instead, using AWS's SQS
default of 10 when not set. This Lambda is used to retry events that
previously failed. Here, instead of reporting failures to be retried, we
keep the pre-existing fault-tolerant behaviour: errors are logged but
explicitly do not cause message retries, avoiding infinite loops from
persistent GitHub API issues or malformed events.
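That fault-tolerant loop amounts to the following sketch. `retryJob` is a hypothetical per-message handler standing in for the real retry logic; the key point is that a throwing message is logged and skipped, never handed back to SQS.

```typescript
// Process each record independently; swallow (but log) per-record errors
// so a persistently failing message cannot loop forever through the queue.
function handleRetryBatch(
  records: { messageId: string; body: string }[],
  retryJob: (body: string) => void,
): string[] {
  const failed: string[] = [];
  for (const record of records) {
    try {
      retryJob(record.body);
    } catch (e) {
      // Logged for observability only; deliberately not returned as
      // batchItemFailures, so SQS will not redeliver the message.
      failed.push(record.messageId);
      console.error(`job-retry failed for message ${record.messageId}`, e);
    }
  }
  return failed;
}
```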
Tests are added for all of this.
README.md (2 additions, 0 deletions):

```diff
@@ -155,6 +155,8 @@ Join our discord community via [this invite link](https://discord.gg/bxgXW8jJGh)
 | <a name="input_key_name"></a> [key\_name](#input\_key\_name) | Key pair name | `string` | `null` | no |
 | <a name="input_kms_key_arn"></a> [kms\_key\_arn](#input\_kms\_key\_arn) | Optional CMK Key ARN to be used for Parameter Store. This key must be in the current account. | `string` | `null` | no |
 | <a name="input_lambda_architecture"></a> [lambda\_architecture](#input\_lambda\_architecture) | AWS Lambda architecture. Lambda functions using Graviton processors ('arm64') tend to have better price/performance than 'x86\_64' functions. | `string` | `"arm64"` | no |
+| <a name="input_lambda_event_source_mapping_batch_size"></a> [lambda\_event\_source\_mapping\_batch\_size](#input\_lambda\_event\_source\_mapping\_batch\_size) | Maximum number of records to pass to the lambda function in a single batch for the event source mapping. When not set, the AWS default of 10 events will be used. | `number` | `10` | no |
+| <a name="input_lambda_event_source_mapping_maximum_batching_window_in_seconds"></a> [lambda\_event\_source\_mapping\_maximum\_batching\_window\_in\_seconds](#input\_lambda\_event\_source\_mapping\_maximum\_batching\_window\_in\_seconds) | Maximum amount of time to gather records before invoking the lambda function, in seconds. AWS requires this to be greater than 0 if batch\_size is greater than 10. Defaults to 0. | `number` | `0` | no |
 | <a name="input_lambda_principals"></a> [lambda\_principals](#input\_lambda\_principals) | (Optional) add extra principals to the role created for execution of the lambda, e.g. for local testing. | <pre>list(object({<br/> type = string<br/> identifiers = list(string)<br/> }))</pre> | `[]` | no |
 | <a name="input_lambda_runtime"></a> [lambda\_runtime](#input\_lambda\_runtime) | AWS Lambda runtime. | `string` | `"nodejs22.x"` | no |
 | <a name="input_lambda_s3_bucket"></a> [lambda\_s3\_bucket](#input\_lambda\_s3\_bucket) | S3 bucket from which to specify lambda functions. This is an alternative to providing local files directly. | `string` | `null` | no |
```