EventBridge consumer Lambda throttling on Genesys interaction events

We’ve got a Kotlin Lambda function subscribing to Genesys Cloud interaction events via EventBridge. The goal is to process high-volume web messaging start/stop events for our Android app analytics. We’re hitting the concurrency limit hard during peak hours (CT timezone).

The Lambda is configured with a reserved concurrency of 100. When the event rate spikes, we start seeing ProvisionedConcurrencyTooHigh or just standard throttling errors. The function timeout is set to 30s, but most executions finish in under 200 ms. The issue seems to be the burst rate from EventBridge overwhelming the reserved concurrency.

Here’s the relevant part of the Lambda handler in Kotlin:

fun handleEvent(event: EventBridgeEvent<List<GenesysEvent>, String>): String {
    val batch = event.detail
    batch.forEach { evt ->
        if (evt.type == "webmessaging:conversation:start") {
            // process start
            log.info("Processing start event: ${evt.id}")
            analyticsService.recordStart(evt)
        } else if (evt.type == "webmessaging:conversation:end") {
            // process end
            analyticsService.recordEnd(evt)
        }
    }
    return "processed"
}

The EventBridge rule is set to send batches of up to 10 events. We’re seeing batches arrive faster than the Lambda can scale up, even with auto-scaling enabled. The CloudWatch logs show a mix of successful invocations and ThrottlingException errors.

2024-05-20T14:30:12.123Z ERROR: ThrottlingException: Rate exceeded for function arn:aws:lambda:us-east-1:123456789012:function:gc-event-consumer
2024-05-20T14:30:12.456Z INFO: Processing start event: conv-12345
2024-05-20T14:30:12.789Z ERROR: ThrottlingException: Rate exceeded for function arn:aws:lambda:us-east-1:123456789012:function:gc-event-consumer

We’ve tried increasing the reserved concurrency, but that feels like a band-aid. Is there a way to configure the EventBridge target to retry with exponential backoff specifically for Lambda? Or should we be using an SQS queue as a buffer between EventBridge and the Lambda? The documentation mentions retryPolicy on the target, but I’m not seeing how to set it via the AWS SDK for Kotlin or Terraform.

Also, is it worth switching to an SQS FIFO queue to preserve order? We don’t strictly need ordering for analytics, but we do need to avoid processing the same event twice when a failed invocation is retried. The GenesysEvent payload includes an id field that could be used for idempotency, but handling duplicates in the Lambda logic adds complexity.
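
To make the idempotency question concrete, here is a minimal sketch of what I mean by deduplicating on the id field. Everything in it is hypothetical: the trimmed-down GenesysEvent shape, SeenEventStore, and IdempotentProcessor are illustration names, and the in-memory set stands in for a durable store (it would not survive across Lambda instances or cold starts). The key combines type and id on the assumption that a start and an end event may share the same conversation id.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical, trimmed-down event shape; the real payload has more fields.
data class GenesysEvent(val id: String, val type: String)

// In-memory stand-in for a durable dedup store (assumption: a real setup
// would use a shared store with an atomic "insert if absent" operation).
class SeenEventStore {
    private val seen = ConcurrentHashMap.newKeySet<String>()

    // Returns true only for the first caller that claims this key.
    fun claim(key: String): Boolean = seen.add(key)

    // Gives the key back so a retry of a failed event is not skipped.
    fun release(key: String) {
        seen.remove(key)
    }
}

class IdempotentProcessor(
    private val store: SeenEventStore,
    private val process: (GenesysEvent) -> Unit
) {
    // Returns true if the event was processed, false if it was a duplicate.
    fun handle(evt: GenesysEvent): Boolean {
        val key = "${evt.type}#${evt.id}"
        if (!store.claim(key)) return false
        try {
            process(evt)
        } catch (e: Exception) {
            // Processing failed: release the claim so a redelivery can retry.
            store.release(key)
            throw e
        }
        return true
    }
}
```

The handler would call handle(evt) instead of calling analyticsService directly, so a redelivered event returns false and is skipped. The extra complexity I'm worried about is mostly in replacing SeenEventStore with something durable.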

Any ideas on how to tune the EventBridge target settings or the Lambda concurrency to handle this burst? We’re looking at ~500 events/minute during peak, which seems manageable, yet we are still being throttled.
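
For what it's worth, here is the back-of-the-envelope arithmetic behind "seems manageable", as a small sketch. It uses Little's law (concurrent executions ≈ invocations per second × average duration) and assumes arrivals are spread evenly over the minute, which a real burst will not be. The function names are mine and purely illustrative.

```kotlin
// Steady-state concurrency estimate via Little's law.
// eventsPerInvocation models batching (1 = one event per invocation).
fun steadyStateConcurrency(
    eventsPerMinute: Double,
    avgDurationMs: Double,
    eventsPerInvocation: Int = 1
): Double {
    require(eventsPerInvocation > 0) { "eventsPerInvocation must be positive" }
    val invocationsPerSecond = eventsPerMinute / 60.0 / eventsPerInvocation
    return invocationsPerSecond * (avgDurationMs / 1000.0)
}

// Invocation rate a given concurrency limit can sustain at a given duration.
fun maxInvocationsPerSecond(concurrencyLimit: Int, avgDurationMs: Double): Double {
    require(avgDurationMs > 0.0) { "avgDurationMs must be positive" }
    return concurrencyLimit / (avgDurationMs / 1000.0)
}
```

With the numbers from this post, steadyStateConcurrency(500.0, 200.0) comes out at roughly 1.7 concurrent executions, and maxInvocationsPerSecond(100, 200.0) at roughly 500 invocations per second, so a reserved concurrency of 100 should be far more than enough on average. The same 500 events/minute would need about 250 concurrent executions if invocations ran all the way to the 30s timeout, so either the bursts are much sharper than the per-minute average suggests or some invocations are taking far longer than 200 ms.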