We’ve got a massive spike in interaction events hitting our EventBridge bus. The downstream Lambda that processes these is consistently hitting the 1000 concurrent execution limit, causing dropped events and failed updates in our custom analytics DB. I know EventBridge has retry logic, but the exponential backoff isn’t helping because the volume is just too high and sustained.
Here’s the basic handler structure:
def lambda_handler(event, context):
for record in event['records']:
interaction_id = record['detail']['interactionId']
# Heavy processing here
update_analytics(interaction_id)
The issue is that update_analytics takes about 200ms per call. With 5000 events/sec, we’re dead in the water. I’ve tried increasing the batch size in the EventBridge destination config, but that just increases the payload size and doesn’t solve the concurrency bottleneck per Lambda instance.
- Lambda provisioned concurrency: 500
- EventBridge destination: Batch size 100
- Error rate: ~15% during peak hours (14:00-16:00 CET)
What’s the best way to handle this without rewriting the whole pipeline? Should I be pushing to SQS first and then processing from there, or is there a setting I’m missing in the EventBridge integration?