EventBridge batch processing logic for CXone interaction events

Running into a wall with our EventBridge consumer for high-volume interaction events. We are getting a spike in Messaging and Voice events during peak hours, around 2000 events per second. The current Lambda function processes one event at a time, which is causing throttling and hitting the concurrency limit quickly. We need to process these in batches, but the EventBridge rule sends individual records to the Lambda.

We tried using the batchSize parameter in the EventBridge target configuration, but CXone events still arrive as single-item arrays in most cases, or the batch size is too small to make a difference. The Lambda timeout is set to 15 seconds, and processing 10 events takes about 3 seconds (roughly 300 ms per event), so per-invocation throughput is still poor.
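To put numbers on the concurrency pressure, here is a back-of-the-envelope sketch using Little's law (concurrent executions ≈ arrival rate × time each event occupies a worker). The 300 ms per event figure is derived from the "10 events in about 3 seconds" observation above and assumes strictly sequential processing; the function name is only illustrative.

```python
def required_concurrency(events_per_second: float, seconds_per_event: float) -> float:
    """Little's law: average number of events in flight at a given arrival rate."""
    return events_per_second * seconds_per_event


# From the post: 10 events take about 3 s when handled one after another.
seconds_per_event = 3 / 10

current_peak = required_concurrency(2000, seconds_per_event)  # observed peak rate
target_peak = required_concurrency(5000, seconds_per_event)   # stated target rate

print(f"peak today: ~{round(current_peak)} concurrent executions")
print(f"at target:  ~{round(target_peak)} concurrent executions")
```

With one event per invocation, the concurrency requirement scales linearly with the event rate, which is why raising the limit alone gets expensive. Overlapping the slow REST calls inside each invocation lowers the time each event occupies a worker.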

Here is the current Lambda handler structure:

def lambda_handler(event, context):
    for record in event.get('Records', []):
        detail = record['detail']
        interaction_id = detail.get('interactionId')
        # Process interaction
        update_crm(interaction_id)
    return {'statusCode': 200}

The error we see in CloudWatch is "Task timed out after 15.00 seconds". We need to process these events without dropping them. Is there a way to configure the EventBridge rule to send larger batches from CXone? Or should we use SQS as a buffer and have Lambda poll the queue?
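For the SQS option, a minimal sketch of what the handler might look like, assuming the EventBridge rule targets an SQS queue and the Lambda consumes it through an SQS event source mapping. In that setup each SQS record's body is a JSON string containing the EventBridge envelope, so the interaction data sits under body → detail. The update_crm stub and the extract_interaction_ids helper are placeholders for illustration, not part of our current code.

```python
import json


def update_crm(interaction_id):
    """Placeholder for the internal REST call made by the real handler."""
    return interaction_id


def extract_interaction_ids(event):
    """Collect interactionId from each SQS record wrapping an EventBridge event."""
    ids = []
    for record in event.get("Records", []):
        envelope = json.loads(record["body"])  # SQS delivers the body as a string
        detail = envelope.get("detail", {})
        ids.append(detail.get("interactionId"))
    return ids


def lambda_handler(event, context):
    # One invocation now receives a whole batch of queued events.
    for interaction_id in extract_interaction_ids(event):
        update_crm(interaction_id)
    return {"statusCode": 200}
```

In this arrangement the batch size and batching window are settings on the SQS event source mapping rather than on the EventBridge rule, and the queue absorbs peak-hour spikes instead of pushing them straight into Lambda concurrency.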

We have tried increasing the Lambda concurrency limit, but the cost is getting too high. We need a more efficient way to handle the throughput. The events contain sensitive data, so we cannot use a public endpoint. We are using IAM roles for authentication.

Are there any code examples for batch processing EventBridge records in Python? The current setup does not scale: we are losing data during peak times, and we need a solution that handles at least 5000 events per second without missing any events.

The update_crm function is the bottleneck. It makes a REST call to our internal API. We are thinking of using asyncio to parallelize the calls within the Lambda, but we are not sure how to structure it with the EventBridge batch input.
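A rough sketch of the asyncio idea, assuming update_crm is a blocking call (as it would be with a synchronous SDK) and that the interaction IDs have already been extracted from the batch. The blocking calls run in a thread pool whose size caps how many requests are in flight at once. MAX_IN_FLIGHT, process_batch, and the update_crm stub are assumptions for illustration; the cap would need tuning against the API rate limit.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 20  # assumed cap on simultaneous CRM calls; tune to the rate limit


def update_crm(interaction_id):
    """Stand-in for the blocking REST call; sleeps briefly to mimic latency."""
    if interaction_id is None:
        raise ValueError("missing interactionId")
    time.sleep(0.01)
    return interaction_id


async def process_batch(interaction_ids, max_in_flight=MAX_IN_FLIGHT):
    """Run update_crm for every ID, with at most max_in_flight calls at a time."""
    loop = asyncio.get_running_loop()
    # The pool size bounds concurrency: extra calls queue until a thread frees up.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [loop.run_in_executor(pool, update_crm, i) for i in interaction_ids]
        # return_exceptions=True returns failures in place instead of raising,
        # so one bad event does not hide the results of the others.
        return await asyncio.gather(*futures, return_exceptions=True)


if __name__ == "__main__":
    outcome = asyncio.run(process_batch(["a", "b", "c"]))
    print(outcome)
```

Results come back in the same order as the input IDs, so a failed entry can be matched to its event by position. With calls overlapping, the batch takes roughly as long as its slowest calls rather than the sum of all of them, which matters when latency ranges from 200 ms to 2 s.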

Also, how do we handle partial failures in a batch? If one event fails, do we retry the whole batch? We need to know the best practice for error handling here, because we don’t want to lose any interaction data.

The CXone API is rate-limited, so we need to respect that too. We are using the Python SDK for the internal API calls, and the response time varies from 200 ms to 2 s. This variability is causing the Lambda to time out. We need a consistent way to process these events. Any help would be appreciated; we have been stuck on this for weeks.
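On the partial-failure question, a minimal sketch assuming the Lambda is fed by an SQS event source mapping with ReportBatchItemFailures enabled in its function response types. In that mode the handler returns the messageId of each record that failed, and only those messages become visible on the queue again. Without it, an exception from the handler causes the whole batch to be retried. The update_crm stub is a placeholder that fails when the ID is missing.

```python
import json


def update_crm(interaction_id):
    """Placeholder for the internal REST call; raises on a missing ID."""
    if not interaction_id:
        raise ValueError("missing interactionId")
    return interaction_id


def lambda_handler(event, context):
    failures = []
    for record in event.get("Records", []):
        try:
            detail = json.loads(record["body"]).get("detail", {})
            update_crm(detail.get("interactionId"))
        except Exception:
            # Report only this message as failed so the rest of the batch
            # is deleted from the queue and not processed a second time.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda that the whole batch succeeded.
    return {"batchItemFailures": failures}
```

Because failed messages are redelivered, update_crm should tolerate being called twice for the same interaction. A dead-letter queue on the source queue would be a reasonable place to capture events that keep failing, so they are kept for inspection instead of being retried forever or lost.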