I'm trying to understand why our primary SIP trunk registration flaps intermittently during peak inbound call volume, resulting in 408 Request Timeout errors from the Genesys Cloud SIP endpoint. The environment consists of Genesys Cloud Platform (v2023.11) integrated with an Avaya Aura Session Border Controller acting as the SIP trunk provider. The SBC is configured with a keep-alive interval of 30 seconds, matching the default Genesys Cloud SIP trunk settings.
The issue manifests when inbound call volume exceeds 500 concurrent calls. At this threshold, the SBC logs show that the Genesys Cloud side fails to respond to the OPTIONS keep-alive messages, timing out after 30 seconds. Consequently, the SBC marks the trunk as unreachable and attempts to re-register. During this re-registration window, any new inbound calls are dropped with a 503 Service Unavailable response. The Genesys Cloud SIP trunk health dashboard shows the status oscillating between ‘Available’ and ‘Unavailable’ every 45-60 seconds during these spikes.
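To make the flapping sequence concrete, here is a minimal Python sketch of the keep-alive logic as we understand it. It assumes a single unanswered OPTIONS is enough for the SBC to mark the trunk down, which is our reading of the logs rather than documented SBC behavior. The `Probe` type and `trunk_states` function are names made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Probe:
    sent_at: float                 # seconds since start of observation
    answered_at: Optional[float]   # None if no response was ever seen


def trunk_states(probes: List[Probe], timeout: float = 30.0) -> List[Tuple[float, str]]:
    """Return (time, new_state) transitions, starting from 'Available'.

    Assumption: one missed probe flips the trunk to 'Unavailable' and one
    answered probe flips it back.
    """
    state = "Available"
    transitions = []
    for p in probes:
        ok = p.answered_at is not None and (p.answered_at - p.sent_at) <= timeout
        new_state = "Available" if ok else "Unavailable"
        if new_state != state:
            # A miss is only known once the timeout expires;
            # a success is known when the response arrives.
            when = p.answered_at if ok else p.sent_at + timeout
            transitions.append((when, new_state))
            state = new_state
    return transitions


# Hypothetical probe history: every other OPTIONS goes unanswered.
history = [Probe(0.0, 0.1), Probe(30.0, None), Probe(60.0, 60.2), Probe(90.0, None)]
for when, state in trunk_states(history):
    print(f"{when:6.1f}s -> {state}")
```

With alternating answered and unanswered probes, the modelled state changes on almost every keep-alive cycle, which is roughly the oscillation the dashboard shows.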
We have verified the network path using Wireshark captures on the SBC. The OPTIONS requests are leaving the SBC, but the TCP connection seems to be reset by the Genesys Cloud load balancer before the response is fully transmitted. This suggests a potential resource exhaustion issue on the Genesys Cloud SIP termination servers rather than a network latency problem. We have already increased the jitter buffer on the SBC and adjusted the SIP timeout values to 60 seconds, but the 408 errors persist.
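For anyone who wants to repeat the capture analysis, the check can be scripted along the lines of the Python sketch below. It assumes the capture has already been exported into simple records. The `Packet` type, its field names, and the `unanswered_options` function are hypothetical and do not correspond to any Wireshark or Genesys format. It pairs each OPTIONS with its response by Call-ID and CSeq, then notes whether a TCP RST was seen while the request was still waiting.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Packet:
    ts: float                      # capture timestamp in seconds
    kind: str                      # "options", "response", or "rst"
    call_id: Optional[str] = None  # SIP Call-ID (None for bare TCP RSTs)
    cseq: Optional[int] = None     # SIP CSeq number
    status: Optional[int] = None   # SIP status code for responses


def unanswered_options(packets: List[Packet], timeout: float = 30.0) -> List[Tuple[str, float, bool]]:
    """Return (call_id, sent_ts, rst_seen) for each OPTIONS with no response within `timeout`."""
    ordered = sorted(packets, key=lambda p: p.ts)

    # Earliest response time per (Call-ID, CSeq) transaction.
    responses = {}
    for p in ordered:
        if p.kind == "response":
            responses.setdefault((p.call_id, p.cseq), p.ts)

    rst_times = [p.ts for p in ordered if p.kind == "rst"]

    missing = []
    for p in ordered:
        if p.kind != "options":
            continue
        answered = responses.get((p.call_id, p.cseq))
        if answered is not None and p.ts <= answered <= p.ts + timeout:
            continue  # answered in time
        rst_seen = any(p.ts <= t <= p.ts + timeout for t in rst_times)
        missing.append((p.call_id, p.ts, rst_seen))
    return missing


# Hypothetical records, not taken from our real capture.
sample = [
    Packet(0.0, "options", "a", 1),
    Packet(0.05, "response", "a", 1, 200),
    Packet(30.0, "options", "b", 1),
    Packet(31.0, "rst"),
    Packet(60.0, "options", "c", 1),
]
print(unanswered_options(sample))
```

An unanswered OPTIONS with a RST inside its wait window points at the connection being torn down, while one without a RST looks more like a silently dropped request.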
Has anyone encountered similar behavior with high-throughput SIP trunks? Are there specific Genesys Cloud configuration limits for concurrent SIP dialogs that might be triggering this? We are considering implementing a secondary trunk for redundancy, but we need to understand the root cause first to avoid cascading failures. Any insights into the internal load balancing logic for SIP keep-alives would be appreciated.
Thanks for the help.