We are dealing with a strange bug affecting SIP registration stability during scheduled failover drills.
We operate 15 BYOC trunks in the Asia/Singapore region. The setup involves primary and secondary carrier connections for redundancy. Today, we initiated a controlled failover test to validate our outbound routing logic.
The issue manifests specifically when the primary trunk experiences a simulated degradation. Instead of a clean switchover to the secondary trunk, we are seeing a burst of SIP 408 Request Timeout errors.
This happens approximately 200ms after the failover trigger fires. The error logs point to the edge nodes attempting to re-register or establish new dialogs before the primary session state is fully cleared.
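To pin down the 200ms figure more precisely, a helper along these lines could compute how long after the trigger each 408 appears. This is only a sketch: the function name `offsets_ms` is made up, and it assumes both the failover trigger and the 408 entries carry ISO 8601 timestamps, which is not necessarily how the edge logs are formatted.

```python
from datetime import datetime, timedelta

def offsets_ms(trigger_ts: str, error_ts: list[str]) -> list[float]:
    """Milliseconds between the failover trigger and each 408 log entry.

    Assumes ISO 8601 timestamps (hypothetical; adapt the parsing to the real log format).
    """
    trigger = datetime.fromisoformat(trigger_ts)
    one_ms = timedelta(milliseconds=1)
    # Dividing one timedelta by another yields a plain float.
    return [(datetime.fromisoformat(ts) - trigger) / one_ms for ts in error_ts]
```

If the offsets cluster tightly around one value regardless of load, that would point more toward a fixed internal timeout than toward gradual resource exhaustion.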
We are using the standard Genesys Cloud BYOC configuration. The SIP credentials are static. The outbound routing is set to “Best Effort” with a 30-second retry timeout.
The anomaly is that the 408 errors are not random. They correlate directly with the number of active concurrent calls on the primary trunk at the moment of failure. If concurrency is below 50 calls, the failover is clean. Above 100 calls, the 408 errors spike.
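To make the correlation easier to check across drills, the results could be grouped into the concurrency bands above. The sketch below is illustrative only: the record layout, the function names, and the sample values are placeholders I made up, not measurements from our trunks.

```python
from collections import defaultdict

# PLACEHOLDER data: (concurrent calls on primary at failover, number of 408s seen).
# These are invented values for illustration, not real drill results.
SAMPLE_DRILLS = [(20, 0), (45, 0), (75, 3), (110, 18), (150, 41)]

def band(concurrency: int) -> str:
    """Map a concurrency value to one of the bands described in the post."""
    if concurrency < 50:
        return "<50"
    if concurrency <= 100:
        return "50-100"
    return ">100"

def summarize(drills):
    """Return {band: (number of drills, total 408 errors)}."""
    totals = defaultdict(lambda: [0, 0])
    for concurrency, errors in drills:
        key = band(concurrency)
        totals[key][0] += 1
        totals[key][1] += errors
    return {key: tuple(value) for key, value in totals.items()}
```

The 50-100 band is kept separate on purpose, since we have not characterized what happens in that range yet.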
This suggests either resource contention on the edge side or a race condition in the SIP state machine.
We have checked the carrier logs. The primary carrier is not sending any BYE messages prematurely. The timeouts are generated internally by the Genesys Cloud edge.
Has anyone encountered similar behavior with high-concurrency failovers? We need to ensure our analytics reporting remains accurate during these events. The current timeouts are causing gaps in our real-time conversation summaries.
We are considering increasing the SIP timer values, but I want to understand the root cause first. Increasing timers might just mask the issue rather than solve it.
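For reference, here is the arithmetic behind the standard RFC 3261 INVITE client timers over UDP, where Timer A starts at T1 and doubles on each retransmission and Timer B (the transaction timeout that surfaces as a 408) is 64*T1. The sketch assumes the RFC default of T1 = 500 ms; I have not confirmed what values the Genesys Cloud edge actually uses, and the function name is just for illustration.

```python
T1_MS = 500  # RFC 3261 default for T1; ASSUMED here, not confirmed for the edge

def invite_retransmit_schedule(t1_ms: int = T1_MS) -> tuple[list[int], int]:
    """Return (cumulative INVITE retransmit times in ms, Timer B expiry in ms).

    Models the RFC 3261 INVITE client transaction over UDP:
    Timer A starts at T1 and doubles after each retransmission,
    and Timer B ends the transaction at 64*T1.
    """
    timer_b = 64 * t1_ms
    times = []
    elapsed, interval = 0, t1_ms
    while elapsed + interval < timer_b:
        elapsed += interval
        times.append(elapsed)
        interval *= 2
    return times, timer_b
```

With the default T1, Timer B only fires after 32 seconds, so a 408 appearing roughly 200ms after the trigger does not look like an ordinary transaction timeout under these assumptions. If that holds for the edge, raising the timers may not change the behavior at all.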
Any insights into how the edge handles SIP state cleanup during forced failovers would be appreciated. We are trying to maintain a 99.99% uptime SLA for our voice traffic.