Outbound Campaign Drop Rate Spike During BYOC Trunk Failover in APAC Region

Our outbound dialing campaign SG-OUTBOUND-2024-Q3 shows a sharp drop in successful connections whenever the primary carrier fails and traffic shifts to the secondary BYOC trunk. We are running Genesys Cloud platform version v2024.1.1 with the Outbound module enabled. The environment has 15 BYOC trunks across the APAC region and uses sequential failover with retry_interval_ms set to 2000 and max_retries set to 3.
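To make the timing of that configuration concrete, here is a minimal Python sketch of sequential failover. It is only a model for discussion, not Genesys Cloud code: `FailoverPolicy`, `place_call`, and the `attempt` callback are hypothetical names, and it assumes max_retries means the total number of attempts on a trunk before moving to the next one, which may not match how the platform counts.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class FailoverPolicy:
    # Values taken from the configuration described in the post.
    retry_interval_ms: int = 2000
    max_retries: int = 3  # assumption: total attempts per trunk


def place_call(
    trunks: Sequence[str],
    attempt: Callable[[str], bool],
    policy: FailoverPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try each trunk in order; return the trunk that connected, or None."""
    for trunk in trunks:
        for n in range(policy.max_retries):
            if attempt(trunk):
                return trunk
            # Wait between attempts on the same trunk, but not before failing over.
            if n < policy.max_retries - 1:
                sleep(policy.retry_interval_ms / 1000)
    return None
```

Under these assumptions a call spends about 4 seconds on Carrier A (three attempts, two waits) before it reaches Carrier B, so every call that was in flight when Carrier A failed lands on Carrier B at nearly the same moment.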

When the primary trunk (Carrier A) becomes unavailable due to SIP 408 Request Timeout errors, the system correctly initiates failover to the secondary trunk (Carrier B). However, approximately 15-20% of calls routed through the secondary trunk immediately receive a 486 Busy Here or 503 Service Unavailable response from the carrier, despite the trunk registration status showing REGISTERED and healthy in the Admin console. This behavior does not occur during normal operation on the primary trunk.
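To put a number on the 15-20% figure per trunk, a small tally over exported call results may help. This is a rough sketch that assumes each record is a (trunk name, final SIP status code) pair; that record shape and the `rejection_rate` helper are my own, not a Genesys Cloud export format.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# The two carrier responses seen on the secondary trunk.
REJECT_CODES = {486, 503}


def rejection_rate(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the fraction of calls per trunk that ended in 486 or 503."""
    total: Counter = Counter()
    rejected: Counter = Counter()
    for trunk, status in records:
        total[trunk] += 1
        if status in REJECT_CODES:
            rejected[trunk] += 1
    return {trunk: rejected[trunk] / total[trunk] for trunk in total}
```

Running this separately over the failover window and over a normal window on the same trunk would show whether the rejections are tied to the burst or to Carrier B in general.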

Investigating the SIP traces via the Genesys Cloud API endpoint GET /v2/architect/flows/versions, I observed that the outbound flow is not waiting for the SIP dialog to fully establish before attempting the next call leg during the failover window. The sip_reg_timeout_s is configured to 10, which should be sufficient, but the rapid succession of calls during the failover event seems to overwhelm the carrier’s SIP proxy.

Has anyone encountered similar issues with carrier-specific quirks during BYOC trunk failover in high-volume outbound campaigns? We are considering adjusting the outbound_call_attempts logic in the Architect flow to introduce a delay or implement a distributed lock mechanism similar to what was suggested for OAuth token refreshes under high load. Any insights on configuring the failover policy to better handle carrier-specific SIP dialog establishment times would be appreciated. We need to ensure that the predictive routing engine respects the carrier’s capacity limits during these transient failover states.
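The kind of delay we have in mind is essentially a token bucket in front of the secondary trunk. Here is a minimal Python sketch of the idea, assuming we can gate call attempts ourselves; `TokenBucket`, `rate_per_s`, and `burst` are hypothetical names, and the right values would depend on what Carrier B's proxy can actually absorb.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `burst` calls at once, refilled at `rate_per_s` calls per second."""

    def __init__(
        self,
        rate_per_s: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if rate_per_s <= 0 or burst < 1:
            raise ValueError("rate_per_s must be > 0 and burst must be >= 1")
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Return True if a call may be placed now, False if it should wait."""
        now = self.clock()
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A call that gets False would be deferred rather than sent, which spreads the post-failover burst over time instead of presenting it to the carrier all at once.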

Make sure you verify the recording metadata payload during the failover event. The Data Action engine often struggles with concurrent writes when traffic shifts across all 15 trunks. That can create a race condition in which the recording job starts while its metadata is still incomplete, and the outbound engine then drops the call prematurely.