Designing Carrier Failover Cascade Logic for Automatic Rerouting During Provider Outages
What This Guide Covers
This guide details the architectural implementation of dynamic carrier failover cascades within Genesys Cloud CX and NICE CXone to ensure business continuity during telephony provider outages. You will configure routing logic that detects SIP trunk health, evaluates latency thresholds, and automatically shifts inbound and outbound traffic to secondary or tertiary carriers without agent intervention or manual reconfiguration.
Prerequisites, Roles & Licensing
Genesys Cloud CX
- Licensing: CX 1 or higher (Standard telephony features are included in all CX tiers).
- Permissions:
- Telephony > Trunk > View
- Telephony > Trunk > Edit
- Routing > Flow > View
- Routing > Flow > Edit
- Administration > Settings > View
- External Dependencies:
- Two or more SIP trunk providers configured in the Genesys Cloud telephony settings.
- Access to the Genesys Cloud API for programmatic health checks (optional but recommended for advanced logic).
NICE CXone
- Licensing: Standard or Premium (Telephony routing features are core, but advanced scripting may require Premium).
- Permissions:
- Telephony > Trunks > Manage
- Routing > Scripts > Create/Edit
- Administration > System Settings > View
- External Dependencies:
- Multiple SIP trunk configurations assigned to the same outbound route or inbound dial plan.
- Network connectivity allowing SIP OPTIONS or registration pings to carrier endpoints.
The Implementation Deep-Dive
1. Establishing the Telephony Topology and Trunk Health Baseline
Before designing the cascade logic, you must understand how the platform interprets “trunk health.” In both Genesys Cloud and NICE CXone, a trunk is not merely a static pipe; it is a stateful entity that reports status via SIP registrations, heartbeat mechanisms, and real-time call attempt metrics.
The Trap: Relying solely on SIP Registration status.
A trunk may show as “Registered” (200 OK to REGISTER) while the actual media path is degraded, or while the carrier accepts registrations but drops INVITEs before ever returning a 183 Session Progress. If your failover logic only checks registration, you will route calls into a black hole, resulting in immediate disconnects for the customer.
Architectural Reasoning:
We must decouple connectivity from deliverability. Connectivity is binary (up/down). Deliverability is probabilistic (success rate). Our cascade logic must prioritize deliverability metrics over simple connectivity flags.
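A minimal sketch of that separation, assuming a hypothetical `TrunkHealth` record that is fed call outcomes by whatever monitoring you already run. The window size and the 0.9 success-rate floor are illustrative values, not platform settings.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TrunkHealth:
    """Tracks connectivity (binary) and deliverability (success rate) separately."""
    registered: bool = True   # connectivity flag, e.g. from SIP registration
    window: int = 50          # recent call attempts to remember (assumed value)
    _outcomes: deque = field(default_factory=deque)

    def record_attempt(self, succeeded: bool) -> None:
        # Keep only the most recent `window` outcomes.
        self._outcomes.append(succeeded)
        if len(self._outcomes) > self.window:
            self._outcomes.popleft()

    def success_rate(self) -> float:
        # With no attempts recorded yet, fall back to the connectivity flag.
        if not self._outcomes:
            return 1.0 if self.registered else 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def is_routable(self, min_success_rate: float = 0.9) -> bool:
        # Deliverability can veto a trunk even while it is still registered.
        return self.registered and self.success_rate() >= min_success_rate
```

A trunk that stays registered but fails every recent attempt reports `is_routable() == False`, which is the "black hole" case the trap describes.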
Genesys Cloud Configuration
- Navigate to Admin > Telephony > Trunks.
- Ensure each trunk has a distinct Trunk Name and Trunk ID.
- Enable SIP Trunk Security settings that allow for strict TLS/SRTP if required by the carrier, but note that encryption adds some overhead, mainly the TLS handshake when the signaling connection is established.
- In the Outbound Routing configuration, assign multiple trunks to a single Outbound Route with specific Ordering.
- Primary Trunk: Highest priority.
- Secondary Trunk: Next priority.
- Tertiary Trunk: Fallback.
Critical Setting: In Genesys Cloud, the “Failover” behavior is often implicit in the outbound route ordering. However, for inbound failover, you must configure Inbound Routes to point to multiple trunks or use a Flow to dynamically select the trunk based on availability.
NICE CXone Configuration
- Navigate to Telephony > Trunks.
- Configure each trunk with a unique Trunk Name.
- In Telephony > Outbound Routes, add multiple trunks to the route.
- Set the Failover Policy to “Round Robin” or “Sequential” depending on load balancing needs. For pure failover, use Sequential with explicit priority levels.
Critical Setting: Enable Trunk Health Monitoring in NICE CXone. This feature sends periodic SIP OPTIONS packets to the carrier. If the OPTIONS packet fails or exceeds a latency threshold, the trunk is marked as “Unhealthy” in the routing engine.
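To make the OPTIONS check concrete, here is a rough sketch of what such a probe sends and how a reply might be judged. It is not the CXone implementation: the function names, the `probe@` user part, and the 150 ms limit are assumptions, and the network send/receive step is deliberately left out.

```python
import uuid
from typing import Optional


def build_options_request(target_host: str, local_host: str, local_port: int = 5060) -> str:
    """Build a minimal SIP OPTIONS request as text, following RFC 3261 header rules."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # branch must start with this magic cookie
    lines = [
        f"OPTIONS sip:{target_host} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
    ]
    # Headers are CRLF-separated and terminated by an empty line.
    return "\r\n".join(lines) + "\r\n\r\n"


def classify_reply(status_line: Optional[str], rtt_ms: Optional[float],
                   max_rtt_ms: float = 150.0) -> str:
    """Return "Healthy" only for a 2xx reply that arrived within the latency limit."""
    if status_line is None or rtt_ms is None:
        return "Unhealthy"  # no reply before the timeout
    parts = status_line.split()
    code_ok = len(parts) >= 2 and parts[1].isdigit() and parts[1].startswith("2")
    return "Healthy" if code_ok and rtt_ms <= max_rtt_ms else "Unhealthy"
```

Some carriers answer OPTIONS with a non-2xx code even when they are reachable, so the "2xx only" rule here is a simplification you may need to relax per carrier.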
2. Implementing Dynamic Failover in Genesys Cloud Architect
Static outbound routing is insufficient for complex inbound scenarios where the caller ID or DID determines the trunk. We must use Genesys Cloud Architect to create a dynamic selection process.
The Trap: Using “Set Trunk” blocks without error handling.
If you use a Set Trunk block in a Flow and the specified trunk is down, the call fails immediately with a 503 Service Unavailable or a timeout. The Flow does not automatically try the next trunk unless you explicitly code the retry logic.
Architectural Reasoning:
We need a “Try-Catch” pattern in the Flow. We attempt to use the primary trunk. If it fails, we catch the exception and route to the secondary trunk. This requires using Call Control blocks and Exception Handling.
Step-by-Step Flow Construction
- Start Block: Receive the inbound call.
- Data Block: Define variables for trunk IDs.
- primary_trunk_id
- secondary_trunk_id
- tertiary_trunk_id
- Try Block:
- Inside the Try block, place a Set Trunk block targeting primary_trunk_id.
- Follow with the standard routing logic (Queue, Agent, etc.).
- Catch Block:
- Configure the Catch block to trigger on TRUNK_UNAVAILABLE or CALL_FAILED exceptions.
- Inside the Catch block, place another Set Trunk block targeting secondary_trunk_id.
- Repeat the routing logic.
- Nested Catch (Optional):
- For tertiary failover, nest another Try-Catch inside the first Catch block.
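The Try-Catch cascade above can be sketched outside Architect as a plain loop. This is only an illustration of the control flow: `TrunkUnavailable` and the `place_call` callback are hypothetical stand-ins for the TRUNK_UNAVAILABLE / CALL_FAILED conditions and the routing step, not Genesys Cloud APIs.

```python
class TrunkUnavailable(Exception):
    """Hypothetical stand-in for a TRUNK_UNAVAILABLE or CALL_FAILED condition."""


def route_with_cascade(trunk_ids, place_call):
    """Try each trunk in priority order and return the first one that accepts the call."""
    errors = []
    for trunk_id in trunk_ids:
        try:
            place_call(trunk_id)          # the "Try" step for this trunk
            return trunk_id
        except TrunkUnavailable as exc:   # the "Catch" step: fall through to the next trunk
            errors.append((trunk_id, str(exc)))
    # Every trunk failed; surface the collected reasons to the caller.
    raise RuntimeError(f"all trunks failed: {errors}")
```

A loop is equivalent to nesting one Try-Catch inside the previous Catch, and it avoids adding another nesting level for every extra carrier.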
Code Snippet: JSON Payload for Trunk Selection via API
If you prefer to manage trunk selection via the Genesys Cloud API rather than Architect, you can update the outbound route configuration dynamically.
PUT /api/v2/outbound/routes/{outboundRouteId}
Content-Type: application/json
Authorization: Bearer {access_token}

{
  "name": "Dynamic_Failover_Route",
  "description": "Route with dynamic trunk failover",
  "trunks": [
    {
      "id": "primary_trunk_uuid",
      "order": 1
    },
    {
      "id": "secondary_trunk_uuid",
      "order": 2
    },
    {
      "id": "tertiary_trunk_uuid",
      "order": 3
    }
  ],
  "enabled": true
}
Note: This API call updates the static order. For true dynamic failover based on real-time health, you must use the Architect Flow method described above, as the API does not support real-time health-based reordering without a backend middleware service.
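If you do build such a middleware service, its core job is to recompute the trunk order from current health and resubmit the route. The sketch below only builds the request body in the shape of the example above; `build_route_payload` and the `healthy_ids` input are hypothetical, and sending the request is left out.

```python
def build_route_payload(name, description, trunks_by_priority, healthy_ids):
    """Return a route body with healthy trunks first, keeping configured priority within each group."""
    # sorted() is stable: False (healthy) sorts before True (unhealthy),
    # and the original priority order is preserved inside each group.
    ordered = sorted(trunks_by_priority, key=lambda trunk_id: trunk_id not in healthy_ids)
    return {
        "name": name,
        "description": description,
        "trunks": [
            {"id": trunk_id, "order": position}
            for position, trunk_id in enumerate(ordered, start=1)
        ],
        "enabled": True,
    }
```

Unhealthy trunks are demoted rather than removed, so the route still has a last-resort path if the health data itself is stale.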
3. Implementing Dynamic Failover in NICE CXone Studio
NICE CXone Studio uses a script-based approach. The key is to use the Telephony node to check trunk status before routing.
The Trap: Assuming trunk status is instantaneous.
Trunk health checks have a delay. If a trunk goes down, the Studio script may still see it as “Healthy” for up to 30-60 seconds depending on the heartbeat interval. This results in “Flapping” calls where some calls fail and others succeed during the transition.
Architectural Reasoning:
We must implement a “Grace Period” or “Hysteresis” in the logic. If the primary trunk fails, we do not immediately switch back to it after one successful OPTIONS ping. We require a sustained period of health (e.g., 3 consecutive successful pings) before reverting. This prevents oscillation.
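The hysteresis rule can be sketched as a small state holder. `HysteresisGate` is a hypothetical helper, and the default of 3 consecutive successes simply mirrors the example figure above.

```python
class HysteresisGate:
    """Marks a trunk down on any failed ping, and up again only after N consecutive successes."""

    def __init__(self, required_successes: int = 3):
        self.required_successes = required_successes
        self.healthy = True
        self._streak = 0  # consecutive successful pings since the last failure

    def observe(self, ping_ok: bool) -> bool:
        """Record one ping result and return the trunk's current health."""
        if not ping_ok:
            self.healthy = False
            self._streak = 0
        elif not self.healthy:
            self._streak += 1
            if self._streak >= self.required_successes:
                self.healthy = True
                self._streak = 0
        return self.healthy
```

With the default setting, a failed ping followed by two good ones still leaves the trunk down; only the third consecutive success restores it, and any failure in between resets the count.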
Step-by-Step Studio Script
- Start Node: Begin the script.
- Telephony Node: Check the status of Primary_Trunk.
- Use the Get Trunk Status function.
- Variable: primary_status
- Decision Node:
- Condition: primary_status == "Healthy" AND primary_latency < 150ms
- True Path: Route to Primary Trunk.
- False Path: Proceed to Secondary Check.
- Secondary Check:
- Use Get Trunk Status for Secondary_Trunk.
- Variable: secondary_status
- Decision Node:
- Condition: secondary_status == "Healthy" AND secondary_latency < 200ms
- True Path: Route to Secondary Trunk.
- False Path: Proceed to Tertiary Check or Error Handling.
- Error Handling:
- If all trunks are unhealthy, route to a “System Down” IVR or queue with a wait strategy that retries every 30 seconds.
Code Snippet: Studio Script Syntax for Latency Check
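// Pseudo-code for NICE CXone Studio logic (the Telephony.* names are illustrative)
var primaryHealth = Telephony.GetTrunkHealth("Primary_Trunk_ID");
var secondaryHealth = Telephony.GetTrunkHealth("Secondary_Trunk_ID");
var tertiaryHealth = Telephony.GetTrunkHealth("Tertiary_Trunk_ID");

if (primaryHealth.IsUp && primaryHealth.Latency < 150) {
    Telephony.SetTrunk("Primary_Trunk_ID");
} else if (secondaryHealth.IsUp && secondaryHealth.Latency < 200) {
    Telephony.SetTrunk("Secondary_Trunk_ID");
} else if (tertiaryHealth.IsUp) {
    // Last resort: accept the tertiary trunk on connectivity alone
    Telephony.SetTrunk("Tertiary_Trunk_ID");
} else {
    // All trunks unhealthy: hand off to the "System Down" IVR or retry queue
}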
4. Advanced: Implementing Hysteresis and Latency Thresholds
Simple up/down checks are insufficient for high-quality voice. You must configure latency thresholds to prevent routing calls to a “healthy” but “slow” trunk.
The Trap: Ignoring Jitter and Packet Loss.
A trunk may have low latency but high jitter. This causes choppy audio. Your failover logic should ideally incorporate jitter metrics if the carrier provides them via SIP headers or out-of-band monitoring.
Architectural Reasoning:
We define a “Quality Gate” for each trunk. If the primary trunk’s latency exceeds 150ms or jitter exceeds 30ms, it is considered “Degraded” even if it is “Up.” The cascade logic should treat “Degraded” as “Down” for failover purposes.
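A minimal sketch of such a quality gate, using the 150 ms and 30 ms figures above as defaults. `TrunkState`, `classify`, and `eligible` are hypothetical names chosen for illustration.

```python
from enum import Enum


class TrunkState(Enum):
    HEALTHY = "Healthy"
    DEGRADED = "Degraded"
    DOWN = "Down"


def classify(is_up: bool, latency_ms: float, jitter_ms: float,
             max_latency_ms: float = 150.0, max_jitter_ms: float = 30.0) -> TrunkState:
    """Apply the quality gate: an 'Up' trunk that breaches either limit is Degraded."""
    if not is_up:
        return TrunkState.DOWN
    if latency_ms > max_latency_ms or jitter_ms > max_jitter_ms:
        return TrunkState.DEGRADED
    return TrunkState.HEALTHY


def eligible(state: TrunkState) -> bool:
    """For failover purposes, Degraded is treated the same as Down."""
    return state is TrunkState.HEALTHY
```

Keeping "Degraded" as its own state, rather than collapsing it into "Down" at classification time, lets dashboards and alerts distinguish a slow carrier from a dead one while routing still skips both.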
Genesys Cloud Implementation
- In Admin > Telephony > Trunks, enable Quality Monitoring if available via the carrier integration.
- In the Architect Flow, use a Data Block to fetch real-time trunk metrics via the API.
- Use a Set Variable block to store the latency value.
- Use a Decision Block to compare the latency against the threshold.
API Endpoint for Trunk Metrics:
GET /api/v2/telephony/trunks/{trunkId}/metrics
JSON Response Example:
{
  "id": "trunk_uuid",
  "metrics": {
    "latency": {
      "value": 120,
      "unit": "ms"
    },
    "jitter": {
      "value": 15,
      "unit": "ms"
    },
    "packetLoss": {
      "value": 0.001,
      "unit": "ratio"
    }
  }
}
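A backend service consuming a response shaped like the example above might extract the values as follows. This sketch assumes exactly that shape and those units; `extract_metrics` is a hypothetical helper, and a live response may differ.

```python
import json

# Units the example response uses for each metric.
EXPECTED_UNITS = {"latency": "ms", "jitter": "ms", "packetLoss": "ratio"}

EXAMPLE_BODY = """
{"id": "trunk_uuid",
 "metrics": {"latency": {"value": 120, "unit": "ms"},
             "jitter": {"value": 15, "unit": "ms"},
             "packetLoss": {"value": 0.001, "unit": "ratio"}}}
"""


def extract_metrics(body: str) -> dict:
    """Parse the response text and return {metric_name: float_value}, rejecting unexpected units."""
    metrics = json.loads(body)["metrics"]
    values = {}
    for name, expected_unit in EXPECTED_UNITS.items():
        entry = metrics[name]
        if entry["unit"] != expected_unit:
            raise ValueError(f"{name}: expected unit {expected_unit!r}, got {entry['unit']!r}")
        values[name] = float(entry["value"])
    return values


print(extract_metrics(EXAMPLE_BODY))
```

Checking the unit before comparing against a threshold guards against silently comparing seconds with milliseconds if the upstream format changes.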
NICE CXone Implementation
- Configure Trunk Health Profiles in Telephony > Trunks.
- Set the Latency Threshold to 150ms.
- Set the Jitter Threshold to 30ms.
- The Studio script will automatically reflect these thresholds in the GetTrunkHealth function.
Validation, Edge Cases & Troubleshooting
Edge Case 1: The “Flapping” Failover
The Failure Condition:
The primary trunk experiences intermittent packet loss. The failover logic switches to the secondary trunk, then back to the primary, then back to the secondary, causing calls to be dropped or agents to receive disconnected calls.
The Root Cause:
The health check interval and the hysteresis window (grace period) are both too short, so the system reacts to transient network blips rather than sustained outages.
The Solution:
Increase the health check interval to 10-15 seconds. Implement a “Cooldown Period” in the logic. Once a trunk is marked as “Down,” it must remain in the “Down” state for at least 60 seconds before being eligible for re-evaluation. In Genesys Cloud, this can be achieved by using a Wait block or a Timer in the Flow before re-checking the primary trunk. In NICE CXone, configure the Trunk Health Profile to require multiple consecutive successful pings before marking the trunk as “Healthy.”
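The cooldown rule can be sketched as a timestamp check. `CooldownTracker` is a hypothetical helper that would live in whatever service holds trunk state; the clock is injectable so the behavior can be verified without waiting 60 seconds.

```python
import time


class CooldownTracker:
    """Keeps a trunk ineligible for re-evaluation until the cooldown has elapsed."""

    def __init__(self, cooldown_s: float = 60.0, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._down_since = {}  # trunk_id -> time it was marked down

    def mark_down(self, trunk_id: str) -> None:
        self._down_since[trunk_id] = self._clock()

    def mark_up(self, trunk_id: str) -> None:
        self._down_since.pop(trunk_id, None)

    def eligible_for_recheck(self, trunk_id: str) -> bool:
        """True if the trunk was never marked down or has been down for the full cooldown."""
        down_at = self._down_since.get(trunk_id)
        return down_at is None or self._clock() - down_at >= self.cooldown_s
```

The default clock is `time.monotonic` rather than wall-clock time, so an NTP adjustment on the host cannot shorten or extend the cooldown.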
Edge Case 2: The “Silent” Failover Failure
The Failure Condition:
The primary trunk is down, but the failover logic does not trigger. Calls continue to fail on the primary trunk.
The Root Cause:
The SIP Registration is still active (200 OK), but the call path is broken. The health check only verifies registration, not call deliverability.
The Solution:
Implement “Call Probing.” Use a scheduled task or a separate Flow to make a test call to a dummy endpoint on the carrier network every 30 seconds. If the test call fails, mark the trunk as “Down” in a database or variable that the main routing Flow references. In Genesys Cloud, keep the trunk status somewhere every call can read it, such as an Architect data table; interaction (participant) attributes are scoped to a single conversation and will not carry the status from one call to the next. In NICE CXone, use Data Nodes to persist the status.
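One probe cycle might look like the sketch below. The `place_test_call` callback and the dictionary used as `status_store` are hypothetical placeholders for your test-call mechanism and for the shared store that the routing logic reads.

```python
def run_probe_cycle(trunk_ids, place_test_call, status_store):
    """Place one test call per trunk and record "Up" or "Down" in the shared store."""
    for trunk_id in trunk_ids:
        try:
            ok = bool(place_test_call(trunk_id))
        except Exception:
            # A probe that errors out is treated the same as a failed call.
            ok = False
        status_store[trunk_id] = "Up" if ok else "Down"
    return status_store
```

Run the cycle from a scheduler every 30 seconds. Because the probe exercises an actual call attempt rather than registration, it catches the case where registration succeeds but calls do not complete.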
Edge Case 3: Outbound Route Mismatch
The Failure Condition:
Inbound calls fail over correctly, but outbound calls from agents continue to use the primary trunk, which is down.
The Root Cause:
Inbound and outbound routing are often configured separately. The failover logic was only applied to the Inbound Route or Flow, not the Outbound Route.
The Solution:
Ensure that the Outbound Route configuration also includes the failover cascade. In Genesys Cloud, check the Outbound Routing settings to ensure multiple trunks are listed with correct ordering. In NICE CXone, verify that the Outbound Route includes all trunks and that the Failover Policy is set correctly. Additionally, ensure that agents are using the correct Outbound Route in their User Settings.
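A quick audit can catch this mismatch before an outage does. The sketch assumes the outbound route is available as a dictionary in the same shape as the earlier JSON payload example; `audit_outbound_route` is a hypothetical helper.

```python
def audit_outbound_route(cascade_trunk_ids, outbound_route):
    """Compare the intended cascade against the trunks configured on an outbound route."""
    configured = [
        trunk["id"]
        for trunk in sorted(outbound_route["trunks"], key=lambda trunk: trunk["order"])
    ]
    # Trunks the cascade expects but the route does not list.
    missing = [trunk_id for trunk_id in cascade_trunk_ids if trunk_id not in configured]
    # Compare relative order using only the trunks present on both sides.
    expected_order = [trunk_id for trunk_id in cascade_trunk_ids if trunk_id in configured]
    actual_order = [trunk_id for trunk_id in configured if trunk_id in cascade_trunk_ids]
    return {"missing": missing, "order_matches": expected_order == actual_order}
```

An empty `missing` list with `order_matches` set to true means the outbound route mirrors the inbound cascade; anything else points to the configuration drift described in this edge case.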