Implementing Secondary Carrier Activation Procedures for Primary Provider Outage Response

What This Guide Covers

You will configure a deterministic failover routing architecture that automatically or programmatically shifts inbound SIP traffic to a secondary carrier when the primary provider fails. The end state is a validated trunk health monitoring pipeline, an API-driven activation workflow, and an Architect routing matrix that targets sub-thirty-second cutover without DID reassignment or media path degradation.

Prerequisites, Roles & Licensing

  • Licensing Tier: Genesys Cloud CX 2 or CX 3 (SIP Trunking, Advanced Routing, and Telephony features required). NICE CXone requires Core Telephony or Advanced Voice Routing add-on.
  • Granular Permissions: Telephony > Trunk > Edit, Telephony > Route > Edit, Architect > Flow > Edit, Integrations > API > Create/Manage, Telephony > Trunk Group > Edit
  • OAuth Scopes: telephony:trunk:write, telephony:route:write, architect:flow:read, telephony:health:read
  • External Dependencies: Secondary SIP carrier with provisioned Shared Number Authority (SNA) routing or DID porting capability, external health check orchestrator (AWS Lambda, Azure Function, or internal middleware), TLS/SRTP certificate management pipeline, and a monitoring dashboard capable of parsing SIP OPTIONS responses.

The Implementation Deep-Dive

1. Architect the Trunk Health Monitoring & Failover Logic

Carrier failover is not a binary switch. It is a state machine that evaluates signaling health, media path integrity, and routing table propagation before committing to a cutover. Genesys Cloud evaluates trunk health through SIP registration status, OPTIONS ping responses, and TCP/UDP keepalive intervals. NICE CXone uses a similar SIP health engine with configurable threshold timers. You must configure the trunk group to operate in Active/Standby mode rather than Round Robin. Round Robin distributes load but provides zero deterministic failover guarantees during a partial outage.

Configure the primary trunk with a health check interval of fifteen seconds and a failure threshold of three consecutive timeouts. Set the secondary trunk to Standby with a promotion threshold of two consecutive timeouts. This asymmetry prevents flapping when the primary carrier experiences intermittent packet loss. The platform will automatically promote the secondary trunk when the primary crosses the failure threshold, but you must explicitly disable automatic demotion. Automatic demotion causes mid-call routing splits when the primary carrier recovers while active calls still traverse the secondary path.
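
The threshold behavior described above can be sketched as a small state machine. This is a minimal illustration, not platform code: the class name FailoverState and its fields are hypothetical, and the defaults simply mirror the fifteen-second interval and three-timeout threshold from the text.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    """Hypothetical tracker: counts consecutive timeouts and latches onto the secondary."""
    failure_threshold: int = 3   # consecutive timeouts before promotion (from the text)
    check_interval_s: int = 15   # seconds between health checks (from the text)
    consecutive_failures: int = 0
    active: str = "primary"

    def record_check(self, primary_ok: bool) -> str:
        """Feed one health-check result; return the trunk that should carry traffic."""
        if primary_ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "secondary"
        # No automatic demotion: a healthy check never moves traffic back.
        return self.active

    def approx_detection_s(self) -> int:
        """Approximate time from outage start to promotion, ignoring probe timeouts."""
        return self.failure_threshold * self.check_interval_s
```

With the defaults, approx_detection_s() returns 45, so it is worth comparing the chosen interval and threshold against the thirty-second cutover target before committing to them.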

The Trap: Relying exclusively on SIP signaling health while ignoring media path degradation. Carriers frequently drop RTP streams before tearing down SIP dialogs. If your health check only validates SIP 200 OK responses to OPTIONS requests, the platform will declare the trunk healthy while agents experience one-way audio or dropped calls. You must implement a dual-validation strategy. Configure the trunk to send SIP OPTIONS with a minimal media payload, or deploy an external SIP monitor that validates both control plane responses and RTP connectivity. In Genesys Cloud, enable Media Path Validation on the trunk object and set the RTP Timeout to four seconds. In CXone, enable SIP Media Health Check in the carrier profile. This forces the platform to verify that an RTP stream can actually be established before it considers the trunk available.
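
For the external-monitor option, the dual-validation rule reduces to requiring both checks to pass. The sketch below assumes a hypothetical ProbeResult record produced by your own SIP monitor; the four-second default matches the RTP timeout mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical output of an external SIP/RTP probe."""
    options_status: Optional[int]  # SIP status of the OPTIONS reply, None on timeout
    rtp_gap_s: Optional[float]     # seconds since the last RTP packet, None if no media seen


def trunk_available(probe: ProbeResult, rtp_timeout_s: float = 4.0) -> bool:
    """A trunk counts as available only when signaling AND media both look healthy."""
    signaling_ok = probe.options_status == 200
    media_ok = probe.rtp_gap_s is not None and probe.rtp_gap_s <= rtp_timeout_s
    return signaling_ok and media_ok
```

A probe that returns 200 OK but has seen no RTP is reported as unavailable, which is the failure mode this trap describes.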

Architectural Reasoning: We separate health detection from routing execution. The platform handles SIP registration and OPTIONS validation, but we enforce explicit thresholds to prevent routing oscillation. By configuring asymmetric thresholds and disabling automatic demotion, we guarantee that once the secondary carrier activates, it remains active until an administrator or an automated workflow explicitly returns traffic to the primary path. This design eliminates the split-brain routing condition that occurs when the platform tries to balance traffic across a recovering but unstable primary carrier.

2. Configure the Secondary Carrier SIP Endpoint & Number Management

The secondary carrier must be provisioned with identical codec profiles, encryption standards, and NAT traversal settings as the primary. Mismatched configuration is the leading cause of failover failure. Create the secondary trunk object with the exact same Codec Order array as the primary. If the primary uses G.711u, G.729, G.711a, the secondary must match this sequence exactly. Codec mismatch forces the platform to renegotiate during failover, introducing latency that exceeds the thirty-second cutover target.
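
One way to catch mismatches before an outage is to diff the two trunk configurations on a schedule. The sketch below is an assumption-laden illustration: the dictionary keys (codecs, srtp, nat_traversal) are hypothetical names for settings exported from your own tooling, not platform field names.

```python
from typing import Any, Dict, Sequence, Tuple


def config_drift(
    primary: Dict[str, Any],
    secondary: Dict[str, Any],
    keys: Sequence[str] = ("codecs", "srtp", "nat_traversal"),  # hypothetical setting names
) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (primary_value, secondary_value)} for every setting that differs.

    Lists compare element by element, so a reordered codec list counts as drift.
    """
    return {
        key: (primary.get(key), secondary.get(key))
        for key in keys
        if primary.get(key) != secondary.get(key)
    }
```

An empty result means the compared settings match; any entry names the setting to reconcile before relying on the secondary trunk.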

Configure TLS/SRTP with mutual authentication if your secondary carrier requires it. Upload the carrier CA certificate to the platform certificate store and reference it in the trunk TLS settings. Disable strict certificate pinning during initial testing, then enforce it in production. Configure NAT Traversal to Use External IP and set the SIP Port to match the carrier requirement. Enable Early Media only if the carrier explicitly supports it. Early media misconfiguration causes tone swallowing during failover cutover.

The Trap: Shared Number Authority (SNA) routing propagation delays. When you activate the secondary carrier, the platform routes traffic to the secondary SIP endpoint, but the carrier network may not immediately recognize the DIDs as valid for that endpoint. SNA routing tables update asynchronously across carrier networks. If you fail over before the SNA tables propagate, the secondary carrier returns 404 Not Found or 480 Temporarily Unavailable. You must pre-warm the SNA routes. Work with your secondary carrier to provision all DIDs in their SNA routing table before the failover event. Validate propagation by placing test calls to the secondary trunk directly using a SIP dialer before attempting platform-level activation. In Genesys Cloud, verify SNA readiness by checking the Number Management dashboard for SNA Status: Active on each DID. In CXone, validate through the Number Administration console.

Architectural Reasoning: We treat number management as a distributed routing problem rather than a static assignment. DIDs are not owned by the platform; they are routed through carrier networks based on least-cost and availability logic. By pre-warming SNA routes and validating propagation, we eliminate the routing gap between platform activation and carrier acceptance. This approach ensures that the secondary carrier can immediately accept inbound traffic without requiring DID porting delays or manual carrier-side routing updates.

3. Build the API-Driven Activation & State Management Workflow

Manual activation through the administrative console introduces human latency and error. You must implement an automated or semi-automated activation workflow using the platform REST APIs. The workflow monitors trunk health metrics, evaluates failure thresholds, and programmatically updates routing rules to point to the secondary trunk group. This approach provides auditability, idempotency, and integration with external incident management systems.

Create an OAuth 2.0 service account with the required scopes. Deploy an orchestrator function that polls the trunk health endpoint at ten-second intervals. When the primary trunk crosses the failure threshold, the orchestrator executes a routing rule update to swap the trunk reference. The API call modifies the routing rule configuration to point to the secondary trunk group while preserving all downstream queue assignments and IVR logic.
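
The polling behavior can be sketched as a loop with the platform calls injected as functions. Everything named here is hypothetical: fetch_primary_ok and activate_secondary stand in for your own wrappers around the health and routing APIs, and max_polls exists only so the loop can be exercised in tests.

```python
import time
from typing import Callable, Optional


def run_orchestrator(
    fetch_primary_ok: Callable[[], bool],      # hypothetical wrapper around the health endpoint
    activate_secondary: Callable[[], None],    # hypothetical wrapper around the routing update
    failure_threshold: int = 3,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> bool:
    """Poll primary health; activate the secondary after N consecutive failures.

    Returns True if activation was triggered, False if max_polls ran out first.
    """
    failures = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        failures = 0 if fetch_primary_ok() else failures + 1
        if failures >= failure_threshold:
            activate_secondary()
            return True
        sleep(poll_interval_s)
    return False
```

Injecting the calls keeps the loop free of credentials and HTTP details, so the threshold logic can be tested without touching a live trunk.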

Production API Example: Update routing rule to activate secondary trunk group

PUT /api/v2/telephony/routes/rules/{routingRuleId}
Authorization: Bearer <access_token>
Content-Type: application/json

{
  "enabled": true,
  "name": "Inbound_Customer_Failover_Rule",
  "description": "Routes to secondary trunk group during primary outage",
  "condition": {
    "type": "and",
    "expressions": [
      {
        "key": "system.caller.id",
        "operation": "exists"
      }
    ]
  },
  "action": {
    "type": "routeToTrunkGroup",
    "trunkGroupId": "secondary_trunk_group_id_here",
    "queueId": "primary_inbound_queue_id",
    "fallback": {
      "type": "routeToQueue",
      "queueId": "overflow_queue_id"
    }
  },
  "priority": 10
}

Implement idempotency keys in your orchestrator to prevent duplicate updates during network jitter. Store the previous routing state in a persistent database so the orchestrator can reverse the change when the primary carrier recovers. Set a cooling period of ninety seconds after activation before allowing demotion. This prevents the orchestrator from reacting to transient recovery spikes.
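
A minimal sketch of these three safeguards follows, assuming an in-memory dictionary as a stand-in for the persistent database and a hypothetical apply_rule callable that performs the routing update. The incident identifier serves as the idempotency key.

```python
import time
from typing import Callable, Dict


class ActivationController:
    """Hypothetical controller: idempotent activation, saved state, cooling period."""

    COOLDOWN_S = 90  # cooling period from the text

    def __init__(self, apply_rule: Callable[[str], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._apply_rule = apply_rule
        self._clock = clock
        self.store: Dict[str, object] = {}  # stand-in for a persistent database

    def activate(self, incident_id: str, current_group: str, secondary_group: str) -> bool:
        """Swap to the secondary group once per incident; repeat calls are no-ops."""
        if self.store.get("incident_id") == incident_id:
            return False
        self._apply_rule(secondary_group)
        # Record state only after the update succeeds, so a failed call can be retried.
        self.store.update(incident_id=incident_id, previous_group=current_group,
                          activated_at=self._clock())
        return True

    def demote(self) -> bool:
        """Restore the saved group, but only once the cooling period has elapsed."""
        activated_at = self.store.get("activated_at")
        if activated_at is None or self._clock() - activated_at < self.COOLDOWN_S:
            return False
        self._apply_rule(self.store["previous_group"])
        self.store.clear()
        return True
```

A production version would persist the store durably and guard against concurrent orchestrator instances; both are omitted here for brevity.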

The Trap: Race conditions during API updates causing mid-call routing splits. If the orchestrator updates the routing rule while active calls are still processing in the primary trunk, the platform may attempt to route mid-call transfers or callbacks through the secondary path. This creates orphaned media streams and routing loops. You must implement a circuit breaker in the orchestrator that checks for active call counts before executing the swap. Query the GET /api/v2/analytics/queues/realtime endpoint to verify active call volume. Only execute the routing swap when active calls fall below a defined threshold. If calls remain active, queue the activation request and notify the incident management system for manual intervention.
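
The circuit breaker can be sketched as a guard around the swap. The callables get_active_calls, swap, and notify are hypothetical stand-ins for your analytics query, routing update, and incident-management hook; the default limit of five calls is an arbitrary illustration.

```python
from typing import Callable


def guarded_swap(
    get_active_calls: Callable[[], int],   # hypothetical: returns current active call count
    swap: Callable[[], None],              # hypothetical: performs the routing update
    notify: Callable[[str], None],         # hypothetical: alerts incident management
    max_active_calls: int = 5,
) -> bool:
    """Execute the swap only when active calls are below the limit.

    Returns True if the swap ran, False if it was deferred and a notification sent.
    """
    active = get_active_calls()
    if active < max_active_calls:
        swap()
        return True
    notify(f"Routing swap deferred: {active} active calls (limit {max_active_calls})")
    return False
```

A caller that receives False can requeue the request and retry on the next poll, which matches the queue-and-notify behavior described above.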

Architectural Reasoning: We treat routing configuration as mutable state managed by an external control plane. The platform handles call processing, but the orchestrator manages state transitions. By implementing idempotency, cooling periods, and active call validation, we guarantee that routing updates occur at safe boundaries. This design eliminates the split-routing condition that occurs when the platform processes both primary and secondary traffic simultaneously during a cutover.

4. Implement Routing Matrix Swaps in Architect/Studio

The IVR flow must adapt dynamically to the active carrier. Hardcoding trunk identifiers in flow conditions causes routing failures during failover. When the secondary carrier activates, the platform routes traffic through the secondary trunk, but the Architect flow may still evaluate conditions that expect the primary trunk name. This mismatch causes calls to drop or route to error nodes.

Abstract trunk identification in your Architect flows. Use system.trunk.group.name instead of system.trunk.name. Configure conditional routing based on trunk group membership rather than individual trunk identifiers. Implement a fallback path that bypasses trunk-specific logic when the active trunk group matches the secondary identifier.

Production Architect Expression Example:

// Select the IVR path from the active trunk group (illustrative logic;
// system.trunk.group.name is the flow variable referenced above).
function selectIvrPath(activeTrunkGroup) {
  if (activeTrunkGroup === "Primary_Carrier_Group") {
    // Route to standard IVR path
    return "Standard_IVR_Path";
  }
  if (activeTrunkGroup === "Secondary_Carrier_Group") {
    // Route to optimized failover path with reduced menu depth
    return "Failover_IVR_Path";
  }
  // Fallback to overflow queue
  return "Overflow_Queue_Path";
}

const ivrPath = selectIvrPath(system.trunk.group.name);

Configure the failover IVR path to reduce menu depth and disable non-essential integrations. During carrier outages, network latency increases and third-party API response times degrade. Strip the IVR down to essential routing logic. Disable speech recognition, CRM lookups, and non-critical notifications in the failover path. This reduces call handling time and prevents cascade failures when external dependencies time out.

The Trap: Hardcoding carrier-specific DTMF tones or SIP headers in flow conditions. Secondary carriers often use different SIP header formats or DTMF relay methods (RFC 2833 telephone-events, since superseded by RFC 4733, vs. SIP INFO). If your flow expects specific SIP headers or DTMF formats, it will fail when traffic routes through the secondary carrier. You must normalize DTMF handling at the trunk group level. Configure both trunk groups to use RFC 2833 with SIP INFO fallback. Disable carrier-specific header parsing in the flow. Use platform-native variables that abstract carrier differences.

Architectural Reasoning: We treat the IVR as a stateless routing engine that adapts to the underlying transport layer. The flow does not care which carrier provides the traffic; it only cares about the destination queue and customer intent. By abstracting trunk identification and normalizing DTMF handling, we guarantee that the IVR operates identically regardless of the active carrier. This approach eliminates the configuration drift that occurs when teams optimize flows for primary carrier behavior without testing secondary carrier compatibility.

Validation, Edge Cases & Troubleshooting

Edge Case 1: SIP Flapping During Partial Outage

The Failure Condition: The platform alternates between primary and secondary trunks every thirty seconds during a carrier degradation event. Agents experience call drops, and the orchestrator logs repeated activation/deactivation cycles.

The Root Cause: Aggressive health check intervals combined with carrier load balancing. The primary carrier responds to OPTIONS pings inconsistently due to internal routing table updates. The platform interprets intermittent timeouts as complete failure, triggering failover. When the secondary trunk activates, the primary recovers, triggering demotion. This cycle repeats until the carrier stabilizes.

The Solution: Implement hysteresis timers in the health check configuration. Increase the primary trunk failure threshold to five consecutive timeouts and extend the check interval to thirty seconds. Note the trade-off: five timeouts at thirty-second intervals delays outage detection to roughly 150 seconds, so apply these values only where flapping is a worse problem than slower cutover. Disable automatic demotion entirely. Configure the orchestrator to require a manual approval token or a sustained recovery period of five minutes before executing demotion. This design forces the platform to commit to the secondary path during degradation events, eliminating oscillation.
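
The demotion rule (sustained recovery or manual approval) can be sketched as a gate that the orchestrator consults before returning traffic to the primary. RecoveryGate is a hypothetical name, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from typing import Optional


class RecoveryGate:
    """Hypothetical hysteresis gate: allow demotion only after sustained primary health."""

    def __init__(self, required_healthy_s: float = 300.0) -> None:
        self.required_healthy_s = required_healthy_s  # five minutes, per the text
        self._healthy_since: Optional[float] = None

    def observe(self, primary_ok: bool, now_s: float) -> None:
        """Record one health-check result taken at time now_s (seconds)."""
        if not primary_ok:
            self._healthy_since = None      # any failure restarts the recovery window
        elif self._healthy_since is None:
            self._healthy_since = now_s

    def may_demote(self, now_s: float, manual_approval: bool = False) -> bool:
        """True when an operator approved, or the primary stayed healthy long enough."""
        if manual_approval:
            return True
        return (self._healthy_since is not None
                and now_s - self._healthy_since >= self.required_healthy_s)
```

Because a single failed check resets the window, a primary that recovers intermittently never satisfies the gate on its own.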

Edge Case 2: DID Mismatch with Shared Number Authority

The Failure Condition: Traffic routes to the secondary trunk group, but calls return 404 Not Found or route to the carrier default IVR. The platform logs successful SIP INVITE transmission, but the carrier rejects the call.

The Root Cause: SNA routing table propagation delay. The secondary carrier has not yet updated their global routing tables to recognize the DIDs as valid for the new SIP endpoint. This occurs when failover activates before SNA pre-warming completes.

The Solution: Validate SNA propagation before activating the secondary path. Deploy a validation script that places test calls to the secondary trunk using a SIP dialer. Parse the SIP response codes. Only trigger the API routing swap when all test calls return 200 OK. If SNA propagation fails, fall back to DID porting procedures or activate a temporary vanity number routing table. Document the SNA validation step in your runbook and automate it in the orchestrator pipeline.
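
The response-code check can be sketched as follows, assuming your SIP dialer produces a mapping from each tested DID to the final SIP status it received. The function name sna_ready and the input shape are hypothetical.

```python
from typing import Dict, List, Tuple


def sna_ready(test_call_results: Dict[str, int]) -> Tuple[bool, List[str]]:
    """Decide whether the routing swap may proceed.

    test_call_results maps each tested DID to the final SIP status of its test call.
    Returns (ready, failed_dids); an empty result set is treated as not ready.
    """
    failed = sorted(did for did, status in test_call_results.items() if status != 200)
    ready = bool(test_call_results) and not failed
    return ready, failed
```

Returning the failed DIDs alongside the verdict gives the runbook a concrete list to raise with the secondary carrier.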

Edge Case 3: TLS Certificate Expiry on Secondary Trunk

The Failure Condition: The secondary trunk activates, but all inbound calls drop with 503 Service Unavailable or TLS handshake failures. The platform logs certificate validation errors.

The Root Cause: Automated certificate renewal misconfiguration. The secondary trunk's TLS certificate expired without being renewed, so the platform rejects incoming SIP INVITE messages under strict certificate validation.

The Solution: Implement certificate monitoring with automated renewal alerts. Configure the trunk to use a certificate store that supports automatic rotation. Disable strict certificate pinning during initial deployment, then enforce it with a monitoring pipeline that validates certificate expiry dates. If the secondary trunk certificate expires, fall back to SRTP-only encryption temporarily while renewing the certificate. Update the trunk TLS configuration to reference the new certificate and restart the SIP endpoint.
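
An expiry check for the monitoring pipeline can be sketched with the Python standard library. It assumes the certificate's notAfter value is available in the text form OpenSSL prints (for example "Jan 11 00:00:00 2030 GMT"); the thirty-day warning window is an arbitrary illustration.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now_epoch: Optional[float] = None) -> float:
    """Days remaining before the certificate's notAfter timestamp."""
    expires = ssl.cert_time_to_seconds(not_after)  # parses "Mon DD HH:MM:SS YYYY GMT"
    now = time.time() if now_epoch is None else now_epoch
    return (expires - now) / 86400


def needs_renewal(not_after: str, warn_days: float = 30,
                  now_epoch: Optional[float] = None) -> bool:
    """True when the certificate expires within the warning window (or already has)."""
    return days_until_expiry(not_after, now_epoch) <= warn_days
```

Running this against the secondary trunk's certificate on a schedule surfaces an impending expiry long before a failover event depends on it.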
