Configuring SIP OPTIONS Keepalive for Proactive Trunk Health Monitoring in Genesys Cloud CX

What This Guide Covers

This guide details the configuration of SIP OPTIONS Keepalive probes on a Genesys Cloud SIP Trunk to detect carrier connectivity loss before active call attempts are made. You will configure the trunk keepalive interval and failure thresholds within the platform settings and integrate these state changes with Event Streams for real-time alerting. The end result is a telephony infrastructure that transitions trunks to a down state automatically upon signal degradation, preventing inbound call failures and reducing busy tone errors during network instabilities.

Prerequisites, Roles & Licensing

Before configuring trunk health monitoring, verify the following environment requirements to ensure successful deployment and operational visibility.

Licensing Requirements

  • Genesys Cloud CX: Requires an active Talk license for SIP Trunk capabilities.
  • Advanced Telephony: Optional but recommended for granular control over trunk routing logic based on state.
  • Event Streams: Required to ingest Trunk State Change events into external monitoring systems or ticketing platforms.

Granular Permissions
Access is restricted by the following permission sets within the Administration console:

  • Telephony > Trunk > Edit: Required to modify Keepalive settings on the SIP Trunk configuration page.
  • Events > Stream > Read: Required to subscribe to Trunk State Change events via API or Event Streams dashboard.
  • Users > View: Required to verify user-level access for troubleshooting configurations.

OAuth Scopes
If automating the verification of trunk status via the Public API, the following OAuth scopes are required:

  • telephony.trunks.read
  • events.streams.read

External Dependencies

  • SIP Carrier Provider: The upstream carrier must support and respond to SIP OPTIONS requests per RFC 3261 standards. Some carriers require specific header configurations or IP whitelisting for OPTIONS traffic.
  • Network Firewall: Ensure port 5060 (UDP, or TCP where the trunk uses it) and TCP port 5061 for TLS allow outbound traffic from Genesys Cloud IP ranges to the carrier endpoint without modification of SIP headers.

The Implementation Deep-Dive

1. Configuring Keepalive Interval and Failure Thresholds

The foundation of proactive health detection lies in the frequency of probe messages and the tolerance for failure before a trunk is marked as unavailable. This configuration exists within the SIP Trunk definition in the Genesys Cloud Administration console.

Navigate to Admin > Telephony > Trunks and select the target SIP Trunk. Locate the Keepalive tab or section within the Trunk settings panel. The default configuration often disables keepalives or sets them to a high interval, assuming that call attempts themselves validate connectivity. This assumption is flawed because active calls consume significant signaling resources and may fail with generic error codes rather than clear carrier outage indicators.

Configure the Keepalive Interval field. This value dictates how frequently a SIP OPTIONS request is sent to the carrier endpoint when no active calls are in progress on that trunk.

  • Recommended Value: 30 seconds.
  • Reasoning: A 30-second interval balances network overhead with detection latency. Setting this below 15 seconds increases signaling load and may trigger carrier rate limiting or IP blocking mechanisms designed to prevent denial-of-service attacks. Setting it above 60 seconds introduces a window where calls may be routed to an unavailable trunk for nearly a minute during a failure event.

Configure the Max Failed Attempts field. This defines how many consecutive OPTIONS requests must fail before the platform changes the Trunk State from “Available” to “Down”.

  • Recommended Value: 3 attempts.
  • Reasoning: This provides hysteresis against transient packet loss. A single dropped packet does not indicate a carrier outage; it usually indicates momentary loss or congestion on the network path. Three consecutive failures confirm a sustained communication breakdown. If you set this to 1, the trunk will flap between states during minor network congestion, causing routing instability for agents attempting outbound calls and for inbound call distribution logic.
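
To make the trade-off between the two settings concrete, the worst-case detection window can be estimated with simple arithmetic. The sketch below assumes probes are sent on a fixed interval and that a failure is counted only once a probe times out. The function name and the `probe_timeout_s` parameter are illustrative and are not Genesys Cloud settings.

```python
def worst_case_detection_seconds(interval_s: float,
                                 max_failed_attempts: int,
                                 probe_timeout_s: float = 0.0) -> float:
    """Upper bound on the time from outage start until the trunk is marked Down.

    Assumption: the outage begins right after a successful probe, so a full
    interval passes before the first failing probe is sent, and the final
    failure is only counted once its timeout expires.
    """
    if interval_s <= 0 or max_failed_attempts < 1:
        raise ValueError("interval must be positive and attempts at least 1")
    return interval_s * max_failed_attempts + probe_timeout_s


# Recommended values from this guide: 30-second interval, 3 failed attempts.
print(worst_case_detection_seconds(30, 3))
```

Under these assumptions, the recommended values give a detection window of roughly 90 seconds plus the timeout of the last probe. Lowering either setting shortens the window at the cost of more signaling load or more state flapping.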

The Trap: Misinterpreting Keepalive Traffic as Call Signaling
A common misconfiguration involves assuming that OPTIONS traffic follows the same routing rules as INVITE traffic. In Genesys Cloud, Keepalive traffic is generated from the platform’s core signaling layer and does not necessarily follow the specific DID routing or outbound route patterns defined for active calls. If your carrier requires a specific “From” header or specific Contact header modification for OPTIONS requests to be accepted, you must configure this under SIP Trunk > Advanced Settings. Failure to do so results in the carrier rejecting the OPTIONS with a 403 Forbidden or 408 Request Timeout error. The platform counts these as failures, marking the trunk as Down even though the carrier is operational. This causes a total loss of inbound capacity for that trunk without any actual network failure occurring.

2. Integrating Trunk State Changes with Event Streams

Detecting the state change is insufficient if the operations team remains unaware. You must configure Event Streams to ingest the telephony.trunks.stateChange event type. This allows downstream systems such as PagerDuty, ServiceNow, or a custom webhook to react immediately when a trunk transitions from Available to Down.

Navigate to Admin > Events > Event Streams and create a new stream subscription.

  • Event Type: telephony.trunks.stateChange
  • Filter: You may filter by specific Trunk ID if monitoring multiple trunks in parallel.
  • Destination: Select your preferred HTTP endpoint or message queue.

The payload for this event contains critical metadata regarding the state transition. Below is a representative JSON payload structure observed during a trunk failure simulation.

{
  "eventType": "telephony.trunks.stateChange",
  "timestamp": 1678902451230,
  "entityId": "trunk-id-abc-123",
  "entityName": "Primary-PSTN-Trunk",
  "state": "DOWN",
  "previousState": "AVAILABLE",
  "reasonCode": "KEEPALIVE_FAILURE",
  "metadata": {
    "failedAttempts": 3,
    "lastSuccessfulKeepalive": 1678902420000
  }
}
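
A consumer of this event might reduce the payload to the few fields an on-call engineer needs. The following is a minimal sketch that assumes the field names shown in the sample above and assumes both timestamps are Unix epoch milliseconds (the values suggest this, but it is not confirmed here). The function name `summarize_state_change` is hypothetical.

```python
import json
from datetime import datetime, timezone

# Copy of the sample payload shown above.
SAMPLE_EVENT = """
{
  "eventType": "telephony.trunks.stateChange",
  "timestamp": 1678902451230,
  "entityId": "trunk-id-abc-123",
  "entityName": "Primary-PSTN-Trunk",
  "state": "DOWN",
  "previousState": "AVAILABLE",
  "reasonCode": "KEEPALIVE_FAILURE",
  "metadata": {"failedAttempts": 3, "lastSuccessfulKeepalive": 1678902420000}
}
"""


def summarize_state_change(raw: str) -> dict:
    """Extract the transition and the gap since the last successful keepalive."""
    event = json.loads(raw)
    last_ok = event.get("metadata", {}).get("lastSuccessfulKeepalive")
    # Assumption: both values are epoch milliseconds.
    gap_ms = event["timestamp"] - last_ok if last_ok is not None else None
    occurred = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
    return {
        "trunk": event["entityName"],
        "transition": f'{event["previousState"]} -> {event["state"]}',
        "occurred_at": occurred.isoformat(),
        "ms_since_last_ok": gap_ms,
    }


print(summarize_state_change(SAMPLE_EVENT))
```

For the sample payload, the gap between the last successful keepalive and the state change is 31,230 ms.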

The Trap: Ignoring the Reason Code in Payload Parsing
Many integration scripts assume all state changes are critical outages. However, Genesys Cloud also emits stateChange events for administrative actions, such as a manual disablement of the trunk or a license suspension. If your alerting logic treats every DOWN event identically, you will generate false positives when administrators intentionally take trunks down for maintenance. Always parse the reasonCode field in the JSON payload. Values include KEEPALIVE_FAILURE, NETWORK_ERROR, ADMIN_DISABLED, and LICENSE_ISSUE. Configure your alerting logic to trigger only on KEEPALIVE_FAILURE or NETWORK_ERROR unless you require a log of all administrative state changes for audit purposes.
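
The filtering rule described here can be expressed in a few lines. This is a sketch of one possible alert gate, assuming the event has already been decoded into a dictionary with the `state` and `reasonCode` fields from the sample payload. The names `ALERTING_REASONS` and `should_alert` are hypothetical.

```python
# Reason codes that indicate a real connectivity problem, per the text above.
ALERTING_REASONS = {"KEEPALIVE_FAILURE", "NETWORK_ERROR"}


def should_alert(event: dict) -> bool:
    """Return True only for DOWN transitions caused by connectivity failures."""
    return (
        event.get("state") == "DOWN"
        and event.get("reasonCode") in ALERTING_REASONS
    )


# A maintenance action should not page anyone; a keepalive failure should.
print(should_alert({"state": "DOWN", "reasonCode": "ADMIN_DISABLED"}))
print(should_alert({"state": "DOWN", "reasonCode": "KEEPALIVE_FAILURE"}))
```

Events that do not match can still be written to an audit log instead of being discarded.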

3. Implementing Failover Routing Logic

Configuring the keepalive is only half the battle. The platform must know how to react when the trunk enters the Down state. In Genesys Cloud, Trunk State directly influences Outbound Route selection and Inbound Call routing logic. If you have multiple SIP Trunks configured for the same DID set or region, ensure your Outbound Routes are defined with priority ordering that accounts for trunk health.

Navigate to Admin > Telephony > Routing Rules. Ensure that the primary route points to the Trunk ID associated with your Keepalive configuration. Configure a secondary route that points to a backup trunk or carrier. The platform’s routing engine automatically evaluates the State property of the Trunk during call initiation. If the primary Trunk State is DOWN, the system attempts the secondary route without requiring manual intervention or script logic.
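
The priority-based behavior described above can be modeled as follows. This is an illustration of the selection rule only, not the platform's routing engine. The `Route` type, its fields, and `select_route` are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Route:
    priority: int      # lower number is tried first (assumption)
    trunk_id: str
    trunk_state: str   # "AVAILABLE" or "DOWN", as in the event payload


def select_route(routes: Iterable[Route]) -> Optional[Route]:
    """Pick the highest-priority route whose trunk is currently AVAILABLE."""
    for route in sorted(routes, key=lambda r: r.priority):
        if route.trunk_state == "AVAILABLE":
            return route
    return None  # every trunk is down; the call cannot be placed


routes = [
    Route(priority=1, trunk_id="primary-trunk", trunk_state="DOWN"),
    Route(priority=2, trunk_id="backup-trunk", trunk_state="AVAILABLE"),
]
print(select_route(routes))
```

With the primary trunk marked DOWN, the sketch returns the backup route, which mirrors the failover described in this section.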

The Trap: Assuming Automatic Failover Covers All Call Types
A frequent architectural error involves assuming that all traffic flows through the SIP Trunk routing engine equally. For example, some carriers utilize specific SBCs for inbound traffic and different endpoints for outbound traffic. If your Keepalive is configured only on the Inbound SIP Trunk but your Outbound traffic routes through a separate SIP Endpoint configuration, the platform may mark one trunk as Down while the other remains Available. This creates a state where inbound calls fail to connect, but outbound calls succeed. To mitigate this, ensure you configure keepalives on both the Inbound and Outbound SIP Trunks if they utilize distinct signaling paths or carrier endpoints. Additionally, verify that your Call Routing logic does not have hardcoded fallback IPs that bypass the Trunk State check, as custom routing rules sometimes override standard platform health checks.

Validation, Edge Cases & Troubleshooting

Edge Case 1: Carrier Ignores OPTIONS Requests

Some legacy carriers or specific VoIP providers do not implement RFC 3261 OPTIONS handling correctly. They may silently drop the packet or respond with a non-standard header that the Genesys Cloud SIP stack does not recognize as a 200 OK.

The Failure Condition: The Trunk State remains AVAILABLE even though the carrier is unreachable via INVITE.
The Root Cause: The Keepalive mechanism relies on a specific response code. If the carrier returns no response (timeout) or an error code, the platform registers a failure. However, if the carrier silently drops the packet, with no TCP reset and no ICMP unreachable message for UDP, the platform can only detect the failure by timeout, and the timeout logic may vary based on network latency settings.
The Solution: Perform a baseline connectivity test using a SIP client tool such as sipsak or sipdump from a machine within your network to the carrier endpoint. Send an OPTIONS request and analyze the response code. If the carrier does not return 200 OK, you must configure the Keepalive Response Code setting (if available in your specific UI version) or adjust the Timeout settings for Keepalive probes to ensure they do not falsely trigger on slow carriers. In Genesys Cloud, this often involves adjusting the global SIP timeout settings under Admin > Telephony > Global Settings rather than the individual Trunk settings, though specific carrier exceptions may require a support ticket to adjust backend signaling timers.
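
If a dedicated SIP tool is not at hand, the baseline test can be approximated with a short script. The sketch below builds a minimal RFC 3261 OPTIONS request, sends it over UDP, and reports the status code of the first response. It is not how Genesys Cloud generates its probes, and it tests the path from your machine rather than from the Genesys Cloud IP ranges. The host `carrier.example.com`, the `probe` user part, and all function names are placeholders.

```python
import socket
import uuid
from typing import Optional


def build_options(target_host: str, local_host: str,
                  local_port: int = 5060, user: str = "probe") -> bytes:
    """Build a minimal SIP OPTIONS request with the mandatory headers."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 magic cookie prefix
    lines = [
        f"OPTIONS sip:{target_host} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:{user}@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        f"Contact: <sip:{user}@{local_host}:{local_port}>",
        "Content-Length: 0",
        "", "",  # blank line terminates the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_code(response: bytes) -> Optional[int]:
    """Return the numeric status from a SIP status line, or None if malformed."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = first_line.split(" ", 2)
    if len(parts) >= 2 and parts[0] == "SIP/2.0" and parts[1].isdigit():
        return int(parts[1])
    return None


def probe(target_host: str, target_port: int = 5060,
          timeout_s: float = 2.0) -> Optional[int]:
    """Send one OPTIONS over UDP; None means no usable reply (timeout or error)."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout_s)
            sock.connect((target_host, target_port))
            local_host, local_port = sock.getsockname()
            sock.send(build_options(target_host, local_host, local_port))
            return parse_status_code(sock.recv(4096))
    except OSError:  # covers timeouts, DNS failures, and ICMP errors
        return None


if __name__ == "__main__":
    print(probe("carrier.example.com"))  # placeholder carrier address
```

A return value of 200 indicates the carrier answers OPTIONS normally. `None` indicates a silent drop or unreachable endpoint, and any other code shows what the carrier sends instead of 200 OK.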

Edge Case 2: NAT Hairpinning and Source IP Issues

When deploying Genesys Cloud, outbound traffic originates from the platform’s infrastructure IPs. If your network firewall or carrier SBC performs strict source IP validation, it may reject OPTIONS packets that appear to come from different source ports or headers than expected during a failover scenario.

The Failure Condition: Keepalive requests succeed initially but fail intermittently after a specific period or following a system update.
The Root Cause: Network Address Translation (NAT) devices sometimes strip the Via header or modify the Contact header in SIP packets, causing the carrier to reject subsequent OPTIONS requests as malformed.
The Solution: Verify that your firewall does not perform SIP ALG (Application Layer Gateway) inspection on UDP port 5060. SIP ALG often corrupts headers required for stateful keepalive tracking. Disable SIP ALG on all intermediate firewalls between the Genesys Cloud edge and the carrier endpoint. Additionally, ensure the SIP Trunk > Advanced Settings option for Enable NAT is configured correctly based on whether the trunk connects directly to the public internet or through a private network. Incorrect NAT configuration causes the platform to send OPTIONS from the wrong source IP address, leading to immediate rejection by the carrier security policy.

Edge Case 3: State Persistence During Failover

In high-availability deployments, you may have redundant SIP Trunks configured across different Genesys Cloud regions or instances. If a network partition occurs, both instances might perceive the trunk as available when it is actually down, leading to split-brain scenarios where calls are routed inconsistently.

The Failure Condition: Calls are successfully routed to one instance while the other instance attempts to route to the same carrier but fails due to signaling state desynchronization.
The Root Cause: The Keepalive mechanism operates independently on each instance of the platform. If the network path from Instance A to the carrier is down, but Instance B still sees the carrier as reachable, the routing logic may distribute load unevenly or fail over incorrectly.
The Solution: Implement a monitoring layer that aggregates health status from all instances before making external decisions. Use the Event Streams integration mentioned earlier to ingest state changes into a centralized dashboard. If you require cross-region redundancy, configure your Outbound Routes with explicit priority levels that account for regional availability rather than relying solely on Trunk State. Ensure that the carrier endpoint itself supports anycast or load balancing that validates health at the carrier level, not just at the Genesys Cloud edge level. This ensures that if the carrier goes down globally, both instances detect the failure simultaneously.
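
The aggregation layer described above could start from a rule like the one sketched here. It assumes the monitoring system keeps the latest trunk state reported by each instance, for example from the stateChange events. The verdict labels and the `aggregate_trunk_health` function are hypothetical.

```python
from typing import Mapping


def aggregate_trunk_health(states_by_instance: Mapping[str, str]) -> str:
    """Combine per-instance trunk states into a single verdict.

    CARRIER_DOWN: every instance reports DOWN, so the carrier is the likely cause.
    PARTIAL:      only some instances report DOWN, which points at a network
                  path problem between those instances and the carrier.
    HEALTHY:      no instance reports DOWN.
    UNKNOWN:      no data has been received yet.
    """
    if not states_by_instance:
        return "UNKNOWN"
    down = [name for name, state in states_by_instance.items() if state == "DOWN"]
    if len(down) == len(states_by_instance):
        return "CARRIER_DOWN"
    if down:
        return "PARTIAL"
    return "HEALTHY"


print(aggregate_trunk_health({"instance-a": "DOWN", "instance-b": "AVAILABLE"}))
```

Separating the PARTIAL case from CARRIER_DOWN lets the operations team distinguish a regional path failure from a carrier-wide outage before escalating.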
