Architecting Edge Server High Availability Pairs with Automatic Health Check Failover
What This Guide Covers
This guide details the architectural design, deployment sequence, and failover validation for Genesys Cloud Private Edge High Availability (HA) pairs. By the end of this document, you will have a redundant Edge deployment in which the secondary instance automatically takes over when the primary fails its health checks, keeping SIP trunk and PSTN connectivity available for new calls. Calls in progress on a failed Edge are still dropped.
Prerequisites, Roles & Licensing
- Licensing: Genesys Cloud Private Edge license (included with CX 1/2/3 licenses, but requires explicit Edge seat allocation).
- Roles & Permissions:
- Telephony > Trunk > Edit (to configure Trunk Groups)
- Telephony > Edge > Edit (to manage Edge instances)
- Infrastructure > Server > Edit (for underlying VM management)
- Telephony > Routing > Edit (if configuring local routing dependencies)
- Infrastructure Requirements:
- Two identical virtual machines (VMs) or bare metal servers meeting Genesys Edge hardware specifications (minimum 4 vCPUs, 16GB RAM, 100GB storage recommended for production loads).
- Static IP addresses assigned to both Edge instances.
- A floating IP address (VIP) managed by an external load balancer or keepalived configuration, OR DNS-based failover capability if using Genesys Cloud’s built-in health check routing (recommended for simplicity).
- Network connectivity: Both Edges must have outbound access to Genesys Cloud endpoints (port 443 and ports 5061-5080) and inbound access from your PSTN carrier.
- External Dependencies: PSTN Carrier with SIP trunking capability.
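The outbound port requirement above can be spot-checked from each Edge host before installation. The following is a minimal sketch, assuming Python 3 is available on the host. The function names are illustrative, and the check covers TCP only, so it says nothing about UDP reachability or inbound access from the carrier.

```python
import socket

# Ports from the network connectivity requirement: 443 plus 5061-5080.
REQUIRED_PORTS = [443] + list(range(5061, 5081))


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_ports(host: str, ports=REQUIRED_PORTS, timeout: float = 3.0) -> list:
    """Return the subset of ports that could not be reached over TCP."""
    return [p for p in ports if not port_is_reachable(host, p, timeout)]
```

Call `unreachable_ports(...)` with the Genesys Cloud endpoint hostname for your region. An empty list means every listed port accepted a TCP connection.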
The Implementation Deep-Dive
1. Deploying and Registering the Primary Edge Instance
The foundation of an HA pair is two identical, independently registered Edge instances. Do not attempt to cluster them at the OS level using shared storage or database clustering. Genesys Edge is stateless regarding call data; state is held in the Genesys Cloud platform. Each Edge must be fully functional on its own.
Step 1: Install and Configure the Primary Edge
Download the latest Edge installer package from the Genesys Cloud Admin Portal under Admin > Telephony > Edge. Install the software on the first VM. During installation, you will be prompted for:
- Edge Name: Use a descriptive naming convention, e.g., EDGE-PROD-01.
- API Key: Generate a temporary API key from the Admin Portal to register the Edge.
- License Key: Input your Edge license key.
- Network Settings: Assign the static IP address. Ensure the SIP Trunk port range is open.
Step 2: Configure SIP Trunk Endpoints
In the Genesys Cloud Admin Portal, navigate to Admin > Telephony > Trunks. Create a new SIP Trunk endpoint for the Primary Edge.
- IP Address: Enter the static IP of EDGE-PROD-01.
- Port: 5061 (TLS) or 5060 (UDP/TCP). TLS is mandatory for PCI-DSS compliance.
- Authentication: Configure username/password if required by your carrier.
The Trap: Single Point of Failure in Trunk Configuration
A common architectural error is configuring the PSTN carrier to point to a single IP address. If you configure the carrier to send traffic only to EDGE-PROD-01, and that server fails, all inbound traffic is lost. Even if the Edge fails over, the carrier does not know the new IP unless you update the carrier’s provisioning.
Architectural Reasoning for Redundant Trunk Groups
To solve this, you must configure the PSTN carrier to send traffic to a Virtual IP (VIP) or to both Edge IPs simultaneously with load balancing. However, Genesys Cloud provides a more robust mechanism: Trunk Groups with Health Checks.
When you create a Trunk in Genesys Cloud, you can associate it with a Trunk Group. A Trunk Group allows you to define multiple SIP endpoints (your Edges). Genesys Cloud continuously performs health checks against these endpoints. If the Primary Edge fails the health check, Genesys Cloud automatically routes new calls to the Secondary Edge.
2. Deploying and Registering the Secondary Edge Instance
Step 1: Install and Configure the Secondary Edge
Repeat the installation process on the second VM.
- Edge Name: EDGE-PROD-02.
- API Key: Generate a new temporary API key. Each Edge requires a unique registration.
- License Key: Use the same license key pool, but ensure you have allocated seats for both instances.
- Network Settings: Assign the second static IP address.
Step 2: Configure the Secondary SIP Trunk Endpoint
In the Genesys Cloud Admin Portal, create a second SIP Trunk endpoint.
- IP Address: Enter the static IP of EDGE-PROD-02.
- Port: Match the Primary Edge (e.g., 5061).
- Authentication: Match the Primary Edge credentials if required by the carrier.
The Trap: Mismatched Configuration
If the Primary and Secondary Edges have different SIP configurations (e.g., one uses TLS, the other uses TCP; or different authentication credentials), the failover will fail. The carrier will send traffic to the Secondary Edge, but the Edge will reject it due to protocol mismatch or authentication failure.
Architectural Reasoning for Configuration Parity
Ensure both Edges are configured identically. Copy the configuration from the Primary Edge to the Secondary Edge. This includes:
- SIP Trunk settings
- Codec preferences
- NAT traversal settings (if applicable)
- Local routing rules (if using Local Routing)
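Parity between the two Edges is easier to verify mechanically than by eye. The sketch below assumes you have exported each Edge's settings into a dictionary by whatever means you use. The key names (`transport`, `port`, `codecs`) and the IP addresses are hypothetical placeholders, not Genesys Cloud field names.

```python
def config_mismatches(primary: dict, secondary: dict, ignore=("name", "ip_address")) -> dict:
    """Return {key: (primary_value, secondary_value)} for every setting that differs.

    Keys listed in `ignore` are expected to differ between the two Edges.
    """
    keys = (set(primary) | set(secondary)) - set(ignore)
    return {
        key: (primary.get(key), secondary.get(key))
        for key in sorted(keys)
        if primary.get(key) != secondary.get(key)
    }


# Hypothetical exported settings; the transport mismatch is deliberate.
primary = {"name": "EDGE-PROD-01", "ip_address": "192.0.2.11",
           "transport": "TLS", "port": 5061, "codecs": ["PCMU", "PCMA"]}
secondary = {"name": "EDGE-PROD-02", "ip_address": "192.0.2.12",
             "transport": "TCP", "port": 5061, "codecs": ["PCMU", "PCMA"]}

print(config_mismatches(primary, secondary))  # {'transport': ('TLS', 'TCP')}
```

An empty result means the compared settings match. Codec lists are compared in order, so a different preference order is reported as a mismatch.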
3. Configuring the Trunk Group for Automatic Failover
This is the critical step where High Availability is realized. You will group the two SIP Trunk endpoints into a single logical Trunk Group and enable health checks.
Step 1: Create a Trunk Group
Navigate to Admin > Telephony > Trunks > Trunk Groups. Click Add Trunk Group.
- Name: TRUNK-GROUP-PROD-HA.
- Description: HA Pair for PROD Edges.
Step 2: Add SIP Endpoints to the Trunk Group
Click Add SIP Endpoint.
- Select the SIP Trunk created for EDGE-PROD-01.
- Set Priority: 1 (Highest priority).
- Select the SIP Trunk created for EDGE-PROD-02.
- Set Priority: 2 (Lower priority).
Step 3: Enable Health Checks
In the Trunk Group settings, locate Health Check.
- Enable Health Check: Check the box.
- Health Check Interval: Set to 30 seconds. This is the frequency at which Genesys Cloud probes the Edge.
- Health Check Timeout: Set to 5 seconds.
- Failure Threshold: Set to 3 consecutive failures. This prevents flapping. If an Edge fails one check, it is not immediately marked down. It must fail three consecutive checks before being removed from the active pool.
- Recovery Threshold: Set to 5 consecutive successes. This ensures the Edge is stable before returning it to the active pool.
The Trap: Aggressive Health Check Thresholds
Setting the Failure Threshold to 1 causes “flapping.” Network blips can cause a single health check to fail. If the Edge is removed from the pool after one failure, and then recovers instantly, it may be re-added immediately. This constant toggling causes call drops and instability. Always use a threshold of at least 3 failures.
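The effect of the two thresholds can be illustrated with a small state machine. This is a simplified model of the behavior described above, not Genesys Cloud's actual implementation. The class and function names are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointHealth:
    """Tracks one endpoint's status from a stream of probe results."""
    failure_threshold: int = 3
    recovery_threshold: int = 5
    healthy: bool = True
    # Consecutive probe results that contradict the current status.
    _streak: int = field(default=0, init=False, repr=False)

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the (possibly updated) status."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with current status; reset the streak
            return self.healthy
        self._streak += 1
        limit = self.failure_threshold if self.healthy else self.recovery_threshold
        if self._streak >= limit:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


def transitions(results, **thresholds) -> int:
    """Count how many times the status flips over a sequence of probe results."""
    endpoint = EndpointHealth(**thresholds)
    flips = 0
    for ok in results:
        before = endpoint.healthy
        if endpoint.record(ok) != before:
            flips += 1
    return flips


# A link that drops every other probe.
blips = [True, False] * 6
print(transitions(blips, failure_threshold=1, recovery_threshold=1))  # 11
print(transitions(blips, failure_threshold=3, recovery_threshold=5))  # 0
```

With both thresholds at 1, the endpoint changes state on almost every probe. With thresholds of 3 and 5, the same sequence never accumulates enough consecutive failures to trigger a failover.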
Architectural Reasoning for Priority-Based Failover
By setting EDGE-PROD-01 to Priority 1 and EDGE-PROD-02 to Priority 2, you ensure that all traffic flows through the Primary Edge under normal conditions. The Secondary Edge remains idle, ready to accept traffic. This is an “Active-Passive” model.
If you want an “Active-Active” model, set both priorities to 1. Genesys Cloud will load balance traffic between the two Edges. However, Active-Active requires careful capacity planning to ensure neither Edge is overloaded. For most HA deployments, Active-Passive is preferred for simplicity and clear failover paths.
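The two models differ only in how priorities are assigned, which the following sketch makes concrete. It assumes that a lower number means higher priority, as in the steps above, and that ties are split at random. The guide does not specify the actual load-balancing algorithm, so the random tie-break is an assumption.

```python
import random


def eligible_endpoints(endpoints):
    """Return the healthy endpoints that share the best (lowest) priority number."""
    healthy = [e for e in endpoints if e["healthy"]]
    if not healthy:
        return []
    best = min(e["priority"] for e in healthy)
    return [e for e in healthy if e["priority"] == best]


def pick_endpoint(endpoints, rng=random):
    """Pick one eligible endpoint by name; equal priorities are split at random."""
    candidates = eligible_endpoints(endpoints)
    return rng.choice(candidates)["name"] if candidates else None


active_passive = [
    {"name": "EDGE-PROD-01", "priority": 1, "healthy": True},
    {"name": "EDGE-PROD-02", "priority": 2, "healthy": True},
]
print(pick_endpoint(active_passive))  # EDGE-PROD-01

# Simulate the Primary Edge being marked down by the health check.
active_passive[0]["healthy"] = False
print(pick_endpoint(active_passive))  # EDGE-PROD-02
```

Setting both priorities to 1 makes both Edges eligible at once, which is the Active-Active case. Each Edge must then be sized to carry the full load alone if the other fails.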
4. Configuring PSTN Carrier Routing
The final piece is ensuring the PSTN carrier sends traffic to the correct destination.
Option A: Virtual IP (VIP) via Load Balancer
If you have a hardware load balancer (e.g., F5, Citrix) or a software load balancer (e.g., HAProxy, Keepalived) in front of your Edges:
- Configure the load balancer to listen on the VIP.
- Configure the load balancer to forward SIP traffic to both Edge IPs.
- Configure the load balancer to perform health checks on the Edges.
- Configure the PSTN carrier to send traffic to the VIP.
Option B: Genesys Cloud Health Checks (Recommended)
If you do not have a load balancer, you can rely on Genesys Cloud’s health checks.
- Configure the PSTN carrier to send traffic to both Edge IPs simultaneously.
- Configure the carrier to use “Round Robin” or “Weighted” load balancing.
- Genesys Cloud will handle the failover logic. If the Primary Edge fails, Genesys Cloud will stop sending new calls to it. However, the carrier may still send calls to the failed Edge.
The Trap: Carrier-Side Failover Latency
If you rely on the carrier to detect the failure of the Primary Edge, there is a significant delay. Carriers typically perform health checks every 60-120 seconds. During this time, calls will be dropped.
Architectural Reasoning for Genesys Cloud-Side Failover
By using Genesys Cloud’s Trunk Group health checks, failover time is governed by thresholds you control rather than by the carrier’s polling schedule. Genesys Cloud checks every 30 seconds. After 3 consecutive failures (approximately 90 seconds), the Edge is marked down and new calls are routed to the Secondary Edge. Calls in progress on the failed Edge will drop, but new calls will connect.
To minimize call drops, ensure your carrier supports SIP OPTIONS or SIP PING health checks. Configure the carrier to send frequent health checks to the Edges. If the carrier detects a failure, it can stop sending traffic to the failed Edge immediately.
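To confirm that an Edge answers SIP OPTIONS before asking the carrier to enable such checks, you can send a probe yourself. The sketch below builds a minimal OPTIONS request following RFC 3261 and sends it over plain UDP. The `probe` user name and the function names are illustrative. It does not cover TLS on port 5061, and some SIP stacks may ignore or reject requests from unknown sources.

```python
import socket
import uuid


def build_options(target_host: str, target_port: int,
                  local_host: str, local_port: int) -> bytes:
    """Build a minimal SIP OPTIONS request (RFC 3261) for a UDP probe."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch magic cookie
    lines = [
        f"OPTIONS sip:{target_host}:{target_port} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}:{target_port}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "", "",  # blank line terminates the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(target_host: str, target_port: int = 5060, timeout: float = 5.0) -> bool:
    """Send one OPTIONS over UDP; treat any SIP response as a sign of life."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.connect((target_host, target_port))  # fixes the local address/port
            local_host, local_port = sock.getsockname()
            sock.send(build_options(target_host, target_port, local_host, local_port))
            return sock.recv(4096).startswith(b"SIP/2.0")
    except OSError:
        # Timeout, ICMP unreachable, or resolution failure.
        return False
```

The default timeout of 5 seconds mirrors the Health Check Timeout used in this guide. Any response, including an error status, shows the SIP stack is running, which is usually what a liveness probe needs.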
5. Validating the Failover Process
You must test the failover process in a controlled environment before going to production.
Step 1: Simulate a Failure
On EDGE-PROD-01, stop the Edge service.
sudo systemctl stop genesys-edge
Step 2: Monitor Health Checks
In the Genesys Cloud Admin Portal, navigate to Admin > Telephony > Trunks > Trunk Groups. Select TRUNK-GROUP-PROD-HA.
- Observe the Status column for the SIP Endpoints.
- The Primary Edge (EDGE-PROD-01) should change from Healthy to Unhealthy after 3 consecutive failures (approximately 90 seconds).
- The Secondary Edge (EDGE-PROD-02) should remain Healthy.
Step 3: Test Call Routing
Place a test call to your SIP Trunk number.
- The call should connect successfully.
- Verify that the call is handled by EDGE-PROD-02. You can check the Call Logs in Genesys Cloud to see which Edge processed the call.
Step 4: Restore the Primary Edge
On EDGE-PROD-01, start the Edge service.
sudo systemctl start genesys-edge
Step 5: Monitor Recovery
- Observe the Status column for the Primary Edge.
- It should change from Unhealthy to Healthy after 5 consecutive successes (approximately 150 seconds).
- New calls will begin to route to EDGE-PROD-01 again, as it has higher priority.
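The approximate 90-second and 150-second figures used in this walkthrough are the health check interval multiplied by the relevant threshold. The sketch below reproduces them and adds a tighter estimate for detection time. The bounds assume probes fire on a fixed schedule and that a failed probe is counted only when its timeout expires, which is a modeling assumption and not documented platform behavior.

```python
def approx_seconds(interval_s: int, threshold: int) -> int:
    """Rule-of-thumb window: interval multiplied by the number of checks required."""
    return interval_s * threshold


def detection_bounds(interval_s: int, timeout_s: int, threshold: int):
    """Return (best, worst) seconds from an Edge going silent until it is marked down.

    Best case: the Edge fails just before a probe is sent.
    Worst case: the Edge fails just after a probe has succeeded.
    """
    best = (threshold - 1) * interval_s + timeout_s
    worst = threshold * interval_s + timeout_s
    return best, worst


print(approx_seconds(30, 3))       # 90  (mark-down)
print(approx_seconds(30, 5))       # 150 (recovery)
print(detection_bounds(30, 5, 3))  # (65, 95)
```

When timing the test, allow for the full range instead of expecting the status to change at exactly 90 seconds.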
The Trap: Stale State in Carrier Systems
Some carriers cache SIP registration states. Even if Genesys Cloud marks the Primary Edge as healthy, the carrier may still be sending traffic to the Secondary Edge because it has not re-registered.
Architectural Reasoning for Re-Registration
When the Primary Edge restarts, it sends a SIP REGISTER message to the Genesys Cloud platform. Genesys Cloud updates its internal state. However, the carrier does not know this. To force the carrier to re-register, you can:
- Restart the SIP Trunk service on the Primary Edge.
- Configure the carrier to send frequent SIP REGISTER requests.
- Use a load balancer to manage the registration state.
Validation, Edge Cases & Troubleshooting
Edge Case 1: Split-Brain Scenario
The Failure Condition
The network between the two Edges and the Genesys Cloud platform is partitioned. Both Edges believe they are the primary, and both attempt to handle traffic.
The Root Cause
This occurs if the health check mechanism is bypassed or if the carrier is configured to send traffic to both Edges without a central arbiter.
The Solution
Genesys Cloud acts as the central arbiter. As long as both Edges can reach Genesys Cloud, the platform will determine which Edge is healthy. If both Edges lose connectivity to Genesys Cloud, they will stop processing new calls. To prevent split-brain, ensure that the network path between the Edges and Genesys Cloud is highly available. Use multiple internet connections or a dedicated MPLS link.
Edge Case 2: Asymmetric Routing
The Failure Condition
Inbound calls arrive on EDGE-PROD-01, but outbound calls are routed through EDGE-PROD-02. This causes SIP signaling mismatches and call failures.
The Root Cause
This occurs if the carrier sends inbound traffic to one Edge, but Genesys Cloud routes outbound traffic to the other Edge due to load balancing or failover.
The Solution
Configure Genesys Cloud to use SIP Trunk Affinity. This ensures that calls are routed through the same Edge for both inbound and outbound signaling. In the Trunk Group settings, enable Affinity. This binds the call flow to a specific Edge instance.
Edge Case 3: Health Check False Positives
The Failure Condition
The Primary Edge is healthy, but Genesys Cloud marks it as unhealthy due to network latency or packet loss.
The Root Cause
The health check interval or timeout is too aggressive for the network conditions.
The Solution
Increase the Health Check Interval and Timeout values. If the network is unstable, consider increasing the Failure Threshold to 5 or 10. This ensures that transient network issues do not trigger a failover.
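A rough way to size the Failure Threshold is to estimate how often a healthy Edge would fail that many probes in a row. The sketch below assumes each probe is lost independently with the same probability. Real packet loss tends to come in bursts, so treat the results as optimistic lower bounds. The 5% loss rate is an arbitrary example value.

```python
def false_failover_probability(loss_rate: float, failure_threshold: int) -> float:
    """Chance that `failure_threshold` consecutive probes all fail on a healthy Edge,
    assuming each probe fails independently with probability `loss_rate`."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    return loss_rate ** failure_threshold


# Compare thresholds at an example 5% probe loss rate.
for threshold in (1, 3, 5, 10):
    print(threshold, false_failover_probability(0.05, threshold))
```

Each additional required failure multiplies the false-positive probability by the loss rate, but it also adds one health check interval to the time needed to detect a real outage.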