Anyone know why the predictive routing engine seems to ignore the configured queue depth limits when we trigger a manual failover on our byoc trunks? we are running version 2024.1.1 of the platform, managing fifteen byoc trunks across asia pacac regions. the issue is specific to the sg-byoc-04 carrier path. when we force a sequential failover to the secondary carrier, the sip registration state flips to registered within the ten-second timeout we have set in the trunk_failover_policy. however, the predictive routing queue depth for the associated skill group spikes to three hundred percent of the configured threshold before stabilizing. this causes a cascade of abandoned calls because the predictive algorithm continues to dial outbound agents based on the stale queue metrics from the primary carrier. the api call to get /v2/analytics/reporting/queues shows a latency of four hundred milliseconds during the failover window, which is outside the normal two hundred millisecond variance. we have verified that the sip credentials are valid and the outbound routing rules are correctly prioritized. the issue persists even when we disable the credential_rotation setting and set it to static. we suspect there is a race condition between the trunk registration heartbeat and the predictive routing calculation engine. the logs show a warning about concurrent session limits being exceeded, but we have increased the limit to five hundred sessions per trunk. has anyone seen this behavior with other carriers? we are looking for a workaround to pause the predictive routing engine during the failover window to prevent the queue depth anomaly. any insights into the internal timing of the predictive routing recalibration would be appreciated. we need to ensure that the queue depth metrics are accurate before the engine resumes dialing agents. the current behavior is causing significant churn in our agent workforce due to the high abandonment rate during these brief failover events.
{
"routing": {
"queue_depth_check_interval_ms": 1000
}
}
Check your routing policy interval settings in the admin console. Predictive routing often caches state longer than the SIP registration flip, so reducing the check interval helps align the two during failover.
Check your Terraform state for the genesyscloud_routing_emergency_routing_group and associated trunk failover policies. The issue often stems from a mismatch between the SIP registration timeout and the predictive routing cache refresh cycle. When the secondary carrier registers, the routing engine may still hold stale queue depth metrics from the primary path.
Adding a custom data action to flush the routing cache during failover events can help. Alternatively, adjust the queue_depth_check_interval_ms in the routing policy as suggested above, but ensure it aligns with your trunk failover timeout. A 1000ms interval might be too aggressive for high-volume queues, causing unnecessary API calls.
Consider implementing a GitHub Actions workflow to monitor trunk status and trigger a manual cache refresh via the Genesys Cloud CLI. This ensures the predictive engine sees the new carrier path immediately. Validate the configuration in a non-production environment first to avoid impacting live agents.