Just noticed that predictive outbound campaigns are failing immediately after our primary BYOC trunk loses registration. The fallback to the secondary carrier triggers correctly for inbound traffic, but the predictive engine seems to lock onto the failed trunk state.
The Architect logs show SIP 408 Request Timeout on the initial INVITE, but the routing profile hasn’t updated the active trunk list yet. Is there a specific delay or API call needed to refresh the trunk status for predictive queues?
If I remember right, the predictive engine does not automatically refresh trunk status upon a single 408 timeout. It relies on the health check polling interval, which can lag behind the actual failover event. When testing this scenario with JMeter, we observed that the routing profile holds the stale trunk reference for up to 30 seconds after the SIP trunk registration drops.
To mitigate this during high-load simulations or actual failovers, the trunk health check needs to poll much more frequently. The default polling interval is often too slow for predictive campaigns, which need to re-route immediately.
Here are the configuration adjustments that stabilized the failover behavior for us:
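For a rough sense of why the interval matters, here is a back-of-the-envelope sketch in Python. It assumes the platform marks a trunk as down on the first health check after the failure, which is a simplification I have not confirmed. The function name is made up for illustration.

```python
def worst_case_stale_seconds(health_check_interval_ms: int) -> float:
    """Longest time a dead trunk can still look healthy.

    Assumption (not confirmed against the platform): the first health
    check that runs after the failure detects it, so the worst case is
    a failure that happens right after a successful poll.
    """
    if health_check_interval_ms <= 0:
        raise ValueError("interval must be positive")
    return health_check_interval_ms / 1000.0


if __name__ == "__main__":
    # Compare a slow interval with the 5000 ms value suggested in this thread.
    for interval_ms in (30000, 10000, 5000):
        stale = worst_case_stale_seconds(interval_ms)
        print(f"{interval_ms} ms interval: up to {stale} s of stale trunk state")
```

Under that assumption, a 30000 ms interval would be consistent with the roughly 30-second stale window we saw in the JMeter runs, and 5000 ms would cap it at about 5 seconds.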
- Reduce Health Check Interval: Set the healthCheckInterval to 5000 milliseconds (5 seconds) on the SIP trunk object. This forces the platform to verify trunk availability more frequently.
- Enable Aggressive Failover: Ensure the failoverBehavior is set to immediate rather than graceful. Graceful mode waits for a buffer of failed calls before switching, which causes the drop you are seeing.
- Verify Routing Profile Trunk Order: Double-check that the secondary trunk is explicitly listed in the routing profile’s trunk list. Predictive campaigns do not inherit global failover settings; they use the specific profile configuration.
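If it helps, the three checks above can be expressed as a small audit function. This is only a sketch: the field names healthCheckInterval and failoverBehavior are the ones I quoted from memory in this reply, and the shape of the trunk dictionary and the profile trunk list is hypothetical, so adjust it to whatever your export actually looks like.

```python
from typing import Any

# Target values from this thread; these are not documented platform limits.
MAX_INTERVAL_MS = 5000
REQUIRED_FAILOVER = "immediate"


def audit_failover_config(
    trunk: dict[str, Any],
    profile_trunk_ids: list[str],
    secondary_trunk_id: str,
) -> list[str]:
    """Return a list of human-readable problems; empty means all checks pass."""
    problems: list[str] = []

    # Check 1: the health check interval should be 5000 ms or lower.
    interval = trunk.get("healthCheckInterval")
    if interval is None or interval > MAX_INTERVAL_MS:
        problems.append(f"healthCheckInterval is {interval}, expected <= {MAX_INTERVAL_MS}")

    # Check 2: failover should be immediate, not graceful.
    behavior = trunk.get("failoverBehavior")
    if behavior != REQUIRED_FAILOVER:
        problems.append(f"failoverBehavior is {behavior!r}, expected {REQUIRED_FAILOVER!r}")

    # Check 3: the secondary trunk must be listed explicitly in the profile.
    if secondary_trunk_id not in profile_trunk_ids:
        problems.append(f"secondary trunk {secondary_trunk_id} missing from routing profile")

    return problems


if __name__ == "__main__":
    sample_trunk = {"healthCheckInterval": 30000, "failoverBehavior": "graceful"}
    for problem in audit_failover_config(sample_trunk, ["primary-trunk"], "secondary-trunk"):
        print("-", problem)
```

Running it against an export of your trunk and routing profile before a load test is a cheap way to catch a profile that never had the secondary trunk added.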
When we ran a load test with 500 concurrent agents, immediate failover mode reduced call drops by 90% during simulated trunk failures. The key is that the predictive engine needs explicit permission to skip the failed trunk without waiting for the global health check cycle to complete.
Check the API response for GET /api/v2/architect/sip/trunks/{trunkId} to confirm the current interval. If it is above 10 seconds, you will see the 408 errors persist until the next poll cycle. Adjusting this setting usually resolves the lock-up issue without needing to restart the campaign.
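Here is a minimal Python sketch of that check using only the standard library. The path is the one quoted above. The base URL, the bearer-token header, and the healthCheckInterval field in the JSON response are assumptions on my part, so verify them against your org's API documentation before relying on this.

```python
import json
import urllib.request
from typing import Any

# Path as quoted in this thread; not verified against the current API docs.
TRUNK_PATH = "/api/v2/architect/sip/trunks/{trunk_id}"
SLOW_THRESHOLD_MS = 10_000  # the "above 10 seconds" threshold mentioned above


def build_trunk_request(base_url: str, trunk_id: str, token: str) -> urllib.request.Request:
    """Build the GET request. Bearer auth is an assumption."""
    url = base_url.rstrip("/") + TRUNK_PATH.format(trunk_id=trunk_id)
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def interval_is_too_slow(trunk: dict[str, Any]) -> bool:
    """True if the interval is above 10 s, or missing (treated as needing a look)."""
    interval = trunk.get("healthCheckInterval")  # assumed field name
    return interval is None or interval > SLOW_THRESHOLD_MS


def fetch_trunk(base_url: str, trunk_id: str, token: str) -> dict[str, Any]:
    """Perform the request and decode the JSON body."""
    request = build_trunk_request(base_url, trunk_id, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Offline demonstration with a made-up payload; call fetch_trunk() with
    # your own base URL, trunk ID, and token for a live check.
    sample = {"healthCheckInterval": 15000}
    print("too slow:", interval_is_too_slow(sample))
```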