How does SIP keepalive backoff actually work on BYOC trunk failover?

LazyCoder · April 29, 2026, 7:18pm

How does the SIP keepalive retry logic actually behave when a BYOC trunk hits a carrier-side NAT timeout? The Ohio BYOC pool dropped into failover mode around 2 PM ET yesterday. The primary carrier’s session border controller started dropping OPTIONS packets after roughly 90 seconds of idle time. The console shows a cascade of SIP 408 Request Timeout responses followed by a hard SIP 503 Service Unavailable. Architect v2024.3.1 is routing outbound calls through the fl-8821-out flow. The trunk group has the “Use failover on error” toggle switched on, but the secondary Twilio trunk never catches the traffic. Instead, the platform keeps hammering the primary with REGISTER refreshes every 15 seconds. The outbound queue backed up for three hours while agents watched calls bounce to voicemail.

Checked the trunk group settings via GET /api/v2/architect/trunkgroups/ohio-byoc-01. The keepalive interval is set to 30 seconds, but the carrier expects 60. The docs say the platform is supposed to scale the retry window, but the logs show a flat 15-second cadence. Outbound routing rules point to the correct trunk group, but the failover trigger seems stuck on the 408 instead of waiting for the 503. Tried adjusting the SIP URI formatting in the carrier portal. Nothing changed. The secondary trunk just sits there doing jack all, with a healthy 200 OK on its own REGISTER cycle. Console metrics flatline, and the platform doesn’t even attempt a re-route.
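For comparison, this is roughly what I’d expect a scaled retry window to look like next to the flat cadence in the logs. It’s a sketch only: the 15-second base comes from the logs, but the multiplier of 2, the 120-second cap, and the backoff_schedule helper itself are assumptions for illustration, not documented platform values.

```python
def backoff_schedule(base_s, multiplier, attempts, cap_s):
    """Return the wait before each retry, growing geometrically up to cap_s."""
    return [min(base_s * multiplier ** n, cap_s) for n in range(attempts)]


# Hypothetical values: 15 s base (from the logs), x2 multiplier, 120 s cap.
scaled = backoff_schedule(15, 2, 5, 120)
# A multiplier of 1 reproduces the flat cadence the logs actually show.
flat = backoff_schedule(15, 1, 5, 120)

print("scaled:", scaled)
print("flat:  ", flat)
```

If the retry window were scaling, the gaps between attempts in the logs would grow like the first list instead of repeating like the second.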

Just need to know if the keepalive backoff is hardcoded or if there’s a hidden parameter in the trunk group payload that controls the multiplier. Here’s the raw REGISTER header dump from the last failed cycle:

REGISTER sip:34.210.xx.xx SIP/2.0
To: sip:byoc-ohio-01@34.210.xx.xx
From: sip:byoc-ohio-01@34.210.xx.xx;tag=gbk7721
Call-ID: 8842f1a2@34.210.xx.xx
CSeq: 4 REGISTER
Contact: sip:34.210.xx.xx:5060
Expires: 15
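
In case it helps anyone read the dump programmatically, here is a minimal sketch that parses the headers and pulls out Expires. The parse_sip_headers helper is hypothetical and only handles simple one-line headers like the ones above. The Expires value is 15, the same number as the retry cadence in the logs, which might mean the 15-second cycle is the registration refresh timer rather than a keepalive retry. I can’t confirm that from the platform side.

```python
def parse_sip_headers(raw):
    """Split a raw SIP message into (start_line, {header: value})."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    start_line, headers = lines[0], {}
    for ln in lines[1:]:
        # Split on the first colon only, so "sip:..." values stay intact.
        name, _, value = ln.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers


dump = """\
REGISTER sip:34.210.xx.xx SIP/2.0
To: sip:byoc-ohio-01@34.210.xx.xx
From: sip:byoc-ohio-01@34.210.xx.xx;tag=gbk7721
Call-ID: 8842f1a2@34.210.xx.xx
CSeq: 4 REGISTER
Contact: sip:34.210.xx.xx:5060
Expires: 15
"""

start_line, headers = parse_sip_headers(dump)
expires_s = int(headers["Expires"])
observed_cadence_s = 15  # flat retry cadence reported in the logs

print(start_line.split()[0], "Expires =", expires_s)
print("cadence matches Expires:", expires_s == observed_cadence_s)
```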

QueueBreaker · April 29, 2026, 8:08pm

Not my lane. I schedule shifts, not SIP packets. Check the trunk group keepalive interval in Admin.

SignalSentry · May 1, 2026, 8:08pm

The point above is correct that the Admin UI is the first place to check, but it doesn’t show you the actual backoff curve. Since I manage the routing rules, I usually just pull the trunk group config via API to see the exact keepalive settings. You can’t really see the exponential backoff logic in the console, but you can verify the interval.

Here’s how I check the current configuration for that specific trunk group:

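import PureCloudPlatformClientV2

# Point the SDK at your region (us_east_1 is a placeholder) and authenticate.
region = PureCloudPlatformClientV2.PureCloudRegionHosts.us_east_1
PureCloudPlatformClientV2.configuration.host = region.get_api_host()
api_client = PureCloudPlatformClientV2.api_client.ApiClient().get_client_credentials_token(
    "your_client_id", "your_client_secret")

# In the SDK, trunk settings sit under the Telephony Providers Edge API.
edge_api = PureCloudPlatformClientV2.TelephonyProvidersEdgeApi(api_client)
trunk = edge_api.get_telephony_providers_edges_trunkbasesetting("your_trunk_base_settings_id")

# Keepalive values live in the free-form properties dict. The key names are
# an assumption here, so print anything that looks related instead of guessing.
for key, value in (trunk.properties or {}).items():
    if "keep" in key.lower() or "options" in key.lower():
        print(key, value)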

The default is usually 30 seconds. If your carrier drops OPTIONS after 90s of idle, the keepalive has to fire inside that window, so raising the interval past 90s would let the path go idle between pings. It’s less about “backoff” and more about matching what the carrier expects. I’d suggest setting it to the 60s your carrier asks for, or turning OPTIONS keepalives off if the carrier supports the RFC 5626 keepalive mechanisms instead.
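
A keepalive only holds the path open if it fires inside the carrier’s idle window, so whatever interval you pick is worth checking against that 90s figure. Here’s a rough sketch of the check; the keepalive_fits helper and the 10-second safety margin are assumptions for illustration, not platform settings.

```python
def keepalive_fits(interval_s, idle_timeout_s, margin_s=10):
    """True if a keepalive fires at least margin_s before the idle timeout."""
    return interval_s + margin_s <= idle_timeout_s


IDLE_TIMEOUT_S = 90  # idle window reported in the original post

for interval in (30, 60, 85):
    verdict = "ok" if keepalive_fits(interval, IDLE_TIMEOUT_S) else "too slow"
    print(f"{interval:>3}s keepalive vs {IDLE_TIMEOUT_S}s idle timeout: {verdict}")
```

Both 30s and 60s pass with room to spare, while anything close to the timeout fails once you allow for jitter.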