WebRTC Media Server 503 during JMeter load test on APAC region

Ran into a connection stability problem today with the WebRTC softphone when scaling up concurrent sessions in our Singapore environment. We are using the latest Genesys Cloud Web SDK (v3.2.1) and trying to simulate a peak load of 500 concurrent agents with JMeter. The setup uses a custom Architect flow that triggers a softphone call via the Platform API, specifically POST /api/v2/filters/actions/outboundcall.

The issue manifests after roughly 150 concurrent connections are established. The initial handshake completes successfully, but the media stream fails to initialize. The browser console throws a WebRTC error: Failed to create RTCPeerConnection followed by a generic 503 Service Unavailable response from the media gateway. This is not happening in our QA environment with 50 concurrent users, which makes us suspect a regional capacity limit or a specific WebSocket throttling rule on the APAC edge.

We have checked the network traces, and STUN/TURN server connectivity looks fine. The iceConnectionState changes to failed almost immediately after the SDP exchange. We are also seeing a spike in 429 Too Many Requests on the signaling channel right before the 503s start appearing, which suggests the signaling layer may be choking before the media layer even gets a chance to negotiate.
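One way to check whether the 429s really do precede the 503s is to pull the ordering straight out of the JMeter results file. The following is a rough sketch, assuming the results were saved as CSV with the default header row (so the timeStamp and responseCode columns are present). The SAMPLE rows are made up for illustration and are not taken from the actual test run.

```python
import csv
import io
from collections import defaultdict


def first_seen_by_code(jtl_csv_text):
    """Return {response_code: earliest timeStamp in ms} for JMeter CSV results."""
    first_seen = {}
    for row in csv.DictReader(io.StringIO(jtl_csv_text)):
        code = row["responseCode"]
        ts = int(row["timeStamp"])
        if code not in first_seen or ts < first_seen[code]:
            first_seen[code] = ts
    return first_seen


def counts_per_second(jtl_csv_text, code):
    """Count samples with the given response code, bucketed by whole second."""
    buckets = defaultdict(int)
    for row in csv.DictReader(io.StringIO(jtl_csv_text)):
        if row["responseCode"] == code:
            buckets[int(row["timeStamp"]) // 1000] += 1
    return dict(buckets)


# Illustrative rows only (hypothetical labels and timings).
SAMPLE = """timeStamp,elapsed,label,responseCode,success
1700000000000,120,signal,200,true
1700000001200,95,signal,429,false
1700000001900,80,signal,429,false
1700000003100,400,media,503,false
"""

first = first_seen_by_code(SAMPLE)
if "429" in first and "503" in first:
    print("429 seen before 503:", first["429"] < first["503"])
```

If the first 429 consistently lands a second or two ahead of the first 503 across runs, that points at signaling throttling rather than media capacity as the first thing to chase.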

Has anyone seen similar behavior when pushing the WebRTC softphone beyond 100 concurrent sessions in the APAC region? We are currently using the default WebSocket configuration provided by the SDK. Should we be increasing the maxRetries or adjusting the iceTransportPolicy? The latency to the media server is stable at around 45ms, so network jitter doesn’t seem to be the culprit here. Any insights on how to configure JMeter to handle the WebSocket keep-alives more efficiently during these high-throughput tests would be appreciated.

If you check the docs, they mention that WebRTC media server capacity is region-specific and often hits hard limits during load tests if the max_concurrent_sessions parameter in your Architect flow isn’t aligned with the underlying media server cluster limits. While 503 errors suggest a server-side refusal, they frequently stem from the client failing to negotiate the session within the timeout_ms window, after which the media server drops the connection.

Instead of relying solely on the standard outbound call action, consider implementing a pre-validation step using the /api/v2/communications/rtc/sessions endpoint to check available capacity before initiating the call. This prevents the initial connection attempt from failing at the media layer. Also, verify that your JMeter script includes a randomized jitter interval between connection attempts. Bursting 500 connections simultaneously can overwhelm the signaling server, even if the media servers have headroom. Adjusting the reconnect_attempts and reconnect_interval in the SDK configuration can also help mitigate transient 503s by allowing the client to back off and retry gracefully.
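To make the jitter and back-off idea concrete, here is a minimal sketch of both pieces in plain Python. It is not SDK or JMeter code: the function names, the 429/503 retry set, and the parameters are assumptions for illustration, and the connect callable stands in for whatever actually opens the session.

```python
import random

RETRYABLE = (429, 503)  # assumption: treat these two statuses as transient


def staggered_offsets(n_clients, ramp_seconds, jitter_seconds, rng):
    """Spread start times evenly over ramp_seconds, each nudged by random jitter."""
    step = ramp_seconds / n_clients
    return [i * step + rng.uniform(0, jitter_seconds) for i in range(n_clients)]


def backoff_delay(attempt, base_seconds, cap_seconds, rng):
    """Exponential backoff with full jitter: random delay up to min(cap, base * 2**attempt)."""
    return rng.uniform(0, min(cap_seconds, base_seconds * 2 ** attempt))


def connect_with_retry(connect, max_attempts, base_seconds, cap_seconds, rng, sleep):
    """Call connect() until it returns a non-retryable status or attempts run out.

    Returns (last_status, attempts_used).
    """
    status = None
    for attempt in range(max_attempts):
        status = connect()
        if status not in RETRYABLE:
            return status, attempt + 1
        if attempt < max_attempts - 1:
            # Back off before the next try; no point sleeping after the last one.
            sleep(backoff_delay(attempt, base_seconds, cap_seconds, rng))
    return status, max_attempts
```

Passing rng and sleep in as arguments keeps the sketch deterministic under test. In a real run you would pass random.Random() and time.sleep, and feed the offsets into whatever schedules the client starts.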

The documentation actually says WebRTC capacity limits vary by region and strict adherence to the max_concurrent_sessions parameter is non-negotiable for APAC clusters. Ensure your Architect flow respects these hard limits to avoid 503s.

This has the hallmarks of a classic capacity mismatch between the load test script and the actual WebRTC media server allocation in the APAC region. The 503 error is consistent with the server rejecting the session request because the cluster limit has been exceeded.

The previous suggestions about max_concurrent_sessions are correct, but the payload structure matters. Ensure the outbound call action explicitly sets the media type to WebRTC and respects the timeout.

Here is the corrected payload for the Architect action:

{
 "actionType": "outboundcall",
 "callType": "voice",
 "mediaType": "webrtc",
 "timeoutMs": 30000,
 "maxConcurrentSessions": 150
}

Setting maxConcurrentSessions to 150 aligns with the safe threshold for Singapore clusters during peak load. If you still see 503s, check the genesyscloud_webrtc_media_server settings in Terraform. The capacity might be dynamically scaled down if the underlying infrastructure reports high latency. Increase the JMeter ramp-up period so the media server can provision connections gradually.
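For sizing the ramp-up, the arithmetic is simple: a JMeter Thread Group starts its threads evenly across the ramp-up period, so the period fixes the rate of new connections. A small sketch follows. The 2 connections per second figure is a placeholder assumption, not a documented limit for any region.

```python
import math


def min_ramp_up_seconds(threads, max_new_connections_per_second):
    """Smallest whole-second ramp-up that keeps thread starts at or below the given rate."""
    if threads <= 0 or max_new_connections_per_second <= 0:
        raise ValueError("threads and rate must be positive")
    return math.ceil(threads / max_new_connections_per_second)


def thread_start_interval(threads, ramp_up_seconds):
    """Seconds between consecutive thread starts for a given ramp-up period."""
    return ramp_up_seconds / threads


# Hypothetical target: 500 agents, no more than 2 new connections per second.
ramp = min_ramp_up_seconds(500, 2)
print(ramp, thread_start_interval(500, ramp))
```

With those placeholder numbers the ramp-up comes out at 250 seconds, one new thread every half second. Swap in whatever connection rate the signaling layer tolerates without returning 429s.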