BYOC Edge Media Server 503 Service Unavailable During Peak Load Simulation

The BYOC Edge Media Server endpoint /mcs/v1/connections is returning HTTP 503 Service Unavailable.

We are running a load test for a Bring Your Own Container (BYOC) deployment in the US-East region to validate media server capacity before a major traffic spike. The Genesys Cloud platform version is 10.5. We are using JMeter 5.6.2 to simulate 2,000 concurrent SIP registrations and media streams hitting the local edge cluster.

The initial registration phase completes successfully, but when we ramp up to 80% of the theoretical capacity, the Media Server starts rejecting new connection attempts with a 503 error. Latency spikes to over 2 seconds before the failures begin. This happens even though CPU and memory utilization on the Kubernetes pods is only at 40%.

We suspect this might be related to the WebSocket handshake limit or the SIP trunk capacity configuration on the edge, but the logs are not very clear.

Here is the payload we are sending for the registration:

{
 "trunkId": "byoc-trunk-prod-01",
 "mediaServerId": "ms-edge-us-east-1a",
 "connectionType": "sip",
 "maxConcurrentCalls": 500
}

Has anyone seen this specific 503 behavior during high-concurrency tests on BYOC? Are there specific rate limits we are missing in the edge configuration?

TL;DR: Monitor queue saturation and agent availability metrics before attributing 503 errors to infrastructure limits.

It depends, but generally the HTTP 503 response from the /mcs/v1/connections endpoint points less to media server capacity than to upstream flow logic or queue configuration. In enterprise environments with high concurrent load, the platform may return a 503 if the associated queue has exhausted its available agent resources or if the flow has entered a recursive loop that prevents proper session termination.

Before scaling the BYOC cluster, verify the Queue Activity view in the Performance dashboard. Look for a spike in Abandoned Calls or Service Level Breaches during the exact timeframe of the 503 errors. If the queue is saturated, the media server cannot establish new connections because the orchestration layer rejects the SIP INVITE before it reaches the media plane.
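To line the 503s up against the queue metrics, it helps to bucket them by minute first. Here is a rough sketch, assuming the JMeter results were saved in the default CSV format with a header row (the timeStamp column in epoch milliseconds and the responseCode column). The function name is made up for illustration, and the buckets are in UTC, so adjust if the dashboard shows local time.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


def count_503_per_minute(jtl_csv_text):
    """Count HTTP 503 samples per UTC minute from JMeter CSV results.

    Assumes a header row containing 'timeStamp' (epoch ms) and 'responseCode'.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(jtl_csv_text)):
        if row["responseCode"] != "503":
            continue
        ts = datetime.fromtimestamp(int(row["timeStamp"]) / 1000, tz=timezone.utc)
        counts[ts.strftime("%Y-%m-%d %H:%M")] += 1
    # Sort by minute so the output reads as a timeline.
    return dict(sorted(counts.items()))
```

You would call it with the file contents, for example count_503_per_minute(open("results.jtl").read()), where results.jtl is a placeholder for your own results file. Minutes with a burst of 503s are the ones to compare against the Queue Activity view.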

Additionally, check the Conversation Detail View for failed sessions. If the Call Direction is inbound and the Queue Name is null, the issue is likely in the Architect flow routing logic, not the media server capacity. Ensure that your flow does not contain a “Wait in Queue” action without a defined timeout or fallback path. A common misconfiguration is setting a queue timeout to zero, which causes immediate rejection under load.

If the queue metrics remain healthy (low abandon rate, high service level), then investigate the BYOC container logs for OOMKilled events. However, based on typical deployment patterns, the platform’s load balancer often returns 503 when the downstream application (Architect flow) cannot process the request within the expected latency window.
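For the OOMKilled check, one option is to dump the pod list as JSON (for example with kubectl get pods -n <namespace> -o json, where the namespace is yours to fill in) and scan the container statuses. The sketch below assumes that output has been saved to a string; the function name is hypothetical, and it only looks at each container's last termination, not the full event history.

```python
import json


def find_oomkilled(pod_list_json):
    """Return (pod, container) pairs whose last termination reason was OOMKilled.

    Expects the JSON produced by 'kubectl get pods -o json' (a PodList).
    """
    hits = []
    for pod in json.loads(pod_list_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # lastState.terminated is absent if the container never restarted.
            terminated = status.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((pod_name, status["name"]))
    return hits
```

An empty list here, combined with the 40% memory figure from the original post, would make an out-of-memory cause less likely.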

Focus on the business impact metrics first. If the service level is maintained, the 503 may be a transient edge case rather than a systemic failure. Adjust the load test to incrementally increase concurrency by 100 users every five minutes to identify the precise breaking point.
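As a rough sketch of that stepped ramp, the helper below lists each step's start minute and concurrency level. The 2,000-user target comes from the original post; the function name and the choice to start at 100 users are assumptions for illustration only.

```python
def ramp_schedule(target_users, step_users=100, step_minutes=5):
    """Return (start_minute, concurrent_users) pairs for a stepped ramp."""
    schedule = []
    users = 0
    minute = 0
    while users < target_users:
        # Never overshoot the target on the final step.
        users = min(users + step_users, target_users)
        schedule.append((minute, users))
        minute += step_minutes
    return schedule
```

With ramp_schedule(2000) this gives 20 steps, and the 1,600-user level (80% of 2,000, if that figure is taken as the theoretical capacity) starts at minute 75. Noting which step first produces a 503 narrows the breaking point to a 100-user window.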

This is actually a known issue with BYOC scaling thresholds. Check the connection limits in the docs here: https://developer.genesys.cloud/media-servers/byoc-scaling