I'm running a load test on our custom BYOC Edge deployment (v2.4.1) and hitting a wall with WebSocket handshake failures. The setup is straightforward: Genesys Cloud initiates the call, routes it to our Edge via a SIP trunk, and we bridge to WebRTC for the client-side experience.
The issue isn’t with low volume. At 50 concurrent calls, everything works fine. Latency is stable, audio is clear. But once we push the JMeter script to simulate 500 simultaneous connection attempts within a 10-second window, about 30% of the WebSocket upgrade requests fail with a 502 Bad Gateway error right at the edge proxy layer.
Here is the specific error trace from the Edge container logs:
[ERROR] ws-handler: Handshake failed for session abc-123-def. Upstream connection reset. HTTP 502 returned to client.
[WARN] load-balancer: Backend pool exhausted. No healthy instances available for routing.
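To see whether the 502s line up with the pool-exhausted warnings, the two message types can be counted per component. This is a minimal sketch, assuming every log line follows the `[LEVEL] component: message` shape of the two lines above; the `tally` helper and the regex are illustrative and not part of any Edge tooling.

```python
import re
from collections import Counter

# Assumed line shape, taken from the two log lines above:
# "[LEVEL] component: message"
LINE_RE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]\s+(?P<component>[\w.-]+):\s+(?P<message>.*)$"
)


def tally(lines):
    """Count log lines per (level, component) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["level"], match["component"])] += 1
    return counts


sample = [
    "[ERROR] ws-handler: Handshake failed for session abc-123-def. "
    "Upstream connection reset. HTTP 502 returned to client.",
    "[WARN] load-balancer: Backend pool exhausted. "
    "No healthy instances available for routing.",
]

counts = tally(sample)
for (level, component), n in sorted(counts.items()):
    print(f"{level:5} {component:15} {n}")
```

If the warning count rises at the same time as the handshake errors, that points at the backend pool rather than at individual pods rejecting upgrades.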
The JMeter config uses a constant throughput timer to spike the requests. We are using the latest Genesys Cloud JavaScript SDK (2.15.0) for the client side. The Edge pods are running on Kubernetes with horizontal pod autoscaling enabled, scaling from 3 to 10 replicas. The CPU usage on the pods spikes to 95% during the test, but memory is well within limits.
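For a rough sense of what each pod sees, here is the arithmetic for the spike at the minimum and maximum replica counts. It is only a sketch: it assumes the load balancer spreads upgrade requests evenly across pods, which may not hold in practice, and `per_pod_load` is a made-up helper name.

```python
def per_pod_load(attempts, window_s, replicas):
    """Return (upgrades per second per pod, total attempts per pod).

    Assumes requests are distributed evenly across replicas.
    """
    rate_per_pod = attempts / window_s / replicas
    total_per_pod = attempts / replicas
    return rate_per_pod, total_per_pod


# 500 attempts in a 10-second window, at the HPA minimum and maximum.
for replicas in (3, 10):
    rate, total = per_pod_load(500, 10, replicas)
    print(f"{replicas} replicas: {rate:.1f} upgrades/s per pod, "
          f"{total:.0f} attempts per pod")
```

At 3 replicas each pod would take roughly 167 attempts, against 50 at 10 replicas. Since the whole spike lasts 10 seconds, the autoscaler may not have added any replicas before the spike is over, so the 3-replica figure is probably the relevant one.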
Is there a specific WebSocket connection limit per Edge pod that I am missing? Or is this a timeout issue where the Genesys Cloud platform is dropping the connection before the Edge can scale up? We tried increasing the idle_timeout and read_timeout in the nginx ingress controller, but the 502s persist.
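One way to separate the two hypotheses is to run the same 500 attempts twice, once as the 10-second spike and once spread over a much longer window, with sessions held open so both runs end at the same concurrency. The sketch below only generates the start offsets and checks concurrency; the 300-second window and 600-second hold time are assumptions chosen for illustration, and the function names are hypothetical.

```python
def even_schedule(total, window_s):
    """Start offsets (seconds) for `total` attempts spread evenly over a window."""
    return [i * window_s / total for i in range(total)]


def concurrent_at(offsets, t, hold_s):
    """Number of sessions open at time `t`, if each stays open for `hold_s` seconds."""
    return sum(1 for start in offsets if start <= t < start + hold_s)


HOLD_S = 600  # assumed session hold time, longer than either ramp

spike = even_schedule(500, 10)   # the current test: 50 attempts per second
slow = even_schedule(500, 300)   # assumed slower ramp: about 1.7 attempts per second

for name, offsets, window_s in (("spike", spike, 10), ("slow", slow, 300)):
    rate = len(offsets) / window_s
    peak = concurrent_at(offsets, window_s, HOLD_S)
    print(f"{name}: {rate:.1f} attempts/s, {peak} concurrent at end of ramp")
```

If the slow ramp reaches 500 concurrent sessions without 502s, a hard per-pod connection limit looks unlikely and the failures are more plausibly tied to arrival rate or scale-up lag. If both runs start failing near the same concurrency, a limit is the more likely explanation.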
I also noticed that the media.server health checks pass, but the signaling.server shows intermittent high latency during the spike. Could this be a signaling bottleneck rather than a media one? Any advice on tuning the Edge ingress controller for high-concurrency WebSocket upgrades would be appreciated. We're currently stuck on this validation step for our capacity planning report.