significant data integrity issue within our Genesys Cloud Performance dashboards that requires immediate clarification regarding how queue occupancy metrics are calculated during SIP trunk failover scenarios.
Our environment consists of a primary SIP trunk with a capacity of 500 concurrent sessions and a secondary failover trunk configured for overflow handling. When the primary trunk reaches capacity and initiates failover to the secondary provider, we notice a distinct discrepancy in the ‘Queue Activity’ view. Specifically, the ‘Average Occupancy’ metric for agents in the affected queues spikes artificially high during the transition period, despite actual agent interaction times remaining constant.
This anomaly is particularly pronounced in the ‘Conversation Detail’ views, where calls routed via the secondary trunk show a higher ‘Talk Time’ relative to ‘Wait Time’ compared to those handled by the primary trunk, even though the call handling logic in Architect remains identical for both paths. We suspect this may be related to how the system attributes call duration when the SIP signaling path changes mid-conversation or during the initial setup phase of the failover.
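To put a number on that difference, here is a rough sketch that compares the talk-to-wait ratio per trunk. It assumes the conversation details have already been exported into plain dictionaries; the field names `trunk`, `talk_seconds`, and `wait_seconds` and the sample values are hypothetical, not actual Genesys Cloud field names.

```python
from collections import defaultdict

def talk_to_wait_by_trunk(conversations):
    """Return {trunk: total_talk / total_wait} for the given records.

    Each record is a dict with hypothetical keys 'trunk',
    'talk_seconds', and 'wait_seconds'. A trunk with zero total
    wait time maps to None instead of raising a division error.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # trunk -> [talk, wait]
    for conv in conversations:
        totals[conv["trunk"]][0] += conv["talk_seconds"]
        totals[conv["trunk"]][1] += conv["wait_seconds"]
    return {
        trunk: (talk / wait if wait else None)
        for trunk, (talk, wait) in totals.items()
    }

# Hypothetical sample data for illustration only.
sample = [
    {"trunk": "primary", "talk_seconds": 300, "wait_seconds": 50},
    {"trunk": "primary", "talk_seconds": 200, "wait_seconds": 50},
    {"trunk": "secondary", "talk_seconds": 300, "wait_seconds": 30},
]
ratios = talk_to_wait_by_trunk(sample)
print(ratios)
```

If the secondary trunk's ratio is consistently higher for calls that followed the same Architect flow, that points at how wait time is attributed rather than at agent behavior.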
From an operational governance perspective, this discrepancy undermines our ability to accurately assess agent performance and queue efficiency during peak load events. We need to understand whether this is a known limitation of the Performance dashboard’s real-time calculation engine or if there is a specific configuration setting in the SIP trunk routing that influences these metrics.
Has anyone else encountered similar metric inflation during SIP failover events? We are looking for insights on how to reconcile these dashboard figures with actual agent productivity data to ensure our performance reporting remains accurate and actionable for management review.
I ran into a similar metric skew during my recent BYOC load tests. The issue usually stems from how the dashboard aggregates occupancy when calls are in transit between trunk providers.
When the primary SIP trunk hits capacity, Genesys Cloud initiates the failover. During that handoff window, the calls are technically active but might not be assigned to a specific agent queue yet. The Performance dashboard often drops these “in-flight” calls from the occupancy calculation because they lack a definitive queue association for a few seconds.
To verify this, query the analytics API directly instead of relying on the dashboard. The API returns real-time data that might show a higher occupancy than the cached dashboard view. Here is a quick request to pull the queue metrics for a single day; adjust the date range to cover your failover test:
GET /api/v2/analytics/queues/summary?dateFrom=2024-01-01T00:00:00.000Z&dateTo=2024-01-01T23:59:59.999Z
Compare the occupancy field from the API response with the dashboard during a controlled failover test. If the API shows higher numbers, it confirms the dashboard is lagging or filtering out transitional states.
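A minimal sketch of that comparison, assuming you have already sampled occupancy from both sources into lists of (timestamp, percent) pairs. The function name, the sample values, and the 5-point tolerance are illustrative assumptions, not part of any Genesys Cloud SDK.

```python
from datetime import datetime

def occupancy_gaps(api_samples, dashboard_samples, tolerance=5.0):
    """Return (timestamp, api_value, dashboard_value) tuples where the
    two sources differ by more than `tolerance` percentage points.

    Both inputs are lists of (timestamp, occupancy_percent) tuples.
    Only timestamps present in both lists are compared.
    """
    dashboard_by_ts = dict(dashboard_samples)
    gaps = []
    for ts, api_value in api_samples:
        if ts not in dashboard_by_ts:
            continue  # no dashboard reading for this instant
        dash_value = dashboard_by_ts[ts]
        if abs(api_value - dash_value) > tolerance:
            gaps.append((ts, api_value, dash_value))
    return gaps

# Hypothetical readings taken during a controlled failover test.
t0 = datetime(2024, 1, 1, 12, 0, 0)
t1 = datetime(2024, 1, 1, 12, 0, 30)
t2 = datetime(2024, 1, 1, 12, 1, 0)
api = [(t0, 72.0), (t1, 91.0), (t2, 74.0)]
dashboard = [(t0, 71.0), (t1, 78.0), (t2, 73.5)]

gaps = occupancy_gaps(api, dashboard)
for ts, api_value, dash_value in gaps:
    print(f"{ts.isoformat()} api={api_value} dashboard={dash_value}")
```

Only the mid-failover sample exceeds the tolerance here, which is the pattern you would expect if the dashboard is lagging rather than miscounting.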
Also, check your JMeter logs if you are simulating the failover. Sometimes the WebSocket connection drops briefly during the trunk switch, causing the client-side dashboard to miss the update until the next heartbeat. This creates a temporary visual discrepancy rather than a true data integrity issue.
We usually see this resolve within 30-60 seconds after the failover stabilizes. If the discrepancy persists longer, it might be worth opening a support ticket to review the specific queue configuration for failover handling.
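If you want to check that window mechanically, something like the following could work. It is only a sketch: it assumes you log (seconds since failover, API value, dashboard value) samples yourself, and the 60-second limit and 5-point tolerance are assumptions taken from the rough figures above.

```python
def seconds_until_settled(samples, tolerance=5.0):
    """Return the offset (in seconds) of the first sample from which
    every later sample agrees within `tolerance`, or None if the two
    sources never settle.

    `samples` is a time-ordered list of
    (seconds_since_failover, api_value, dashboard_value) tuples.
    """
    settled_at = None
    for offset, api_value, dash_value in samples:
        if abs(api_value - dash_value) <= tolerance:
            if settled_at is None:
                settled_at = offset
        else:
            settled_at = None  # disagreement resets the streak
    return settled_at

def needs_ticket(samples, max_settle_seconds=60, tolerance=5.0):
    """True when the discrepancy outlasts the expected settle window."""
    settled_at = seconds_until_settled(samples, tolerance)
    return settled_at is None or settled_at > max_settle_seconds

# Hypothetical samples: one failover that settles, one that does not.
transient = [(0, 90, 70), (15, 88, 72), (30, 80, 77), (45, 79, 78)]
persistent = [(0, 90, 70), (60, 90, 70), (120, 90, 72)]
print(needs_ticket(transient), needs_ticket(persistent))
```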
That handoff window explanation rings very true! It feels a lot like how Zendesk handled ticket routing when multiple side-by-side apps were fighting for focus. In Zendesk, a ticket might show as “Open” but sit in a grey area if the automation rules conflicted, causing temporary reporting gaps.
Genesys Cloud’s occupancy metric seems to behave similarly during SIP failover. The calls aren’t lost, but they are in a transitional state that the standard dashboard widgets don’t capture well.
To get a clearer picture, try building a custom report using the interaction view instead of the pre-built Performance dashboard. Filter by status = “connected” or “ringing” during the specific failover timestamps. This bypasses the dashboard’s aggregation logic and shows every single interaction state.
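If you export that view, the same filter is easy to apply offline. This is a small sketch, assuming each exported row becomes a dictionary with hypothetical `id`, `state`, and `timestamp` keys; the real export columns may be named differently.

```python
from datetime import datetime

TRANSITIONAL_STATES = {"connected", "ringing"}

def interactions_in_window(interactions, start, end, states=TRANSITIONAL_STATES):
    """Keep interactions whose state is in `states` and whose timestamp
    falls inside [start, end], inclusive on both ends."""
    return [
        item for item in interactions
        if item["state"] in states and start <= item["timestamp"] <= end
    ]

# Hypothetical rows around a failover at 12:00.
rows = [
    {"id": "a", "state": "connected", "timestamp": datetime(2024, 1, 1, 12, 0, 10)},
    {"id": "b", "state": "ringing", "timestamp": datetime(2024, 1, 1, 12, 0, 20)},
    {"id": "c", "state": "disconnected", "timestamp": datetime(2024, 1, 1, 12, 0, 25)},
    {"id": "d", "state": "connected", "timestamp": datetime(2024, 1, 1, 12, 5, 0)},
]
window_start = datetime(2024, 1, 1, 12, 0, 0)
window_end = datetime(2024, 1, 1, 12, 1, 0)

matched = interactions_in_window(rows, window_start, window_end)
print([item["id"] for item in matched])
```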
Also, check if the secondary trunk is configured with identical queue routing rules. If the failover sends calls to a different default queue, the occupancy metrics will definitely skew because you are comparing apples to oranges. It’s a small config tweak, but it makes a huge difference in data accuracy!
This metric skew is a known behavior in the reporting engine, but it can be mitigated by adjusting how we query the underlying data. The standard dashboard widgets often rely on real-time aggregations that drop calls during the SIP handshake transition. For a more accurate view, especially when preparing audit trails or capacity reports, it is often more reliable to bypass the UI, run a Bulk Recording Export to your S3 bucket, and work from the exported metadata.
When the failover occurs, the recording_id is still generated, but the queue_id might be temporarily null or associated with the failover group rather than the primary queue. If you filter by the primary queue ID alone, those “in-flight” calls disappear from the count.
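The effect of that filter is easy to see on a toy dataset. This sketch assumes the exported records are dictionaries with a `queue_id` key that may be None during the transition; the queue IDs and the helper name are made up for illustration.

```python
def count_for_queue(records, primary_queue_id, failover_queue_id=None,
                    include_unassigned=False):
    """Count records attributed to the primary queue, optionally also
    counting the failover queue and records with no queue assigned."""
    accepted = {primary_queue_id}
    if failover_queue_id is not None:
        accepted.add(failover_queue_id)
    count = 0
    for record in records:
        queue_id = record.get("queue_id")
        if queue_id in accepted or (include_unassigned and queue_id is None):
            count += 1
    return count

# Hypothetical export: two settled calls, two in-flight, one unrelated.
records = [
    {"recording_id": "r1", "queue_id": "q-primary"},
    {"recording_id": "r2", "queue_id": "q-primary"},
    {"recording_id": "r3", "queue_id": None},          # queue not yet assigned
    {"recording_id": "r4", "queue_id": "q-failover"},  # parked on failover group
    {"recording_id": "r5", "queue_id": "q-other"},
]

strict = count_for_queue(records, "q-primary")
broad = count_for_queue(records, "q-primary", "q-failover", include_unassigned=True)
print(strict, broad)
```

The difference between the strict and broad counts is the number of calls that a primary-queue-only filter would hide.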
Try adjusting your export payload to include the failover queue ID or use a broader filter that captures all active media sessions regardless of the specific queue assignment during the transition window. Here is an example of how to structure the bulk export request to capture these transitional recordings:
{
  "recordingIds": [],
  "filter": {
    "type": "dateRange",
    "startDate": "2024-05-01T00:00:00.000Z",
    "endDate": "2024-05-31T23:59:59.999Z"
  },
  "includeMetadata": true,
  "exportFormat": "JSON"
}
Once the job completes, you can parse the exported metadata in your S3 bucket. Look for the mediaType and direction fields. Calls that are inbound but lack a final agentId during the failover window are the ones causing the discrepancy. By exporting the raw metadata, you can manually reconcile the count against the SIP trunk logs. This approach gives you a complete chain of custody for all interactions, even those that appear to “drop” in the real-time dashboard due to the aggregation lag. It is more manual, but it leaves you with an auditable record for compliance purposes.
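The reconciliation step could look roughly like this. It is a sketch under several assumptions: the metadata has been loaded into dictionaries, `direction` and `agentId` are present as described, and `callId` and `startTime` are hypothetical keys standing in for whatever identifier and timestamp your export and trunk logs actually share.

```python
from datetime import datetime

def unassigned_inbound(records, window_start, window_end):
    """Inbound records inside the failover window with no agentId."""
    return [
        r for r in records
        if r.get("direction") == "inbound"
        and not r.get("agentId")
        and window_start <= r["startTime"] <= window_end
    ]

def reconcile(records, trunk_call_ids):
    """Compare exported call IDs against the SIP trunk log's call IDs."""
    exported_ids = {r["callId"] for r in records}
    trunk_ids = set(trunk_call_ids)
    return {
        "missing_from_export": sorted(trunk_ids - exported_ids),
        "missing_from_trunk_log": sorted(exported_ids - trunk_ids),
    }

# Hypothetical exported metadata and trunk log entries.
exported = [
    {"callId": "c1", "direction": "inbound", "agentId": "agent-7",
     "startTime": datetime(2024, 5, 1, 9, 0, 5)},
    {"callId": "c2", "direction": "inbound", "agentId": None,
     "startTime": datetime(2024, 5, 1, 9, 0, 20)},
    {"callId": "c3", "direction": "outbound", "agentId": None,
     "startTime": datetime(2024, 5, 1, 9, 0, 30)},
]
trunk_log_ids = ["c1", "c2", "c4"]

in_flight = unassigned_inbound(
    exported, datetime(2024, 5, 1, 9, 0, 0), datetime(2024, 5, 1, 9, 1, 0)
)
report = reconcile(exported, trunk_log_ids)
print([r["callId"] for r in in_flight], report)
```

Anything listed under `missing_from_export` is a call the trunk saw but the export did not, which is the set worth raising with support.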