Implementing Automated Trunk Quality of Service (QoS) Reporting with MOS Score Aggregation

What This Guide Covers

This guide details the architectural implementation of a real-time Mean Opinion Score (MOS) aggregation pipeline for SIP trunks in Genesys Cloud CX. You will build a custom integration that consumes raw SIP Call Detail Records (CDRs) via the Web Messaging API (WMA), calculates codec-weighted MOS scores from RTP statistics (the estimation approach standardized in the ITU-T G.107 E-model; ITU-T P.862 PESQ needs a reference audio signal and so cannot be computed from CDRs alone), and triggers automated remediation workflows when voice quality degrades below defined thresholds. The end result is a self-healing telephony infrastructure that identifies trunk congestion or codec mismatches before they impact agent productivity or customer experience.

Prerequisites, Roles & Licensing

  • Licensing: Genesys Cloud CX 1 or higher. Access to Genesys Cloud Analytics is required for baseline comparison, but the primary data source for this automation is the WMA.
  • Permissions:
    • Telephony > Trunk > View
    • Telephony > Trunk > Edit (required for the automated remediation step)
    • API > Webhook > Edit
    • Integrations > Integration > Edit
  • OAuth Scopes: wma:inbound:read for subscribing to CDR streams.
  • External Dependencies:
    • A middleware environment (Node.js, Python, or AWS Lambda) capable of processing JSON payloads and executing REST API calls.
    • A message queue (e.g., AWS SQS, RabbitMQ, or Kafka) to buffer high-volume CDR ingestion during peak hours.
    • Access to the SIP Trunk Configuration UI for validating initial codec settings.

The Implementation Deep-Dive

1. Architecting the CDR Ingestion Pipeline via Web Messaging API (WMA)

The foundation of automated QoS reporting is the ability to intercept call metadata in near real-time. Genesys Cloud does not push MOS scores directly to standard webhooks in a pre-aggregated format suitable for immediate trunk-level remediation. Instead, you must subscribe to the CallCDR entity within the Web Messaging API. This provides the raw RTP statistics necessary to calculate or validate the MOS score.

Configuring the WMA Subscription

You must create a persistent subscription to the CallCDR entity. The critical configuration here is the filter expression. If you subscribe to all CDRs without filtering, you will overwhelm your middleware with irrelevant data (e.g., internal transfers, skill-based routing updates that do not generate voice traffic).

The Trap: Subscribing to CallCDR without filtering for state == "closed" and voiceQuality != null.
The Consequence: Your middleware receives CDR updates for every state change (initiated, ringing, answered). Most of these events lack RTP statistics. Processing these incomplete records leads to null pointer exceptions in your aggregation logic and significant latency in your queue. Furthermore, processing answered events without waiting for closed results in incomplete MOS calculations, as the final packet loss percentage is only known at call termination.

The Architectural Decision: We filter for state == "closed" and ensure the CDR contains a voiceQuality object. The payload below also requires trunk.id != null so that only trunk-carried calls are delivered. To monitor a single direction, add a direction == "inbound" or direction == "outbound" clause; for comprehensive trunk health, monitor both.

API Endpoint: POST /api/v2/analytics/wmapi/subscriptions

JSON Payload:

{
  "entityId": "CallCDR",
  "filterExpression": "state == 'closed' AND voiceQuality != null AND (trunk.id != null)",
  "subscriptions": [
    {
      "id": "trunk-qos-monitor-sub",
      "description": "Subscription for real-time MOS aggregation on SIP trunks",
      "endpoint": "https://your-middleware-endpoint.com/api/v1/cdr-ingest",
      "authMethod": "basic",
      "credentials": {
        "username": "wma_consumer",
        "password": "secure_password_here"
      }
    }
  ]
}
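
Even with the subscription filter in place, it can be worth re-checking each record on arrival so that a misconfigured filter does not crash the aggregator. The following is a minimal sketch, assuming the CDR arrives as a JSON object carrying the state, voiceQuality, and trunk.id fields named in the filter expression. The function name and the exact payload shape are illustrative assumptions, not confirmed API.

```python
from collections.abc import Mapping
from typing import Any


def is_aggregatable(cdr: Mapping[str, Any]) -> bool:
    """Return True only for closed calls that carry RTP stats and a trunk id.

    Field names (state, voiceQuality, trunk.id) mirror the filter expression;
    the real payload shape is an assumption.
    """
    if cdr.get("state") != "closed":
        return False
    if not isinstance(cdr.get("voiceQuality"), Mapping):
        return False
    trunk = cdr.get("trunk")
    return isinstance(trunk, Mapping) and trunk.get("id") is not None
```

Records that fail this check can be dropped (or counted for monitoring) before they reach the queue.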

Parsing the Voice Quality Object

When the CDR arrives, the voiceQuality object contains the following critical fields:

  • mosValue: The calculated Mean Opinion Score (1.0 - 5.0).
  • packetLoss: Percentage of lost RTP packets.
  • jitter: Maximum jitter in milliseconds.
  • codec: The negotiated codec (e.g., G711, G729, OPUS).

The Trap: Relying solely on mosValue without validating codec.
The Consequence: MOS scores are codec-dependent. A MOS of 3.5 on G711 is poor, but a MOS of 3.5 on G729 might be acceptable due to inherent compression artifacts. More critically, if a call negotiates OPUS but the downstream carrier only supports G711, the MOS may be artificially low due to transcoding latency. By ignoring the codec field, you risk flagging healthy G729 calls as degraded or missing degraded G711 calls that appear “okay” by raw number but suffer from latency.

Implementation Logic:
Your middleware must interpret MOS scores relative to the codec. A simple approach is to apply codec-specific quality bands and alert thresholds instead of a single global cutoff. For example:

  • G711: MOS > 4.0 is Good. Threshold for alert: < 3.5.
  • G729: MOS > 3.5 is Good. Threshold for alert: < 3.0.
  • OPUS: MOS > 4.2 is Good. Threshold for alert: < 3.8.
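
The bands above can be expressed as a small lookup. This is a sketch only: the MARGINAL label for scores between the two cutoffs and the UNKNOWN_CODEC fallback are assumptions added for illustration, and the codec strings are taken to match the values in the codec field.

```python
# Bands copied from the list above: (good_above, alert_below) per codec.
CODEC_BANDS = {
    "G711": (4.0, 3.5),
    "G729": (3.5, 3.0),
    "OPUS": (4.2, 3.8),
}


def classify_mos(mos: float, codec: str) -> str:
    """Map a raw MOS to GOOD, MARGINAL, or ALERT using codec-specific bands."""
    try:
        good_above, alert_below = CODEC_BANDS[codec.upper()]
    except KeyError:
        return "UNKNOWN_CODEC"  # hypothetical fallback; not defined by the guide
    if mos < alert_below:
        return "ALERT"
    if mos > good_above:
        return "GOOD"
    return "MARGINAL"  # assumed label for the gap between the two cutoffs
```

With these bands, a MOS of 3.4 is an ALERT on G711 but only MARGINAL on G729, which is the codec dependence described above.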

2. Aggregating MOS Scores and Detecting Trunk Degradation

Raw CDRs are noisy. A single bad call does not indicate a trunk failure. You must implement a sliding window aggregation algorithm to determine if a trunk is experiencing systemic issues.

The Aggregation Algorithm

Use a moving average over a sliding 5-minute window, recomputed every 60 seconds. This balances responsiveness with noise reduction.

  1. Ingest CDR: Receive CDR from WMA.
  2. Extract Metadata: Get trunk.id, timestamp, mosValue, codec.
  3. Normalize MOS: Apply codec-specific baseline adjustments.
  4. Store in Time-Series Database: Insert into Redis or InfluxDB with key trunk:{trunk_id}:mos and timestamp.
  5. Calculate Window Average: Every 60 seconds, calculate the average MOS for the last 300 seconds for each active trunk.
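
For prototyping, steps 4 and 5 can be sketched with an in-memory window in place of Redis or InfluxDB. The class below is a hypothetical stand-in, not a production store, and it assumes samples for a trunk arrive in roughly increasing timestamp order.

```python
from collections import defaultdict, deque


class MosWindow:
    """Keeps (timestamp, mos) samples per trunk for the last window_seconds."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._samples = defaultdict(deque)  # trunk_id -> deque of (ts, mos)

    def add(self, trunk_id: str, timestamp: float, mos: float) -> None:
        """Append one sample; assumes timestamps arrive in increasing order."""
        self._samples[trunk_id].append((timestamp, mos))

    def values(self, trunk_id: str, now: float) -> list:
        """Drop samples older than the window, then return remaining MOS values."""
        samples = self._samples[trunk_id]
        cutoff = now - self.window_seconds
        while samples and samples[0][0] < cutoff:
            samples.popleft()
        return [mos for _, mos in samples]
```

A scheduler can call values() for each active trunk every 60 seconds and pass the result to the averaging step.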

The Trap: Using a simple average without excluding outliers.
The Consequence: A call with 100% packet loss (MOS 1.0) pulls the simple average down, and the effect grows as call volume drops: one such call among nine healthy calls (MOS 4.5) lowers the mean from 4.5 to 4.15, and a short burst on a quiet trunk can push it below the alert threshold. This causes false-positive alerts. Your aggregation logic must implement a trimmed mean, excluding the top and bottom 5% of MOS values in the window before calculating the average. This ensures that transient network blips do not trigger trunk failover.
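
A short worked example makes the difference concrete. The helper below is an illustrative sketch of a symmetric trimmed mean that always drops at least one sample from each end; the sample window is made up for the demonstration.

```python
def trimmed_mean(values, trim_fraction=0.05):
    """Mean after dropping trim_fraction of samples from each end (at least one)."""
    ordered = sorted(values)
    k = max(1, int(len(ordered) * trim_fraction))
    if len(ordered) <= 2 * k:
        raise ValueError("not enough samples to trim")
    kept = ordered[k:-k]
    return sum(kept) / len(kept)


window = [4.5] * 19 + [1.0]          # 19 healthy calls, one with total packet loss
simple = sum(window) / len(window)   # 4.325
trimmed = trimmed_mean(window)       # 4.5
```

The single outlier costs the simple mean 0.175 MOS points, while the trimmed mean is unaffected.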

Detecting Degradation

Define a Degradation Threshold. For this guide, we use a sustained MOS < 3.5 for G711 trunks over a 5-minute window.

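Code Snippet (Middleware Logic; timeseries_db is a placeholder client):

import time
from collections import Counter

# Alert thresholds per codec, taken from the list in section 1
CODEC_THRESHOLDS = {"G711": 3.5, "G729": 3.0, "OPUS": 3.8}

def calculate_trunk_health(trunk_id, timeseries_db, window_seconds=300, now=None):
    # timeseries_db is any client exposing get_range(key, start, end) that
    # returns a list of {"mos": float, "codec": str} samples (placeholder interface)
    now = time.time() if now is None else now

    # Retrieve all MOS scores for the trunk in the last 5 minutes
    scores = timeseries_db.get_range(f"trunk:{trunk_id}:mos", now - window_seconds, now)

    # Too few calls to make a statistically meaningful decision
    if len(scores) < 10:
        return "INSUFFICIENT_DATA"

    # Sort and trim outliers (remove bottom 5% and top 5%, at least one from each end)
    sorted_scores = sorted(s["mos"] for s in scores)
    trim_count = max(1, int(len(sorted_scores) * 0.05))
    trimmed_scores = sorted_scores[trim_count:-trim_count]

    average_mos = sum(trimmed_scores) / len(trimmed_scores)

    # Use the threshold of the codec that carried the most calls in the window;
    # fall back to the G711 threshold for codecs not listed above
    dominant_codec = Counter(s["codec"] for s in scores).most_common(1)[0][0]
    threshold = CODEC_THRESHOLDS.get(dominant_codec, CODEC_THRESHOLDS["G711"])

    return "DEGRADED" if average_mos < threshold else "HEALTHY"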

3. Automating Remediation via Genesys Cloud REST APIs

When a trunk is flagged as DEGRADED, the system must take action. Manual intervention is too slow for real-time QoS issues. The most effective automated remediation is Trunk Failover.

Configuring Trunk Failover

Genesys Cloud supports Trunk Groups with Failover policies. However, the default behavior is static. To make it dynamic, you will use the REST API to modify the Trunk Group configuration or the Trunk status itself.

Option A: Disable the Trunk (Aggressive)
If a specific SIP trunk is failing, disable it. Traffic will automatically route to the next available trunk in the Trunk Group.

Option B: Adjust Trunk Weight (Gradual)
If you are using Load Balancing within a Trunk Group, reduce the weight of the degraded trunk. This gradually shifts traffic away from the bad trunk without a hard cut-over, which can cause call drops if not handled carefully.

The Trap: Disabling a trunk without checking for active calls.
The Consequence: If you disable a trunk via API while calls are active on it, those calls may be dropped, depending on the carrier’s SIP implementation (for example, how the carrier reacts once the trunk’s SIP REGISTER is no longer refreshed). Even when disabling the trunk only stops new calls and existing calls stay up, those calls may continue to degrade if the underlying issue is carrier-side congestion.

The Architectural Decision: We implement a Two-Phase Remediation:

  1. Phase 1: Reduce the trunk’s weight to 0 in the Trunk Group. This prevents new calls from being routed to it. Existing calls are allowed to complete.
  2. Phase 2: If degradation persists for 15 minutes, disable the trunk entirely.
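
The two phases reduce to a small decision function. This is a sketch under the assumption that the middleware tracks how long each trunk has been continuously degraded and whether Phase 1 has already been applied; the action names are hypothetical labels for the API calls shown below.

```python
def next_action(degraded_for_seconds: float, weight_is_zero: bool,
                escalate_after_seconds: float = 900.0) -> str:
    """Pick the remediation step for a trunk currently flagged DEGRADED.

    degraded_for_seconds: how long the trunk has been continuously degraded.
    weight_is_zero: whether Phase 1 has already been applied.
    """
    if not weight_is_zero:
        return "SET_WEIGHT_ZERO"      # Phase 1: drain, let active calls finish
    if degraded_for_seconds >= escalate_after_seconds:
        return "DISABLE_TRUNK"        # Phase 2: still degraded after 15 minutes
    return "WAIT"
```

Keeping the decision separate from the API call makes the escalation timing easy to unit test without touching a live trunk.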

API Endpoint: PATCH /api/v2/telephony/providers/edge/trunkgroups/{trunkGroupId}

JSON Payload (Reducing Weight):

{
  "trunks": [
    {
      "id": "degraded-trunk-id",
      "weight": 0,
      "maxConcurrentCalls": 0
    },
    {
      "id": "healthy-trunk-id",
      "weight": 100,
      "maxConcurrentCalls": 100
    }
  ]
}

API Endpoint: PATCH /api/v2/telephony/providers/edge/trunks/{trunkId}

JSON Payload (Disabling Trunk):

{
  "enabled": false
}

Implementing the Remediation Workflow

Your middleware must maintain state for each trunk to avoid flapping (rapidly enabling/disabling).

  1. State Machine: Each trunk has states: HEALTHY, DEGRADED, REMEDIATING, DISABLED.
  2. Hysteresis: Do not re-enable a trunk immediately after it is disabled. Wait for a Recovery Window (e.g., 30 minutes) and verify that the MOS on a test call (or historical data from a parallel trunk) is healthy.

The Trap: Flapping Remediation.
The Consequence: If the network issue is intermittent (e.g., a burst of packet loss), your script may disable the trunk, wait 5 minutes, re-enable it, and then disable it again. This constant state change can overwhelm the Genesys Cloud API rate limits and cause instability in the routing table.

Solution: Implement a Cooldown Period. Once a trunk is disabled, it remains disabled for a minimum of 30 minutes regardless of subsequent MOS readings. During this time, alerts are suppressed for that trunk to prevent alert fatigue.
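
The cooldown can be sketched as a small tracker that the state machine consults before re-enabling a trunk or sending an alert. The class and method names are illustrative, and the caller is assumed to supply the current time and a healthy/unhealthy verdict from its own quality check.

```python
class TrunkCooldown:
    """Tracks when each trunk was disabled and blocks re-enabling too early."""

    def __init__(self, cooldown_seconds: float = 1800.0) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._disabled_at = {}  # trunk_id -> timestamp of the disable action

    def mark_disabled(self, trunk_id: str, now: float) -> None:
        self._disabled_at[trunk_id] = now

    def in_cooldown(self, trunk_id: str, now: float) -> bool:
        """True while the trunk is inside its cooldown; suppress alerts here."""
        disabled_at = self._disabled_at.get(trunk_id)
        return disabled_at is not None and now - disabled_at < self.cooldown_seconds

    def may_reenable(self, trunk_id: str, now: float, recent_mos_healthy: bool) -> bool:
        """Re-enable only after the cooldown has elapsed AND quality looks healthy."""
        if trunk_id not in self._disabled_at or self.in_cooldown(trunk_id, now):
            return False
        return recent_mos_healthy
```

Because may_reenable ignores MOS readings during the cooldown, a brief recovery five minutes after the disable cannot trigger a re-enable.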

4. Alerting and Reporting Integration

Automated remediation is not enough. You must notify the operations team and log the event for post-mortem analysis.

Sending Alerts via Genesys Cloud Messaging

Use the Messaging API to send an alert to a dedicated “Telephony Ops” queue or a Slack/Teams channel via webhook.

API Endpoint: POST /api/v2/messaging/conversations

JSON Payload:

{
  "type": "email",
  "to": ["ops-team@company.com"],
  "subject": "URGENT: Trunk Quality Degradation Detected - {Trunk Name}",
  "body": "Trunk {Trunk Name} (ID: {Trunk ID}) has experienced sustained MOS degradation (Avg MOS: {Avg MOS}) over the last 5 minutes. Automated failover has been initiated. Weight reduced to 0."
}

Logging to Genesys Cloud Analytics

To enable long-term trend analysis, log the QoS event to a custom Data Connector or API Data Source. This allows you to create dashboards in Genesys Cloud Analytics that correlate MOS degradation with specific carriers, times of day, or codec types.

The Trap: Logging every CDR to Analytics.
The Consequence: Genesys Cloud Analytics has ingestion limits. Logging every single CDR will quickly exhaust your storage quota and increase costs. Only log anomalies (calls with MOS < 3.0) and aggregated summaries (hourly MOS averages per trunk).
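
One way to apply this rule is to reduce each batch of records to anomalies plus hourly per-trunk summaries before anything is logged. The sketch below assumes each record is a dict with trunk_id, timestamp (epoch seconds), and mos keys; that shape and the output field names are illustrative.

```python
from collections import defaultdict


def select_for_logging(records, anomaly_below=3.0):
    """Split records into per-call anomalies and hourly per-trunk averages."""
    anomalies = [r for r in records if r["mos"] < anomaly_below]

    # Group MOS values by (trunk, start of the hour containing the call)
    buckets = defaultdict(list)
    for r in records:
        hour_start = int(r["timestamp"] // 3600) * 3600
        buckets[(r["trunk_id"], hour_start)].append(r["mos"])

    summaries = [
        {"trunk_id": t, "hour_start": h, "avg_mos": sum(v) / len(v), "calls": len(v)}
        for (t, h), v in sorted(buckets.items())
    ]
    return anomalies, summaries
```

Only the two returned lists are written out, so log volume scales with the number of trunks and bad calls instead of total call volume.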

Validation, Edge Cases & Troubleshooting

Edge Case 1: Codec Negotiation Failures Masking as QoS Issues

The Failure Condition: MOS scores drop significantly, but packet loss and jitter are normal.
The Root Cause: The Genesys Cloud edge and the carrier are negotiating a sub-optimal codec (e.g., G722 instead of G711) due to configuration mismatches in the SIP Trunk profile. The G722 codec may have higher overhead or latency on certain carrier networks.
The Solution: Check the codec field in the CDR. If the dominant codec is not G711 or OPUS, investigate the SIP Trunk Profile in Genesys Cloud and ensure the Codec Priority is set correctly. Disable G729 if it is not strictly required for bandwidth savings, and disable G722 if the carrier does not carry it cleanly (G722 runs at the same 64 kbit/s as G711, so it offers no bandwidth saving).
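
This failure condition can be flagged automatically so that it is routed to a configuration review and does not trigger trunk failover. The sketch below uses the voiceQuality field names listed earlier; the "normal" bounds for packet loss and jitter are placeholder assumptions that need tuning for your network.

```python
def looks_like_codec_issue(voice_quality, mos_alert=3.5,
                           max_normal_loss_pct=1.0, max_normal_jitter_ms=30.0):
    """True when MOS is low but packet loss and jitter are within normal bounds.

    The bounds are placeholder assumptions, not values from Genesys Cloud.
    """
    return (
        voice_quality["mosValue"] < mos_alert
        and voice_quality["packetLoss"] <= max_normal_loss_pct
        and voice_quality["jitter"] <= max_normal_jitter_ms
    )
```

A call that matches this pattern points toward codec negotiation, while a low MOS with high loss or jitter points toward the network path.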

Edge Case 2: Asymmetric Routing Causing False Low MOS

The Failure Condition: MOS is low on inbound calls but high on outbound calls (or vice versa) for the same trunk.
The Root Cause: Asymmetric routing occurs when the SIP INVITE takes one path but the RTP media takes another, or when media in one direction follows a different path than media in the other. If the return path has higher latency or packet loss, the MOS calculated on the Genesys Cloud edge (which measures the path to the edge) may not reflect the end-to-end quality. Because Genesys Cloud CDRs typically report quality as observed by the edge, statistics reported by the carrier for the same call can differ.
The Solution: Verify that RTP Symmetry is enforced in your firewall and carrier configuration. Ensure that the source IP of the SIP signaling matches the source IP of the RTP media. In Genesys Cloud, check the Trunk Configuration for RTP Symmetry settings.

Edge Case 3: WMA Subscription Lag During Peak Hours

The Failure Condition: Alerts are delayed by 10-15 minutes.
The Root Cause: The WMA subscription queue is backing up because the middleware cannot process CDRs fast enough during peak call volumes.
The Solution: Scale the middleware consumers. Use a message queue with batch processing. Instead of processing each CDR individually, batch 100 CDRs and process them in parallel. Ensure your database writes are asynchronous. Monitor the WMA Subscription Health in the Genesys Cloud Admin console to detect backpressure.
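
The batching step can be sketched with a small generator that groups queued CDRs into fixed-size lists for a worker pool. The function name is hypothetical, and the batch size of 100 follows the figure above.

```python
from itertools import islice


def chunk_cdrs(records, batch_size=100):
    """Yield lists of up to batch_size records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch can be handed to a separate consumer, so a backlog of 250 records becomes three units of work of 100, 100, and 50 records.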
