Implementing Prometheus and Grafana Monitoring for BYOC Premise SBC Health Metrics

What This Guide Covers

You are building a production observability stack that collects real-time health metrics from your on-premise Session Border Controller (SBC), the network boundary device that bridges your corporate telephony infrastructure to Genesys Cloud BYOC (Bring Your Own Carrier) trunks, and visualizes them in Grafana dashboards with automated alerting. When complete, your operations team will have live visibility into SBC registration status, active call counts, SIP trunk utilization, packet loss, per-trunk MOS (Mean Opinion Score), and CPU/memory on the SBC appliance itself. Alerts will reach PagerDuty before service degradation impacts agents, rather than after customers start reporting audio quality issues.


Prerequisites, Roles & Licensing

  • Genesys Cloud: BYOC Premise license with an on-premise Edge server or third-party SBC (AudioCodes, Ribbon/GENBAND, Oracle ACME Packet, Cisco CUBE).
  • Infrastructure:
    • Prometheus 2.x (self-hosted or managed via Grafana Cloud)
    • Grafana 10.x
    • SBC SNMP v2c or v3 enabled, or SBC REST API access for metric export
    • A Linux metrics collector host (t3.small) with network access to the SBC management interface

The Implementation Deep-Dive

1. SBC Metric Collection Architecture

[AudioCodes SBC] ──SNMP──▶ [SNMP Exporter] ──HTTP──▶ [Prometheus]
[Ribbon SBC]     ──REST──▶ [Custom Exporter]            │
[ACME Packet]    ──SNMP──▶ [SNMP Exporter]              │
                                                          ▼
[Genesys Cloud API] ──▶ [Custom Exporter] ──HTTP──▶ [Prometheus]
                                                          │
                                                          ▼
                                                     [Grafana]
                                                     [PagerDuty]

Key SBC metrics to collect:

Metric                    | Source                | Alert Threshold
Active calls per trunk    | SNMP / REST           | > 90% trunk capacity
SIP registration status   | SNMP                  | 0 = CRITICAL
Packet loss %             | SNMP (RTP stats)      | > 1% = WARNING, > 3% = CRITICAL
Jitter (ms)               | SNMP (RTP stats)      | > 30 ms = WARNING
MOS score                 | REST API (if available) | < 3.5 = WARNING, < 3.0 = CRITICAL
SBC CPU utilization       | SNMP                  | > 80% = WARNING
SBC memory utilization    | SNMP                  | > 85% = WARNING
Trunk group call attempts | SNMP                  | Baseline + 3σ = ANOMALY
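
The "Baseline + 3σ" threshold in the last row can be read as: flag the current call-attempt rate when it sits more than three standard deviations away from the mean of a recent window. The sketch below shows that check in plain Python. The function name, the sample window, and the numbers are illustrative assumptions, not part of any SBC or Prometheus API.

```python
from statistics import mean, pstdev

def is_anomalous(baseline, current, sigmas=3.0):
    """Return True when `current` lies more than `sigmas` standard
    deviations from the mean of `baseline` (hypothetical helper)."""
    if len(baseline) < 2:
        return False  # not enough history to estimate the spread
    mu = mean(baseline)
    sd = pstdev(baseline)
    if sd == 0:
        return current != mu  # flat baseline: any change is a deviation
    return abs(current - mu) > sigmas * sd

# Call attempts per minute over the last ten samples (made-up numbers)
history = [118, 122, 120, 119, 121, 123, 117, 120, 122, 118]
print(is_anomalous(history, 121))  # close to the baseline mean of 120
print(is_anomalous(history, 160))  # far outside the usual spread
```

In Prometheus itself, a comparable comparison can be built from avg_over_time and stddev_over_time over a range vector; the Python form is only meant to make the arithmetic explicit.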

2. SNMP Exporter Configuration for AudioCodes SBC

# /etc/prometheus/snmp_audiocodes.yml
modules:
  audiocodes_sbc:
    walk:
      # Active calls per trunk group
      - 1.3.6.1.4.1.5003.9.10.10.1.2     # acSBCTrunkGroupStatCurrentCallsNum
      # SIP registration status
      - 1.3.6.1.4.1.5003.9.10.10.1.3     # acSBCTrunkGroupStatStatus
      # Packet loss
      - 1.3.6.1.4.1.5003.9.10.10.2.1.7   # acSBCCallMediaIPGroupRTPLossRate
      # Jitter
      - 1.3.6.1.4.1.5003.9.10.10.2.1.9   # acSBCCallMediaIPGroupRTPJitter
      # CPU utilization
      - 1.3.6.1.4.1.5003.9.10.10.1.28    # acSBCCPUUtilization
      
    metrics:
      - name: sbc_trunk_active_calls
        oid: 1.3.6.1.4.1.5003.9.10.10.1.2
        type: gauge
        help: "Current number of active calls on trunk group"
        indexes:
          - labelname: trunk_group
            type: gauge
            
      - name: sbc_trunk_registration_status
        oid: 1.3.6.1.4.1.5003.9.10.10.1.3
        type: gauge
        help: "SIP trunk registration status (1=registered, 0=unregistered)"
        indexes:                  # gives alerts a trunk_group label to reference
          - labelname: trunk_group
            type: gauge
        
      - name: sbc_rtp_packet_loss_rate
        oid: 1.3.6.1.4.1.5003.9.10.10.2.1.7
        type: gauge
        help: "RTP packet loss rate percentage"
        
      - name: sbc_cpu_utilization_percent
        oid: 1.3.6.1.4.1.5003.9.10.10.1.28
        type: gauge
        help: "SBC CPU utilization percentage"

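    # Connection settings. The community string belongs under `auth:` in the
    # per-module layout; snmp_exporter 0.23 and later move credentials into a
    # separate top-level `auths:` section instead.
    version: 2
    auth:
      community: your-snmp-community-string
    timeout: 10s
    retries: 3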

3. Custom Genesys Cloud Edge / Trunk Metrics Exporter

Supplement SNMP with Genesys Cloud API data for end-to-end correlated visibility:

#!/usr/bin/env python3
"""
genesys_byoc_exporter.py - Prometheus exporter for BYOC trunk health
Runs on port 9091 as a Prometheus target
"""

import os
import time

import requests
from prometheus_client import Gauge, start_http_server

GENESYS_API = "https://api.mypurecloud.com"
POLL_INTERVAL = 30  # seconds

# Define Prometheus gauges
trunk_active_calls    = Gauge('genesys_trunk_active_calls',    'Active calls on trunk', ['trunk_id', 'trunk_name'])
trunk_status          = Gauge('genesys_trunk_status',          'Trunk status (1=active)', ['trunk_id', 'trunk_name'])
edge_status           = Gauge('genesys_edge_status',           'Edge status (1=online)',  ['edge_id', 'edge_name'])
edge_calls_in_progress = Gauge('genesys_edge_calls_in_progress', '1 if edge is accepting calls, 0 if draining', ['edge_id', 'edge_name'])

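def get_token() -> str:
    """Request an OAuth token using the client-credentials grant."""
    resp = requests.post(
        "https://login.mypurecloud.com/oauth/token",
        data={"grant_type": "client_credentials"},
        auth=(os.environ["GC_CLIENT_ID"], os.environ["GC_CLIENT_SECRET"]),
        timeout=10,  # do not let a slow login endpoint hang the exporter
    )
    resp.raise_for_status()  # surface bad credentials or outages immediately
    return resp.json()["access_token"]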

def collect_metrics(token: str):
    headers = {"Authorization": f"Bearer {token}"}
    
    # --- BYOC Trunks ---
    trunks = requests.get(f"{GENESYS_API}/api/v2/telephony/providers/edges/trunks",
                          headers=headers, params={"pageSize": 100}, timeout=10).json()
    
    for trunk in trunks.get("entities", []):
        tid  = trunk["id"]
        name = trunk.get("name", tid)
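        # trunkType describes the kind of trunk, not its health. inService is
        # assumed here to be the availability flag; verify it against the
        # Trunk schema returned for your org.
        is_active = 1 if trunk.get("inService") else 0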
        
        trunk_status.labels(trunk_id=tid, trunk_name=name).set(is_active)
        # Active calls pulled from SNMP; here we just set registration status
    
    # --- Edge Servers ---
    edges = requests.get(f"{GENESYS_API}/api/v2/telephony/providers/edges",
                         headers=headers, params={"pageSize": 100}, timeout=10).json()
    
    for edge in edges.get("entities", []):
        eid  = edge["id"]
        name = edge.get("name", eid)
        online = 1 if edge.get("onlineStatus") == "ONLINE" else 0
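        # callDrainingState is treated as a string enum where "NONE" means the
        # edge is not draining (assumption; check the Edge schema for your org)
        draining = edge.get("callDrainingState", "NONE") != "NONE"
        
        edge_status.labels(edge_id=eid, edge_name=name).set(online)
        # 1 = accepting calls, 0 = draining (a flag, not a call count)
        edge_calls_in_progress.labels(edge_id=eid, edge_name=name).set(0 if draining else 1)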

def main():
    start_http_server(9091)
    print("Genesys BYOC Exporter listening on :9091")
    
    token = get_token()
    token_refresh_at = time.time() + 1700  # Refresh before 30-min expiry
    
    while True:
        if time.time() > token_refresh_at:
            token = get_token()
            token_refresh_at = time.time() + 1700
        
        collect_metrics(token)
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    main()
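
The exporter above requests a single page of 100 trunks and edges, so larger deployments could silently miss entities. One way to handle that is a small pagination loop. This is a sketch only: fetch_all_entities and fetch_page are hypothetical names, and it assumes the list response carries a pageCount field next to entities.

```python
def fetch_all_entities(fetch_page, page_size=100):
    """Collect entities from every page of a paginated list endpoint.

    `fetch_page(page_number, page_size)` is a caller-supplied function that
    returns the decoded JSON body for one page (hypothetical helper).
    """
    entities = []
    page = 1
    while True:
        body = fetch_page(page, page_size)
        entities.extend(body.get("entities", []))
        # Assumption: the response reports the total number of pages.
        if page >= body.get("pageCount", 1):
            return entities
        page += 1

# Stand-in for the HTTP call so the sketch runs without network access
def fake_fetch(page_number, page_size):
    pages = {1: ["t1", "t2"], 2: ["t3"]}
    return {"entities": [{"id": i} for i in pages[page_number]], "pageCount": 2}

print([e["id"] for e in fetch_all_entities(fake_fetch)])
```

In the exporter, fetch_page would wrap the existing requests.get call and pass the page number and size as query parameters.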

4. Prometheus Scrape Configuration

# prometheus.yml - append to scrape_configs
scrape_configs:
  - job_name: 'sbc_snmp_audiocodes'
    static_configs:
      - targets:
          - '192.168.10.20'   # Primary SBC management IP
          - '192.168.10.21'   # Secondary SBC (HA pair)
    metrics_path: /snmp
    params:
      module: [audiocodes_sbc]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9116   # SNMP Exporter address

  - job_name: 'genesys_byoc'
    static_configs:
      - targets: ['localhost:9091']
    scrape_interval: 30s
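
The relabel_configs in the SNMP job are easy to misread: Prometheus never contacts the SBC directly. It moves each SBC address into the target query parameter and sends the scrape to the SNMP Exporter. The sketch below builds the equivalent URL, which is useful for testing the exporter by hand with curl or a browser. The helper name is illustrative, and parameter order may differ from what Prometheus sends.

```python
from urllib.parse import urlencode

def snmp_probe_url(target, module="audiocodes_sbc", exporter="localhost:9116"):
    """Build the URL Prometheus effectively requests for one SBC target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/snmp?{query}"

for sbc in ("192.168.10.20", "192.168.10.21"):
    print(snmp_probe_url(sbc))
```

If fetching that URL manually returns no sbc_* metrics, the problem lies between the exporter and the SBC (community string, OIDs, firewall) rather than in the Prometheus scrape job.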

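5. Alert Rules

The rules below are written in the Prometheus alerting-rule format: load them through rule_files in prometheus.yml and route notifications to PagerDuty through Alertmanager. Grafana can display their state, but rules provisioned natively in Grafana use a different file schema.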

# grafana/alerts/sbc_health.yml
groups:
  - name: SBC Health
    rules:
      - alert: SBCTrunkUnregistered
        expr: sbc_trunk_registration_status == 0
        for: 1m
        labels:
          severity: critical
          team: telephony
        annotations:
          summary: "SBC Trunk {{ $labels.trunk_group }} is UNREGISTERED"
          description: "Trunk has been down for >1 minute. Calls may be failing."
          runbook: "https://wiki.internal/runbooks/sbc-trunk-recovery"
      
      - alert: SBCHighPacketLoss
        expr: sbc_rtp_packet_loss_rate > 3
        for: 5m
        labels:
          severity: critical   # matches the "> 3% = CRITICAL" threshold in section 1
        annotations:
          summary: "SBC packet loss {{ $value }}% on {{ $labels.instance }}"
      
      - alert: GenesysEdgeOffline
        expr: genesys_edge_status == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Genesys Edge {{ $labels.edge_name }} is OFFLINE"

Validation, Edge Cases & Troubleshooting

Edge Case 1: SNMP Walk Returns No Data for Some OIDs

Your SBC firmware version may expose trunk statistics under a different OID branch than the one in the exporter config.
Solution: Run snmpwalk -v2c -c your-community 192.168.10.20 1.3.6.1.4.1.5003 to discover the OID tree your firmware actually exposes. AudioCodes MIBs can be downloaded from the AudioCodes support portal and loaded into a MIB browser such as iReasoning, or into the Net-SNMP tools, to look up OIDs by name.

Edge Case 2: Exporter Token Expires During Collection

The Python exporter’s 30-minute OAuth token expires mid-collection cycle, so Genesys Cloud API calls start returning 401 errors and the exported metrics stop updating.
Solution: The exporter already handles this with token_refresh_at logic that refreshes at 28 minutes. If the Genesys Cloud token endpoint is temporarily slow, add a try/except around get_token() with a 60-second retry. Do not let a token refresh failure crash the exporter - continue serving stale metrics with a genesys_exporter_token_healthy gauge set to 0.
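
The retry-and-keep-serving behaviour described above could look roughly like this. It is a sketch under stated assumptions: refresh_token is a hypothetical helper, fetch_token stands for the exporter's get_token, and set_healthy stands for the .set method of a genesys_exporter_token_healthy Gauge. Both are passed in as callables so the logic runs without network access.

```python
import time

def refresh_token(fetch_token, set_healthy, current_token,
                  retries=3, delay=60, sleep=time.sleep):
    """Try to obtain a new token; fall back to the old one if all attempts fail.

    fetch_token: callable returning a new token string (e.g. get_token)
    set_healthy: callable taking 0 or 1 (e.g. a Gauge's .set method)
    """
    for attempt in range(retries):
        try:
            token = fetch_token()
            set_healthy(1)
            return token
        except Exception:
            set_healthy(0)  # expose the failure as a metric
            if attempt < retries - 1:
                sleep(delay)  # wait before the next attempt
    return current_token  # the exporter keeps running with the old token

# Demonstration with a token endpoint that is always down
def failing_fetch():
    raise RuntimeError("token endpoint unavailable")

health = []
token = refresh_token(failing_fetch, health.append, "old-token",
                      retries=2, delay=0, sleep=lambda s: None)
print(token, health)
```

An alert on genesys_exporter_token_healthy == 0 then separates "the exporter cannot authenticate" from "the trunks are actually down".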

Edge Case 3: Grafana Shows Gaps When SBC Has Maintenance Window

Planned SBC maintenance leaves gaps in the Grafana panels and fires alerts for downtime that was expected, so on-call engineers get woken up for nothing.
Solution: Use Grafana’s Silence feature (or Alertmanager’s silence API) to suppress alerts during maintenance windows. Automate silence creation via the Grafana API from your change management system: when a maintenance ticket is opened, automatically create a 2-hour silence for job="sbc_snmp_audiocodes".
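
A change-management hook could create the silence along these lines. This is a minimal sketch that targets the Alertmanager v2 silences endpoint. The Alertmanager address, the created_by value, and the function names are assumptions to adapt, and post_silence is defined but not called here.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_silence(job, hours=2, created_by="change-management",
                  comment="Planned SBC maintenance", now=None):
    """Build a silence payload matching every alert with the given job label."""
    start = now if now is not None else datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [{"name": "job", "value": job, "isRegex": False}],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }

def post_silence(payload, base_url="http://localhost:9093"):
    """POST the payload to Alertmanager (base_url is a placeholder)."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/silences",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

silence = build_silence("sbc_snmp_audiocodes")
print(json.dumps(silence, indent=2))
```

Matching on the job label silences every SBC alert at once; for maintenance on a single unit of the HA pair, a matcher on instance would be narrower.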
