Designing Robust QA Evaluation Workflows using AI-Driven Sentiment Analysis

StarAdmin · December 5, 2025, 9:00am

What This Guide Covers

You are augmenting your Genesys Cloud Quality Management (QM) evaluation workflow with an AI-driven sentiment analysis pipeline. When complete, your QA team will no longer randomly select interactions for manual evaluation. Instead, 100% of interactions will be automatically scored for sentiment trajectory (improving vs. deteriorating), escalation risk, and empathy gaps using an NLP model. The top failure candidates are surfaced automatically in the QM queue with pre-filled AI insights, allowing evaluators to focus entirely on coaching conversations rather than triage.

Prerequisites, Roles & Licensing

Genesys Cloud: CX 2 or 3 with Quality Management.
Permissions required:
- Analytics > Conversation Detail > View
- Quality > Evaluation > Add, Edit
- Quality > Evaluation Form > View
- Recording > Recording > View (to attach recordings to evaluations)
Infrastructure:
- An NLP backend (AWS Comprehend, Google Natural Language API, or a local model).
- A daily ETL pipeline to process completed interactions and create flagged evaluation tasks.

The Implementation Deep-Dive

1. The Random Sampling Problem

The most fundamental flaw in traditional contact center QA is random selection. If 1,000 calls occur per day and your team can manually evaluate 50, you review 5%. Of those 5%:

~40% will be perfectly normal interactions that need no coaching.
~10% will be the genuinely problematic calls your team is trying to find.

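If only ~10% of the sample is genuinely problematic, roughly 5 of the 50 reviewed calls are the ones your team needed to find, and most evaluation time goes to interactions with little or no coaching value. AI-driven triage inverts this.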
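The arithmetic above can be made concrete with a short sketch. The call volume, review capacity, and ~10% problem rate come from the figures above; the 60% triage precision is a purely illustrative assumption, not a measured result.

```python
def problem_calls_found(reviewed: int, hit_rate: float) -> int:
    """Expected number of genuinely problematic calls among those reviewed."""
    return round(reviewed * hit_rate)

DAILY_CALLS = 1000
REVIEW_CAPACITY = 50

# Random sampling: the reviewed set mirrors the overall ~10% problem rate.
random_hits = problem_calls_found(REVIEW_CAPACITY, 0.10)

# AI triage: ASSUMPTION (hypothetical) that 60% of flagged calls are real problems.
triage_hits = problem_calls_found(REVIEW_CAPACITY, 0.60)

print(f"Coverage: {REVIEW_CAPACITY / DAILY_CALLS:.0%}")
print(f"Random sampling surfaces ~{random_hits} problem calls per day")
print(f"Triage at 60% precision surfaces ~{triage_hits} problem calls per day")
```

The review capacity is unchanged in both cases; only the share of reviewed calls worth coaching on differs.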

2. Defining the Sentiment Trajectory Model

Rather than a simple “positive/negative” label, use a Sentiment Trajectory that captures how the interaction evolved over time.

from dataclasses import dataclass
from typing import Literal

@dataclass
class SentimentWindow:
    utterance_index: int
    utterance_text: str
    speaker: Literal["AGENT", "CUSTOMER"]
    # Amazon Comprehend can also return "MIXED", so allow it here
    sentiment: Literal["POSITIVE", "NEUTRAL", "NEGATIVE", "MIXED"]
    confidence: float

@dataclass
class SentimentTrajectory:
    conversation_id: str
    windows: list[SentimentWindow]
    
    @property
    def initial_sentiment(self) -> str:
        return self.windows[0].sentiment if self.windows else "NEUTRAL"
    
    @property
    def final_sentiment(self) -> str:
        return self.windows[-1].sentiment if self.windows else "NEUTRAL"
    
    @property
    def trajectory_type(self) -> str:
        """Classifies the overall sentiment arc."""
        initial = self.initial_sentiment
        final = self.final_sentiment
        
        if initial == "NEGATIVE" and final == "POSITIVE":
            return "RECOVERY"       # Agent turned it around - good example
        elif initial in ("POSITIVE", "NEUTRAL") and final == "NEGATIVE":
            return "DETERIORATION"  # Agent may have caused frustration - flag for review
        elif initial == "NEGATIVE" and final == "NEGATIVE":
            return "SUSTAINED_NEGATIVE"  # Persistent issue - flag
        elif final == "POSITIVE":
            return "POSITIVE"       # Good interaction
        else:
            return "NEUTRAL"

Why trajectory matters: A call that starts negative and ends positive (“RECOVERY”) is a coaching success story that should be shared with the team. A call that starts neutral and ends negative (“DETERIORATION”) is where the agent contributed to the problem.
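To see how the classification behaves, here is a minimal standalone sketch that mirrors the trajectory_type rules on plain lists of labels. The helper name classify_arc and the sample sequences are illustrative, not part of the model above.

```python
def classify_arc(labels: list[str]) -> str:
    """Mirrors the trajectory_type rules using only the first and last labels."""
    if not labels:
        return "NEUTRAL"
    initial, final = labels[0], labels[-1]
    if initial == "NEGATIVE" and final == "POSITIVE":
        return "RECOVERY"
    if initial in ("POSITIVE", "NEUTRAL") and final == "NEGATIVE":
        return "DETERIORATION"
    if initial == "NEGATIVE" and final == "NEGATIVE":
        return "SUSTAINED_NEGATIVE"
    if final == "POSITIVE":
        return "POSITIVE"
    return "NEUTRAL"

# Hypothetical label sequences, one per arc
samples = [
    ["NEGATIVE", "NEUTRAL", "POSITIVE"],
    ["NEUTRAL", "NEUTRAL", "NEGATIVE"],
    ["NEGATIVE", "POSITIVE", "NEGATIVE"],
    ["POSITIVE", "NEGATIVE", "POSITIVE"],
]
for labels in samples:
    print(" -> ".join(labels), "=", classify_arc(labels))
```

Note that the third sample is classified as SUSTAINED_NEGATIVE even though the middle of the call was positive: the rules look only at the endpoints, so mid-call swings are invisible to them.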

3. Building the Sentiment Analysis Pipeline

import boto3
import requests
from datetime import datetime, timedelta

COMPREHEND = boto3.client('comprehend', region_name='us-east-1')
GENESYS_API = "https://api.mypurecloud.com"

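def analyze_conversation_sentiment(conversation: dict) -> SentimentTrajectory:
    """
    Analyzes the sentiment of each utterance in a transcript.
    Expects a conversation dict shaped like the Analytics Detail API response.
    Assumes each segment carries a "transcript" string and a "segmentStart"
    ISO-8601 timestamp; enrich the payload upstream if yours does not.
    """
    utterances = []
    
    # Extract utterances from the transcript
    for participant in conversation.get("participants", []):
        role = "AGENT" if participant.get("purpose") == "agent" else "CUSTOMER"
        
        for segment in participant.get("segments", []):
            text = segment.get("transcript", "")
            if not text or len(text.strip()) < 10:
                continue
            utterances.append((segment.get("segmentStart", ""), role, text))
    
    # Order utterances chronologically. Iterating participant by participant
    # would otherwise put all of one speaker's turns before the other's,
    # which makes the "first" and "last" windows meaningless.
    utterances.sort(key=lambda u: u[0])
    
    windows = []
    for _, role, text in utterances:
        # Analyze with Amazon Comprehend
        result = COMPREHEND.detect_sentiment(Text=text, LanguageCode='en')
        
        windows.append(SentimentWindow(
            utterance_index=len(windows),
            utterance_text=text[:200],  # Truncate for storage
            speaker=role,
            sentiment=result['Sentiment'],
            confidence=max(result['SentimentScore'].values())
        ))
    
    return SentimentTrajectory(
        conversation_id=conversation['conversationId'],
        windows=windows
    )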

def score_interaction(trajectory: SentimentTrajectory) -> dict:
    """
    Converts a sentiment trajectory into a QA risk score.
    Returns a dict suitable for attaching to a Genesys QM evaluation.
    """
    trajectory_type = trajectory.trajectory_type
    
    # Base risk scores
    risk_score_map = {
        "DETERIORATION": 85,
        "SUSTAINED_NEGATIVE": 90,
        "NEUTRAL": 20,
        "POSITIVE": 10,
        "RECOVERY": 15
    }
    
    risk_score = risk_score_map.get(trajectory_type, 30)
    
    # Amplify score if the agent has three or more negative utterances
    agent_negative_count = sum(
        1 for w in trajectory.windows
        if w.speaker == "AGENT" and w.sentiment == "NEGATIVE"
    )
    if agent_negative_count >= 3:
        risk_score = min(100, risk_score + 15)
    
    return {
        "risk_score": risk_score,
        "trajectory_type": trajectory_type,
        "initial_sentiment": trajectory.initial_sentiment,
        "final_sentiment": trajectory.final_sentiment,
        "utterance_count": len(trajectory.windows),
        "agent_negative_utterances": agent_negative_count
    }

4. Automatically Creating QM Evaluation Tasks

For interactions with a risk_score >= 70, automatically create a Genesys Cloud QM Evaluation task assigned to the agent’s supervisor.

def create_flagged_evaluation(
    conversation_id: str,
    agent_id: str,
    risk_data: dict,
    evaluation_form_id: str,
    supervisor_id: str,
    access_token: str
):
    """Creates a QM evaluation task for a high-risk interaction."""
    
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json"
    }
    
    # Build the evaluation note from AI insights
    note = (
        f"[AI FLAGGED - Risk Score: {risk_data['risk_score']}/100]\n"
        f"Trajectory: {risk_data['trajectory_type']}\n"
        f"Initial Sentiment: {risk_data['initial_sentiment']}\n"
        f"Final Sentiment: {risk_data['final_sentiment']}\n"
        f"Agent Negative Utterances: {risk_data['agent_negative_utterances']}\n"
        f"Please review for empathy and de-escalation opportunities."
    )
    
    # Field names are illustrative; confirm them against the Evaluation
    # schema in the Genesys Cloud Quality API reference before deploying.
    payload = {
        "conversationId": conversation_id,
        "agentId": agent_id,
        "evaluationForm": {"id": evaluation_form_id},
        "assignedTo": {"id": supervisor_id},
        "releaseDate": datetime.utcnow().isoformat() + "Z",
        "status": "PENDING",
        "neverRelease": False,
        "agentHasRead": False,
        # Pre-fill the AI briefing note visible to the evaluator
        "resourceStrings": {"note": note}
    }
    
    # Evaluations are created under the conversation they score
    resp = requests.post(
        f"{GENESYS_API}/api/v2/quality/conversations/{conversation_id}/evaluations",
        headers=headers,
        json=payload,
        timeout=30
    )
    resp.raise_for_status()
    return resp.json()["id"]
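Before calling create_flagged_evaluation, the daily batch has to be narrowed to the interactions worth flagging. The following is a minimal sketch of that step, assuming each scored record is the dict from score_interaction with a conversation_id key added by the caller (that key and the helper name select_for_review are assumptions). The threshold of 70 comes from this section; the capacity of 50 reuses the review capacity from the sampling example.

```python
def select_for_review(
    scored: list[dict],
    threshold: int = 70,
    capacity: int = 50,
) -> list[dict]:
    """Keep records at or above the risk threshold, highest risk first,
    capped at the number of evaluations the team can handle per day."""
    flagged = [s for s in scored if s["risk_score"] >= threshold]
    flagged.sort(key=lambda s: s["risk_score"], reverse=True)
    return flagged[:capacity]

# Hypothetical scored batch
batch = [
    {"conversation_id": "a", "risk_score": 85},
    {"conversation_id": "b", "risk_score": 20},
    {"conversation_id": "c", "risk_score": 100},
    {"conversation_id": "d", "risk_score": 70},
]
for record in select_for_review(batch, capacity=2):
    print(record["conversation_id"], record["risk_score"])
```

Capping the list keeps the QM queue from flooding on a bad day; anything above the cap can be carried over or dropped depending on your review policy.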

Validation, Edge Cases & Troubleshooting

Edge Case 1: Language Misdetection

The pipeline above hard-codes LanguageCode='en' in its detect_sentiment call; Comprehend requires an explicit language code and does not infer one. If a customer speaks French or Spanish and the API is called without the correct language code, the sentiment scores will be unreliable.
Solution: Run detect_dominant_language on the first 3 utterances of the conversation to determine the language, then pass the detected language code to detect_sentiment. If the language is unsupported by Comprehend, fall back to a neutral sentiment label and skip the interaction for automated flagging.
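A minimal sketch of that language check follows. It takes the Comprehend client as a parameter so it can be exercised without AWS credentials. The helper name, the 0.8 minimum score, and the hard-coded set of supported codes are assumptions; verify the supported-language list against the current Comprehend documentation.

```python
from typing import Optional

# ASSUMPTION: languages supported by Comprehend sentiment analysis;
# confirm against the current AWS documentation.
SUPPORTED_SENTIMENT_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "ar", "hi", "ja", "ko", "zh", "zh-TW",
}

def detect_language(client, utterances: list[str],
                    min_score: float = 0.8) -> Optional[str]:
    """Return a language code usable with detect_sentiment, or None if the
    conversation should be skipped for automated flagging."""
    # Use the first 3 utterances as the detection sample
    sample = " ".join(utterances[:3]).strip()
    if not sample:
        return None

    response = client.detect_dominant_language(Text=sample)
    languages = response.get("Languages", [])
    if not languages:
        return None

    best = max(languages, key=lambda lang: lang["Score"])
    if best["Score"] < min_score:
        return None  # Detection too uncertain to trust
    if best["LanguageCode"] not in SUPPORTED_SENTIMENT_LANGUAGES:
        return None  # Unsupported language: skip automated flagging
    return best["LanguageCode"]
```

In the pipeline, pass the existing COMPREHEND client and the transcript texts, then use the returned code as LanguageCode in detect_sentiment. A None result means the interaction is left to normal manual sampling.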

Edge Case 2: Short Interactions Skewing the Risk Score

A 30-second call where a customer immediately says “Never mind, I figured it out” might have only 2 utterances. If the first is labeled NEUTRAL and the second comes back NEGATIVE because of its curt tone, the trajectory will be “DETERIORATION” and the risk score will be 85, even though it was a trivially short, benign interaction.
Solution: Add a minimum interaction threshold: only process interactions longer than 2 minutes (120 seconds) and with more than 8 utterances. Short interactions rarely have enough signal for meaningful trajectory analysis.
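The threshold can be expressed as a small guard that runs before any sentiment analysis. This is a sketch using the two limits stated above; the function and constant names are illustrative.

```python
MIN_DURATION_SECONDS = 120  # 2 minutes
MIN_UTTERANCES = 8

def has_enough_signal(duration_seconds: float, utterance_count: int) -> bool:
    """True only for interactions longer than 2 minutes AND with more than
    8 utterances; both limits are exclusive, matching the rule above."""
    return (duration_seconds > MIN_DURATION_SECONDS
            and utterance_count > MIN_UTTERANCES)

print(has_enough_signal(30, 2))    # the "never mind" call
print(has_enough_signal(300, 24))  # a typical support call
```

Running the guard first also avoids paying for NLP calls on interactions that would be discarded anyway.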

Edge Case 3: The Positive-Trajectory False Negative

An agent who has mastered the art of sounding pleasant while giving incorrect information will produce a “RECOVERY” or “POSITIVE” trajectory. The customer might leave satisfied but later discover the information was wrong.
Solution: Supplement sentiment analysis with a factual accuracy check. After scoring sentiment, pass the transcript through an LLM with access to your knowledge base and ask: “Did the agent provide accurate information about [detected topic]?” Use this as a second, independent QA signal alongside the sentiment score.
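One way to wire the two signals together is sketched below. The LLM call is deliberately left as an injected function, ask_llm, because the guide does not prescribe a provider; that callable, the YES/NO answer convention, and the helper names are all assumptions.

```python
from typing import Callable

def build_accuracy_prompt(topic: str, transcript: str) -> str:
    """Builds the factual-accuracy question for the LLM."""
    return (
        f"Did the agent provide accurate information about {topic}?\n"
        "Answer YES or NO on the first line, then explain briefly.\n\n"
        f"Transcript:\n{transcript}"
    )

def needs_review(
    risk_score: int,
    transcript: str,
    topic: str,
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, answer text out
    threshold: int = 70,
) -> bool:
    """Flag when EITHER signal fires: high sentiment risk or a failed
    accuracy check. The accuracy check is skipped if sentiment already flags."""
    if risk_score >= threshold:
        return True
    answer = ask_llm(build_accuracy_prompt(topic, transcript))
    return answer.strip().upper().startswith("NO")
```

Treating the signals as independent (a logical OR) is what catches the pleasant-but-wrong agent: a low sentiment risk score no longer exempts the interaction from review.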
