Designing Robust QA Evaluation Workflows Using AI-Driven Sentiment Analysis
What This Guide Covers
You are augmenting your Genesys Cloud Quality Management (QM) evaluation workflow with an AI-driven sentiment analysis pipeline. When complete, your QA team will no longer randomly select interactions for manual evaluation. Instead, 100% of interactions will be automatically scored for sentiment trajectory (improving vs. deteriorating), escalation risk, and empathy gaps using an NLP model. The top failure candidates are surfaced automatically in the QM queue with pre-filled AI insights, allowing evaluators to focus entirely on coaching conversations rather than triage.
Prerequisites, Roles & Licensing
- Genesys Cloud: CX 2 or 3 with Quality Management.
- Permissions required:
  - Analytics > Conversation Detail > View
  - Quality > Evaluation > Add, Edit
  - Quality > Evaluation Form > View
  - Recording > Recording > View (to attach recordings to evaluations)
- Infrastructure:
  - An NLP backend (AWS Comprehend, Google Natural Language API, or a local model).
  - A daily ETL pipeline to process completed interactions and create flagged evaluation tasks.
The Implementation Deep-Dive
1. The Random Sampling Problem
The most fundamental flaw in traditional contact center QA is random selection. If 1,000 calls occur per day and your team can manually evaluate 50, you review 5%. Of those 5%:
- ~40% will be perfectly normal interactions that need no coaching.
- ~10% will be the genuinely problematic calls your team is trying to find.
Since only about 10% of sampled calls are the ones your team is looking for, the bulk of evaluation time goes to interactions with little or no coaching value. AI-driven triage inverts this.
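The arithmetic behind this can be sketched in a few lines. This is an illustration only: it assumes the ~10% problem rate applies to the whole call population (which is what uniform random sampling implies), and the function name is hypothetical.

```python
def expected_problem_calls_found(total_calls: int, reviewed_calls: int, problem_rate: float) -> float:
    """Expected number of problematic calls that land in a uniform random sample."""
    sampled_fraction = reviewed_calls / total_calls
    total_problem_calls = total_calls * problem_rate
    return total_problem_calls * sampled_fraction


# Figures from the example above: 1,000 calls per day, 50 reviewed, ~10% problematic
total_problem = 1000 * 0.10
found = expected_problem_calls_found(1000, 50, 0.10)
print(f"problematic calls per day: {total_problem:.0f}")
print(f"expected to be reviewed:   {found:.0f}")
print(f"expected to go unreviewed: {total_problem - found:.0f}")
```

Under these assumptions, random sampling reaches about 5 of roughly 100 problematic calls per day; the remaining ~95 are never reviewed.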
2. Defining the Sentiment Trajectory Model
Rather than a simple “positive/negative” label, use a Sentiment Trajectory that captures how the interaction evolved over time.
from dataclasses import dataclass
from typing import Literal


@dataclass
class SentimentWindow:
    utterance_index: int
    utterance_text: str
    speaker: Literal["AGENT", "CUSTOMER"]
    # Amazon Comprehend can also return "MIXED"
    sentiment: Literal["POSITIVE", "NEUTRAL", "NEGATIVE", "MIXED"]
    confidence: float


@dataclass
class SentimentTrajectory:
    conversation_id: str
    windows: list[SentimentWindow]

    @property
    def initial_sentiment(self) -> str:
        return self.windows[0].sentiment if self.windows else "NEUTRAL"

    @property
    def final_sentiment(self) -> str:
        return self.windows[-1].sentiment if self.windows else "NEUTRAL"

    @property
    def trajectory_type(self) -> str:
        """Classifies the overall sentiment arc."""
        initial = self.initial_sentiment
        final = self.final_sentiment
        if initial == "NEGATIVE" and final == "POSITIVE":
            return "RECOVERY"  # Agent turned it around - good example
        elif initial in ("POSITIVE", "NEUTRAL", "MIXED") and final == "NEGATIVE":
            return "DETERIORATION"  # Agent may have caused frustration - flag for review
        elif initial == "NEGATIVE" and final == "NEGATIVE":
            return "SUSTAINED_NEGATIVE"  # Persistent issue - flag
        elif final == "POSITIVE":
            return "POSITIVE"  # Good interaction
        else:
            return "NEUTRAL"
Why trajectory matters: A call that starts negative and ends positive (“RECOVERY”) is a coaching success story that should be shared with the team. A call that starts neutral and ends negative (“DETERIORATION”) is where the agent may have contributed to the problem and is worth a closer look.
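To see how first and last sentiment labels map to a trajectory, the classification rules can be walked through as a standalone function. This is a sketch only: classify_arc is a hypothetical helper that mirrors the trajectory_type rules for the three basic labels, so it can be run without the dataclasses.

```python
def classify_arc(initial: str, final: str) -> str:
    """Maps the first and last sentiment labels of a call to a trajectory type."""
    if initial == "NEGATIVE" and final == "POSITIVE":
        return "RECOVERY"
    if initial in ("POSITIVE", "NEUTRAL") and final == "NEGATIVE":
        return "DETERIORATION"
    if initial == "NEGATIVE" and final == "NEGATIVE":
        return "SUSTAINED_NEGATIVE"
    if final == "POSITIVE":
        return "POSITIVE"
    return "NEUTRAL"


# (initial, final) pairs covering each branch of the rules
examples = [
    ("NEGATIVE", "POSITIVE"),
    ("NEUTRAL", "NEGATIVE"),
    ("NEGATIVE", "NEGATIVE"),
    ("NEUTRAL", "POSITIVE"),
    ("NEGATIVE", "NEUTRAL"),
]
for initial, final in examples:
    print(f"{initial:>8} -> {final:<8} = {classify_arc(initial, final)}")
```

Note that a call going from negative to neutral is labeled “NEUTRAL” under these rules, not “RECOVERY”; whether partial recoveries deserve their own category is a design choice for your team.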
3. Building the Sentiment Analysis Pipeline
import boto3
import requests
from datetime import datetime, timedelta

COMPREHEND = boto3.client('comprehend', region_name='us-east-1')
GENESYS_API = "https://api.mypurecloud.com"


def analyze_conversation_sentiment(conversation: dict) -> SentimentTrajectory:
    """
    Analyzes the sentiment of each utterance in a transcript.
    Expects a conversation dict from the Analytics Detail API, with each
    segment assumed to carry its text in a "transcript" field.
    """
    utterances = []
    # Extract utterances from the transcript
    for participant in conversation.get("participants", []):
        role = "AGENT" if participant.get("purpose") == "agent" else "CUSTOMER"
        for segment in participant.get("segments", []):
            text = segment.get("transcript", "")
            if not text or len(text.strip()) < 10:
                continue  # Skip empty or very short utterances
            utterances.append((segment.get("segmentStart", ""), role, text))

    # Participants are iterated one at a time, so restore chronological order;
    # otherwise the first and last windows would not reflect the real arc.
    # Assumes "segmentStart" is an ISO-8601 timestamp, which sorts as a string.
    utterances.sort(key=lambda u: u[0])

    windows = []
    for _, role, text in utterances:
        # Analyze with Amazon Comprehend
        result = COMPREHEND.detect_sentiment(Text=text, LanguageCode='en')
        windows.append(SentimentWindow(
            utterance_index=len(windows),
            utterance_text=text[:200],  # Truncate for storage
            speaker=role,
            sentiment=result['Sentiment'],
            confidence=max(result['SentimentScore'].values())
        ))
    return SentimentTrajectory(
        conversation_id=conversation['conversationId'],
        windows=windows
    )


def score_interaction(trajectory: SentimentTrajectory) -> dict:
    """
    Converts a sentiment trajectory into a QA risk score.
    Returns a dict suitable for attaching to a Genesys QM evaluation.
    """
    trajectory_type = trajectory.trajectory_type
    # Base risk scores
    risk_score_map = {
        "DETERIORATION": 85,
        "SUSTAINED_NEGATIVE": 90,
        "NEUTRAL": 20,
        "POSITIVE": 10,
        "RECOVERY": 15
    }
    risk_score = risk_score_map.get(trajectory_type, 30)
    # Amplify the score if the agent has three or more negative utterances
    agent_negative_count = sum(
        1 for w in trajectory.windows
        if w.speaker == "AGENT" and w.sentiment == "NEGATIVE"
    )
    if agent_negative_count >= 3:
        risk_score = min(100, risk_score + 15)
    return {
        "risk_score": risk_score,
        "trajectory_type": trajectory_type,
        "initial_sentiment": trajectory.initial_sentiment,
        "final_sentiment": trajectory.final_sentiment,
        "utterance_count": len(trajectory.windows),
        "agent_negative_utterances": agent_negative_count
    }
4. Automatically Creating QM Evaluation Tasks
For interactions with a risk_score >= 70, automatically create a Genesys Cloud QM Evaluation task assigned to the agent’s supervisor.
def create_flagged_evaluation(
    conversation_id: str,
    agent_id: str,
    risk_data: dict,
    evaluation_form_id: str,
    supervisor_id: str,
    access_token: str
):
    """Creates a QM evaluation task for a high-risk interaction."""
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json"
    }
    # Build the evaluation note from AI insights
    note = (
        f"[AI FLAGGED - Risk Score: {risk_data['risk_score']}/100]\n"
        f"Trajectory: {risk_data['trajectory_type']}\n"
        f"Initial Sentiment: {risk_data['initial_sentiment']}\n"
        f"Final Sentiment: {risk_data['final_sentiment']}\n"
        f"Agent Negative Utterances: {risk_data['agent_negative_utterances']}\n"
        f"Please review for empathy and de-escalation opportunities."
    )
    payload = {
        "conversationId": conversation_id,
        "agentId": agent_id,
        "evaluationForm": {"id": evaluation_form_id},
        "assignedTo": {"id": supervisor_id},
        "releaseDate": datetime.utcnow().isoformat() + "Z",
        "status": "PENDING",
        "neverRelease": False,
        "agentHasRead": False,
        # Pre-fill the agent briefing note visible to the evaluator
        "resourceStrings": {"note": note}
    }
    resp = requests.post(
        f"{GENESYS_API}/api/v2/quality/evaluations",
        headers=headers,
        json=payload,
        timeout=30  # Avoid hanging the ETL run on a stalled connection
    )
    resp.raise_for_status()
    return resp.json()["id"]
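Before calling the evaluation endpoint, the daily batch needs to be filtered down to the interactions that cross the risk_score >= 70 threshold and ordered so the worst come first. A minimal sketch follows; select_for_review and its limit parameter are hypothetical, and the input is assumed to be a list of dicts shaped like the output of score_interaction with a conversation_id added.

```python
from typing import Optional


def select_for_review(scored: list[dict], threshold: int = 70, limit: Optional[int] = None) -> list[dict]:
    """Keeps interactions at or above the threshold, highest risk first."""
    flagged = [s for s in scored if s["risk_score"] >= threshold]
    flagged.sort(key=lambda s: s["risk_score"], reverse=True)
    return flagged if limit is None else flagged[:limit]


# Illustrative batch; conversation IDs are made up
batch = [
    {"conversation_id": "conv-a", "risk_score": 85},
    {"conversation_id": "conv-b", "risk_score": 20},
    {"conversation_id": "conv-c", "risk_score": 100},
    {"conversation_id": "conv-d", "risk_score": 70},
]
for item in select_for_review(batch, limit=2):
    print(item["conversation_id"], item["risk_score"])
```

Capping the list with limit keeps the QM queue aligned with how many evaluations your team can actually complete in a day.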
Validation, Edge Cases & Troubleshooting
Edge Case 1: Language Misdetection
The pipeline above hard-codes LanguageCode='en' when calling Amazon Comprehend’s detect_sentiment, which requires an explicit language code. If a customer speaks French or Spanish, the text is still scored as English and the sentiment scores will be unreliable.
Solution: Run detect_dominant_language on the first 3 utterances of the conversation to determine the language, then pass the detected language code to detect_sentiment. If the language is unsupported by Comprehend, fall back to a neutral sentiment label and skip the interaction for automated flagging.
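The language check described above could look roughly like this. It is a sketch under stated assumptions: the function name, the 0.8 score cutoff, and the supported-language set are illustrative (check the Comprehend documentation for the current list), and the client is passed in so the function works with any object exposing boto3's detect_dominant_language call.

```python
from typing import Optional

# Assumption: languages accepted by detect_sentiment at the time of writing
SUPPORTED_SENTIMENT_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "ar", "hi", "ja", "ko", "zh", "zh-TW",
}


def detect_conversation_language(
    client,
    utterances: list[str],
    sample_size: int = 3,
    min_score: float = 0.8,  # Hypothetical cutoff; tune for your traffic
) -> Optional[str]:
    """Returns a language code usable with detect_sentiment, or None to skip."""
    sample = " ".join(u.strip() for u in utterances[:sample_size] if u and u.strip())
    if not sample:
        return None
    response = client.detect_dominant_language(Text=sample)
    languages = response.get("Languages", [])
    if not languages:
        return None
    best = max(languages, key=lambda lang: lang["Score"])
    if best["Score"] < min_score:
        return None  # Detection too uncertain to trust
    if best["LanguageCode"] not in SUPPORTED_SENTIMENT_LANGUAGES:
        return None  # Unsupported: skip automated flagging for this interaction
    return best["LanguageCode"]
```

A caller would pass the returned code as LanguageCode to detect_sentiment, and treat None as the signal to fall back to a neutral label and exclude the interaction from automated flagging.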
Edge Case 2: Short Interactions Skewing the Risk Score
A 30-second call where a customer immediately says “Never mind, I figured it out” might have only 2 utterances. If the first utterance is neutral and the second is scored as slightly negative, the trajectory will be “DETERIORATION” and the risk score will be 85, even though it was a trivially short, benign interaction.
Solution: Add a minimum interaction threshold: only process interactions longer than 2 minutes (120 seconds) and with more than 8 utterances. Short interactions rarely have enough signal for meaningful trajectory analysis.
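The threshold can be applied as a simple gate before any sentiment calls are made. A minimal sketch, with hypothetical names; duration and utterance count are assumed to be computed upstream from the conversation record.

```python
MIN_DURATION_SECONDS = 120  # Longer than 2 minutes
MIN_UTTERANCES = 8          # More than 8 utterances


def is_eligible_for_trajectory(duration_seconds: float, utterance_count: int) -> bool:
    """True only when the interaction has enough signal for trajectory analysis."""
    return duration_seconds > MIN_DURATION_SECONDS and utterance_count > MIN_UTTERANCES


print(is_eligible_for_trajectory(30, 2))     # The short "never mind" call
print(is_eligible_for_trajectory(310, 24))   # A typical longer call
```

Running the gate first also avoids paying for NLP calls on interactions that would be discarded anyway.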
Edge Case 3: The Positive-Trajectory False Negative
An agent who has mastered the art of sounding pleasant while giving incorrect information will produce a “RECOVERY” or “POSITIVE” trajectory. The customer might leave satisfied but later discover the information was wrong.
Solution: Supplement sentiment analysis with a factual accuracy check. After scoring sentiment, pass the transcript through an LLM with access to your knowledge base and ask: “Did the agent provide accurate information about [detected topic]?” Use this as a second, independent QA signal alongside the sentiment score.
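One way to phrase that question consistently is to build the prompt from a template. The sketch below only assembles the prompt text; it does not call any model. The function name, the answer labels, and the idea of passing knowledge base excerpts as a list are all assumptions for illustration.

```python
def build_accuracy_prompt(topic: str, transcript: str, kb_excerpts: list[str]) -> str:
    """Assembles a factual-accuracy review prompt for an LLM of your choice."""
    # Number the excerpts so the model can cite which ones it relied on
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(kb_excerpts, start=1))
    return (
        "You are reviewing a contact center transcript for factual accuracy.\n\n"
        f"Knowledge base excerpts:\n{sources}\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Did the agent provide accurate information about {topic}? "
        "Answer ACCURATE, INACCURATE, or UNVERIFIABLE, "
        "and cite the excerpt numbers you relied on."
    )


prompt = build_accuracy_prompt(
    topic="refund eligibility",
    transcript="AGENT: You can get a refund within 90 days.",
    kb_excerpts=["Refunds are available within 30 days of purchase."],
)
print(prompt)
```

Keeping the verdict to a small fixed set of labels makes it straightforward to store the result next to the sentiment risk score as a separate signal.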