Implementing Automated Bot Performance Auditing using Ground-Truth Human Evaluations

StarAdmin · November 21, 2025, 9:00am

What This Guide Covers

This guide shows how to build a continuous feedback loop that objectively measures voice bot and chatbot performance using Genesys Cloud Quality Management (QM).
It routes “Bot Hand-Offs” to a specialized human auditing queue, where human evaluators score the bot’s accuracy and empathy and record the reason for each containment failure.
The end result is a dashboard that quantifies why your bots are failing, driven by human ground-truth data rather than guesses based on raw NLU confidence scores.

Prerequisites, Roles & Licensing

Licensing: Genesys Cloud CX 2 or 3 (Quality Management).
Permissions: Quality > Evaluation > Edit, Architect > Flow > Edit, Routing > Queue > Edit.
Infrastructure: A designated team of QA evaluators trained to audit Bot transcripts.

The Implementation Deep-Dive

1. The Fallacy of “Containment Rate”

Most contact centers judge their bots by a single metric: Containment Rate (e.g., “The bot handled 40% of calls without an agent”).

The Trap:
A high containment rate does not equal a good customer experience. A poorly designed bot that frustrates the customer until they hang up is technically “contained.” Relying solely on NLU confidence scores is equally flawed: a bot can be 99% confident it understood the customer and still execute the wrong business logic. You need qualitative human review.
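To make the trap concrete, here is a minimal sketch that compares raw containment with containment after auditors flag frustrated abandons. The `BotSession` fields and the sample numbers are illustrative assumptions, not Genesys Cloud data structures.

```python
from dataclasses import dataclass


@dataclass
class BotSession:
    # Hypothetical fields for illustration only.
    escalated: bool          # the bot handed off to a human agent
    abandoned_in_loop: bool  # an auditor judged that the customer gave up on a failing bot


def containment_rate(sessions):
    """Raw containment: share of sessions that never reached an agent."""
    contained = [s for s in sessions if not s.escalated]
    return len(contained) / len(sessions)


def resolved_containment_rate(sessions):
    """Containment that excludes sessions flagged as frustrated abandons."""
    resolved = [s for s in sessions if not s.escalated and not s.abandoned_in_loop]
    return len(resolved) / len(sessions)


# 10 made-up sessions: 6 escalated, 3 "contained" by frustration, 1 genuinely resolved.
sessions = (
    [BotSession(escalated=True, abandoned_in_loop=False)] * 6
    + [BotSession(escalated=False, abandoned_in_loop=True)] * 3
    + [BotSession(escalated=False, abandoned_in_loop=False)] * 1
)

print(containment_rate(sessions))           # 0.4
print(resolved_containment_rate(sessions))  # 0.1
```

In this made-up sample the headline number says 40% containment, while only 10% of customers were actually served by the bot. The gap is only visible once a human has reviewed the contained sessions.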

2. Designing the Bot Evaluation Form

You must treat the Bot as an “Agent” and evaluate it using a Quality Management form.

Implementation Steps:

Navigate to Admin > Quality > Evaluation Forms.
Create a new form titled Bot Performance Audit v1.
Do not use standard agent questions (like “Did they say the company name?”). Instead, use bot-specific questions:
- Q1: Did the NLU correctly identify the primary intent? (Yes/No)
- Q2: Did the Bot extract the correct slot entities (e.g., Account Number)? (Yes/No)
- Q3: Was the Bot’s response clear and contextually accurate? (Yes/No)
- Q4: Why did the customer escalate to a human? (Multiple Choice: NLU Failure / Logic Loop / Complex Request / System Error / Asked for Agent Immediately)
Set the form to evaluate the Bot participant, not the Agent participant.
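The form above can be modeled offline to reason about scoring and reporting before you build it in the UI. The sketch below is not the Genesys Cloud form schema. The `BotAudit` class, the equal weighting of Q1 to Q3, and treating Q4 as an unscored reason code are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Q4 answer options, copied from the form definition above.
ESCALATION_REASONS = (
    "NLU Failure",
    "Logic Loop",
    "Complex Request",
    "System Error",
    "Asked for Agent Immediately",
)


@dataclass
class BotAudit:
    intent_correct: bool                     # Q1
    slots_correct: bool                      # Q2
    response_accurate: bool                  # Q3
    escalation_reason: Optional[str] = None  # Q4; None for contained interactions

    def __post_init__(self):
        if self.escalation_reason is not None and self.escalation_reason not in ESCALATION_REASONS:
            raise ValueError(f"Unknown escalation reason: {self.escalation_reason}")

    def score(self) -> float:
        """Percentage of Yes answers across Q1 to Q3 (equal weights are an assumption)."""
        answers = (self.intent_correct, self.slots_correct, self.response_accurate)
        return 100.0 * sum(answers) / len(answers)


def reason_breakdown(audits):
    """Count Q4 answers so the most common escalation reasons surface first."""
    return Counter(a.escalation_reason for a in audits if a.escalation_reason)


audits = [
    BotAudit(True, True, False, "Logic Loop"),
    BotAudit(False, False, False, "NLU Failure"),
    BotAudit(True, False, True, "Logic Loop"),
]
print(round(audits[0].score(), 1))             # 66.7
print(reason_breakdown(audits).most_common(1))  # [('Logic Loop', 2)]
```

Keeping Q4 as a reason code rather than a scored question means the score reflects how well the bot performed, while the breakdown explains why customers left it.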

3. Architecting the Escalation Audit Flow

You cannot evaluate every bot interaction. You need a targeted sampling mechanism. We will focus on “Escalations” (when the bot hands off to a human) because that represents a containment failure.

Architectural Reasoning:
If you force a QA auditor to manually search for bot interactions, they will waste hours. You must push the interactions to them using a specialized Queue.

Implementation Steps:

In your Architect Bot Flow, locate the path where the bot escalates to an agent (e.g., the Transfer to ACD block).
Just before the transfer, use a Set Participant Data action to stamp the interaction with BotFailureReason = "Escalated".
Create a dedicated Quality Management Policy (Admin > Quality > Policies).
Set the condition: If Participant Data BotFailureReason exists AND the interaction contains a Bot participant.
Set the Action: Assign an Evaluation using the Bot Performance Audit v1 form.
Route these evaluations directly to the inbox of your specialized “Bot QA” team.
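The policy condition in the steps above boils down to a simple predicate. The sketch below expresses it in Python so you can test the matching logic against exported samples. The dictionary shape (`participants`, `purpose`, `attributes`) and the `"bot"` purpose value are hypothetical and are not taken from the Genesys Cloud API schema.

```python
def matches_escalation_policy(interaction: dict) -> bool:
    """True when the interaction has a bot participant and carries the BotFailureReason stamp."""
    participants = interaction.get("participants", [])
    has_bot = any(p.get("purpose") == "bot" for p in participants)
    has_stamp = any("BotFailureReason" in p.get("attributes", {}) for p in participants)
    return has_bot and has_stamp


# Hypothetical interaction records for illustration.
escalated = {
    "participants": [
        {"purpose": "customer", "attributes": {"BotFailureReason": "Escalated"}},
        {"purpose": "bot", "attributes": {}},
        {"purpose": "agent", "attributes": {}},
    ]
}
contained = {
    "participants": [
        {"purpose": "customer", "attributes": {}},
        {"purpose": "bot", "attributes": {}},
    ]
}

print(matches_escalation_policy(escalated))  # True
print(matches_escalation_policy(contained))  # False
```

Requiring both conditions keeps agent-only interactions that happen to carry a stale attribute out of the Bot QA inbox.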

4. Evaluating “Contained” Interactions

Escalations tell you why the bot failed. But you must also audit a random sample of contained interactions to ensure the bot isn’t “containing” customers by frustrating them into hanging up.

Implementation Steps:

Create a second Quality Management Policy.
Set the condition: If interaction contains a Bot participant AND did NOT transfer to an ACD queue.
Set the Action: Randomly assign 5% of matching interactions for Evaluation.
In the Evaluation Form for contained calls, add a specific fatal question: Did the customer abandon the interaction due to bot failure/looping? If Yes, the bot’s score for that interaction is 0%.
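The sampling and fatal-question rules above can be sketched as follows. This is an offline illustration, not how the policy engine is implemented: `sample_for_audit` and `contained_score` are hypothetical helpers, and the independent per-interaction draw is an assumption about what “randomly assign 5%” means.

```python
import random


def sample_for_audit(interaction_ids, rate=0.05, seed=None):
    """Select each interaction independently with probability `rate`."""
    rng = random.Random(seed)  # a seed makes the sample reproducible for testing
    return [i for i in interaction_ids if rng.random() < rate]


def contained_score(base_score: float, abandoned_due_to_bot: bool) -> float:
    """Apply the fatal question: a Yes answer overrides the score with 0%."""
    return 0.0 if abandoned_due_to_bot else base_score


picked = sample_for_audit(range(1000), rate=0.05, seed=7)
print(len(picked))                  # roughly 50; the exact count varies with the seed
print(contained_score(85.0, True))   # 0.0
print(contained_score(85.0, False))  # 85.0
```

Because each interaction is drawn independently, the sample size fluctuates around 5% rather than hitting it exactly, which is usually acceptable for auditing purposes.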

Validation, Edge Cases & Troubleshooting

Edge Case 1: The “Silent Abandon” in Chat

The Failure Condition: An auditor is reviewing a Web Chat. The bot asked “Please enter your account number.” The customer never replied, and the chat timed out after 15 minutes. The auditor cannot tell whether the customer resolved the issue on their own or gave up in frustration.
The Root Cause: Asynchronous channels often end in “silent abandons” that are hard to classify.
The Solution: Train auditors to review the preceding turns. Did the bot ask for the account number in a confusing way? Did the customer provide the number, only for the bot to fail to parse it (for example, with a regex) and ask again? If the bot repeated a prompt immediately before the abandon, classify it as a bot failure. If the bot provided the correct knowledge article and the customer then abandoned, classify it as a successful containment.
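The repeated-prompt rule can also be applied as a pre-screen before a human looks at the transcript. The sketch below assumes a transcript is a list of `(speaker, text)` tuples with the speaker labels `"bot"` and `"customer"`. That shape and the `classify_silent_abandon` helper are hypothetical, and anything the rule cannot decide is left for the auditor.

```python
def classify_silent_abandon(transcript):
    """Flag an abandon as a bot failure when the bot's last two prompts are identical."""
    bot_turns = [text.strip().lower() for speaker, text in transcript if speaker == "bot"]
    if len(bot_turns) >= 2 and bot_turns[-1] == bot_turns[-2]:
        return "bot_failure"
    # Everything else (for example, a knowledge article followed by silence)
    # still needs a human judgment call.
    return "needs_review"


looping_chat = [
    ("bot", "Please enter your account number."),
    ("customer", "It's 4471-223"),
    ("bot", "Please enter your account number."),
]
article_chat = [
    ("customer", "How do I reset my password?"),
    ("bot", "Here is the article on password resets."),
]

print(classify_silent_abandon(looping_chat))  # bot_failure
print(classify_silent_abandon(article_chat))  # needs_review
```

Exact string matching is deliberately strict. A bot that rephrases its reprompt will slip through, so treat this as a way to prioritize the audit queue rather than a replacement for review.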

Edge Case 2: Bot Form Calibration

The Failure Condition: Two QA evaluators review the exact same bot transcript. Evaluator A gives the bot a 90%, arguing the NLU worked perfectly. Evaluator B gives the bot a 40%, arguing the backend API returned the wrong data.
The Root Cause: Subjective interpretation of “Bot Performance.”
The Solution: Run Calibration Sessions for your Bot QA team just as you do for human agents. Because a bot follows its designed flow rather than making individual judgment calls, any failure is ultimately a design failure. Your evaluation form must strictly separate NLU failures (the bot didn’t understand) from Logic failures (the bot understood, but the backend data dip failed).
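One way to check whether calibration is working is to measure agreement between two evaluators on the same set of transcripts. The sketch below computes Cohen's kappa for a single Yes/No question. Using kappa here is a suggestion added for illustration, not something Genesys Cloud calculates for you, and the sample answers are made up.

```python
from collections import Counter


def cohen_kappa(answers_a, answers_b):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("Both raters must score the same non-empty set of transcripts")
    n = len(answers_a)
    observed = sum(a == b for a, b in zip(answers_a, answers_b)) / n
    counts_a, counts_b = Counter(answers_a), Counter(answers_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both raters gave the same single answer everywhere
    return (observed - expected) / (1.0 - expected)


# Q1 ("Did the NLU correctly identify the primary intent?") on four shared transcripts.
evaluator_a = [True, True, False, True]
evaluator_b = [True, False, False, True]

print(cohen_kappa(evaluator_a, evaluator_b))  # 0.5
```

Running this per question rather than on the overall score shows where evaluators diverge. Low agreement on Q1 but high agreement on Q3, for instance, suggests the team lacks a shared definition of an NLU failure.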
