Designing Calibration Session Workflows for Standardizing Evaluator Scoring Consistency
What This Guide Covers
You are designing a formal Quality Management calibration process within Genesys Cloud that ensures multiple supervisors and quality evaluators score the same agent interactions consistently. The goal is to eliminate the inter-rater reliability problem, where one evaluator gives a call 85/100 and another gives the identical call 71/100 based on subjective interpretation. When complete, your calibration workflow will systematically select representative interaction recordings, distribute them to a defined panel of evaluators, collect independent scores through Genesys Cloud Quality Management evaluation forms, compare results statistically, facilitate a calibration discussion session to resolve significant variances, and update scoring rubric documentation to codify agreed interpretations. The result is a quality program that measures agent performance rather than evaluator variance.
Prerequisites, Roles & Licensing
- Genesys Cloud: CX 3 with the Quality Management module.
- Permissions required:
  - Quality > Evaluation > View/Create/Delete
  - Quality > Calibration > Add/Edit/View
- Process requirement: Calibration works best when done monthly (for stable teams) or bi-weekly (for new QM programs). Establish a fixed calibration schedule before implementing automated workflows.
The Implementation Deep-Dive
1. Why Calibration Matters - The Inter-Rater Reliability Problem
Without calibration, quality scores are fundamentally unreliable:
| Scenario | Evaluator A Score | Evaluator B Score | Difference (B - A) |
|---|---|---|---|
| Agent uses empathy phrase | 90 | 78 | -12 pts |
| Agent interrupts customer once | 75 | 85 | +10 pts |
| Agent offers proactive info | 88 | 70 | -18 pts |
| Same agent, same week | avg: 84 | avg: 78 | 6 pts systematic bias |
A 6-point systematic evaluator bias means Agent A (evaluated by Evaluator A) consistently appears to perform better than Agent B (evaluated by Evaluator B), even if their actual performance is identical. Calibration reduces this bias by aligning each evaluator's interpretation of every rubric criterion.
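The bias in the last table row is the average of the per-call differences between the two evaluators. A minimal sketch of that arithmetic follows, using the three example scores from the table. The function name `mean_signed_difference` is illustrative and not part of any Genesys API. Before rounding, the result is closer to 6.7 points than 6.

```python
from statistics import mean


def mean_signed_difference(scores_a, scores_b):
    """Mean of (B - A) over paired scores for the same interactions."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    return mean(b - a for a, b in zip(scores_a, scores_b))


# Scores from the three scenario rows in the table above
evaluator_a = [90, 75, 88]
evaluator_b = [78, 85, 70]

bias = mean_signed_difference(evaluator_a, evaluator_b)
print(f"Evaluator B scores {bias:+.1f} points relative to Evaluator A on average")
```

A negative value means Evaluator B is the stricter scorer. A value near zero with large individual differences points to inconsistent interpretation rather than systematic bias.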
2. Interaction Selection for Calibration
Calibration sessions require carefully selected “calibration interactions,” meaning recordings that represent specific scenarios rather than the easiest or most average calls:
import random
from datetime import datetime, timedelta, timezone

import requests

GENESYS_API = "https://api.mypurecloud.com"  # Adjust to your org's region


def select_calibration_candidates(
    access_token: str,
    session_focus: str,  # e.g., "empathy_handling", "escalation_process", "compliance_phrases"
    target_count: int = 5
) -> list:
    """
    Selects recorded voice interactions from the last two weeks, spread across agents.
    session_focus is accepted for labeling only; this function does not yet filter on it,
    so edge cases and rubric gray areas still need a manual review pass.
    """
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=14)  # Last 2 weeks
    time_format = "%Y-%m-%dT%H:%M:%S.000Z"

    # Query voice interactions; recordings are checked after the query returns.
    # Only the first page of results is read.
    query = {
        "interval": f"{start_time.strftime(time_format)}/{end_time.strftime(time_format)}",
        "order": "asc",
        "orderBy": "conversationStart",
        "paging": {"pageSize": 100, "pageNumber": 1},  # Detail queries cap pageSize at 100
        "segmentFilters": [
            {
                "type": "and",
                "predicates": [
                    {"type": "dimension", "dimension": "mediaType", "value": "voice"},
                    # Analytics durations are in milliseconds: 180000 ms = 3 min
                    {"type": "metric", "metric": "tTalkComplete", "range": {"gte": 180000}}
                ]
            }
        ]
    }
    resp = requests.post(
        f"{GENESYS_API}/api/v2/analytics/conversations/details/query",
        headers={"Authorization": f"Bearer {access_token}", "Content-Type": "application/json"},
        json=query
    )
    resp.raise_for_status()
    conversations = resp.json().get("conversations", [])

    # Keep conversations where at least one session was recorded.
    # Sessions carry a boolean "recording" flag; verify against your own response payload.
    with_recordings = [
        c for c in conversations
        if any(s.get("recording") for p in c.get("participants", [])
               for s in p.get("sessions", []))
    ]

    # First pass: take at most one interaction per agent to spread the sample
    agents_selected = set()
    candidates = []
    for conv in with_recordings:
        agent_id = next(
            (p.get("userId") for p in conv.get("participants", []) if p.get("purpose") == "agent"),
            None
        )
        if agent_id and agent_id not in agents_selected and len(candidates) < target_count:
            candidates.append(conv)
            agents_selected.add(agent_id)

    # Second pass: fill remaining slots randomly from the rest of the pool
    remaining_pool = [c for c in with_recordings if c not in candidates]
    candidates.extend(random.sample(remaining_pool, min(target_count - len(candidates), len(remaining_pool))))
    return candidates[:target_count]
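The selection code spreads candidates across agents but does not stratify by handle time. One way to add that is sketched below, assuming each candidate has already been reduced to a small dict with the hypothetical keys `conversation_id` and `talk_seconds`. The bucket boundaries of 5 and 10 minutes are placeholders to tune for your own call mix.

```python
import random
from collections import defaultdict


def stratify_by_duration(candidates, target_count=5, bounds=(300, 600), seed=None):
    """Pick up to target_count items, cycling through short/medium/long buckets."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for c in candidates:
        secs = c["talk_seconds"]
        if secs < bounds[0]:
            key = "short"
        elif secs < bounds[1]:
            key = "medium"
        else:
            key = "long"
        buckets[key].append(c)

    # Shuffle within each bucket so repeated runs do not always pick the same calls
    for pool in buckets.values():
        rng.shuffle(pool)

    # Round-robin across buckets until the target is met or every bucket is empty
    order = ["short", "medium", "long"]
    picked = []
    while len(picked) < target_count and any(buckets[k] for k in order):
        for key in order:
            if buckets[key] and len(picked) < target_count:
                picked.append(buckets[key].pop())
    return picked
```

Round-robin picking keeps the sample balanced even when one duration bucket dominates the pool, which is common when most calls are short.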
3. Creating Calibrations in Genesys Cloud
Use the Calibrations API to create a calibration session and assign evaluators:
def create_calibration_session(
    access_token: str,
    conversation_id: str,
    evaluator_ids: list,
    calibrator_id: str,  # The "calibrator" is the authoritative evaluator who sets the target score
    evaluation_form_id: str,
    session_name: str
) -> dict:
    """
    Creates a Genesys Cloud calibration session for a specific interaction.
    """
    payload = {
        "name": session_name,  # Previously accepted but never sent
        "conversation": {"id": conversation_id},
        "evaluationForm": {"id": evaluation_form_id},
        "calibrator": {"id": calibrator_id},
        "evaluators": [{"id": eid} for eid in evaluator_ids],
        "expertEvaluator": {"id": calibrator_id}  # Expert sets the reference score
    }
    resp = requests.post(
        f"{GENESYS_API}/api/v2/quality/calibrations",
        headers={"Authorization": f"Bearer {access_token}", "Content-Type": "application/json"},
        json=payload
    )
    resp.raise_for_status()  # Fail loudly instead of printing a calibration with no id
    calibration = resp.json()
    print(f"✓ Calibration created: {calibration.get('id')} for conversation {conversation_id}")
    return calibration
Create one calibration per selected interaction to set up an entire monthly session:
def setup_monthly_calibration(access_token: str, evaluator_ids: list, calibrator_id: str, form_id: str):
    candidates = select_calibration_candidates(access_token, session_focus="general", target_count=5)
    calibrations = []
    for i, conv in enumerate(candidates):
        cal = create_calibration_session(
            access_token=access_token,
            conversation_id=conv["conversationId"],
            evaluator_ids=evaluator_ids,
            calibrator_id=calibrator_id,
            evaluation_form_id=form_id,
            session_name=f"Monthly Calibration #{i+1} - {datetime.utcnow().strftime('%B %Y')}"
        )
        calibrations.append(cal)
    print(f"✓ {len(calibrations)} calibration sessions created.")
    return calibrations
4. Scoring Variance Analysis
After evaluators complete their independent scores, analyze the variance before the calibration discussion:
import statistics


def analyze_calibration_variance(access_token: str, calibration_id: str) -> dict:
    """
    Retrieves all evaluator scores and computes variance metrics.
    """
    resp = requests.get(
        f"{GENESYS_API}/api/v2/quality/calibrations/{calibration_id}",
        headers={"Authorization": f"Bearer {access_token}"}
    )
    resp.raise_for_status()
    calibration = resp.json()
    evaluations = calibration.get("evaluations", [])

    def total_score(evaluation: dict):
        # Scores normally sit under answers.totalScore; fall back to a top-level field
        answers = evaluation.get("answers") or {}
        return answers.get("totalScore", evaluation.get("totalScore"))

    # Compare status case-insensitively; skip evaluations that carry no score
    finished = [e for e in evaluations if str(e.get("status", "")).upper() == "FINISHED"]
    scores = [s for s in map(total_score, finished) if s is not None]
    if len(scores) < 2:
        return {"status": "insufficient_evaluations", "scores": scores}

    expert_id = (calibration.get("expertEvaluator") or {}).get("id")
    return {
        "calibration_id": calibration_id,
        "evaluator_count": len(scores),
        "scores": scores,
        "mean": round(statistics.mean(scores), 2),
        "std_dev": round(statistics.stdev(scores), 2),
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "variance_flag": (max(scores) - min(scores)) > 15,  # Flag if range > 15 points
        "expert_score": next(
            (total_score(e) for e in evaluations
             if expert_id and (e.get("evaluator") or {}).get("id") == expert_id),
            None
        )
    }
Calibration Discussion Trigger:
- Range ≤ 5 points → Excellent alignment. Brief discussion to confirm interpretation. No rubric update needed.
- Range above 5 and up to 15 points → Discuss gray-area criteria. Document agreed interpretation in rubric notes.
- Range > 15 points → Full calibration session required. Rubric criterion likely needs rewriting for clarity.
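The three thresholds above can be encoded as a small helper so that every session is triaged the same way. This is a sketch only: the function name and the tier labels are illustrative, and scores between 5 and 6 are treated as belonging to the middle tier.

```python
def discussion_tier(score_range: float) -> str:
    """Map a calibration score range (max - min) to a discussion tier."""
    if score_range < 0:
        raise ValueError("score_range must be non-negative")
    if score_range <= 5:
        return "confirm_interpretation"      # Brief discussion, no rubric update
    if score_range <= 15:
        return "discuss_and_document"        # Record the agreed interpretation in rubric notes
    return "full_calibration_session"        # Criterion likely needs rewriting


for observed_range in (3, 12, 22):
    print(observed_range, "->", discussion_tier(observed_range))
```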
Validation, Edge Cases & Troubleshooting
Edge Case 1: Evaluators Discuss Scores Before Completing Independent Evaluations
The QM team uses a group chat. An evaluator posts “I gave this call a 78” before others have submitted. This anchoring bias invalidates the calibration’s independence requirement.
Solution: Enforce blind scoring. Assign calibrations with a deadline, and configure the calibration form to hide other evaluators’ scores until all assigned evaluators have submitted. Use Genesys Cloud’s built-in calibration status tracking to confirm all evaluations are “Finished” before revealing scores.
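A reveal gate can also be checked in code before any scores are shared. The sketch below assumes the calibration has already been fetched as a dict containing an `evaluations` list, where each entry has a `status` and an `evaluator.id`. That shape is an assumption to verify against your own API responses, and `ready_to_reveal` is a hypothetical helper name.

```python
def ready_to_reveal(calibration: dict, expected_evaluator_ids: set) -> bool:
    """True only when every expected evaluator has a finished evaluation."""
    finished_ids = {
        (e.get("evaluator") or {}).get("id")
        for e in calibration.get("evaluations", [])
        if str(e.get("status", "")).upper() == "FINISHED"
    }
    # An empty panel is treated as not ready rather than trivially complete
    return bool(expected_evaluator_ids) and expected_evaluator_ids <= finished_ids


example = {
    "evaluations": [
        {"status": "FINISHED", "evaluator": {"id": "eval-a"}},
        {"status": "PENDING", "evaluator": {"id": "eval-b"}},
    ]
}
print(ready_to_reveal(example, {"eval-a", "eval-b"}))
```

Comparing against the expected panel, rather than checking that all existing evaluations are finished, also catches evaluators who never started.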
Edge Case 2: Calibration Session Has High Variance on the Same Criterion Every Month
The “Product Knowledge” criterion consistently produces 15+ point variance because evaluators have different definitions of what constitutes “good” product knowledge.
Solution: This is a rubric quality issue, not an evaluator issue. Rewrite the criterion with behavioral anchors: instead of “Demonstrates strong product knowledge,” use “Agent correctly identifies the customer’s product model and provides accurate troubleshooting steps specific to that model without referring to documentation more than once.” Behavioral anchors reduce subjective interpretation.
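Recurring problem criteria are easier to spot when the per-criterion score range is tracked across sessions. The sketch below assumes you keep a simple history mapping each criterion name to its range in each session, oldest first. That data structure and the function name are hypothetical, and the 15-point threshold mirrors the figure used in this guide.

```python
def recurring_high_variance(history, threshold=15, min_sessions=3):
    """Return criteria whose range met the threshold in each of the last min_sessions sessions."""
    flagged = []
    for criterion, ranges in history.items():
        recent = ranges[-min_sessions:]
        # Require a full window so a single bad session does not trigger a rewrite
        if len(recent) == min_sessions and all(r >= threshold for r in recent):
            flagged.append(criterion)
    return sorted(flagged)


history = {
    "Product Knowledge": [16, 18, 15, 20],
    "Greeting": [4, 17, 3, 5],
    "Empathy": [16, 19],
}
print(recurring_high_variance(history))
```

Criteria returned by this check are candidates for rewriting with behavioral anchors. Criteria that spike once are better handled in the regular calibration discussion.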
Edge Case 3: Calibrator (Expert Evaluator) Consistently Scores Higher Than Team
The designated calibrator’s target scores are systematically 10-12 points above the team average. Over time, agents measured against the calibrator’s target scores appear to underperform.
Solution: Rotate the calibrator role. Use a calibration-of-calibrators: quarterly, have the full QM leadership team (including external benchmarking if possible) evaluate the same interaction and compare to recent calibrator scores. This prevents a single evaluator from becoming the de facto performance standard without challenge.
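Before rotating the role, it helps to confirm that the calibrator's offset is systematic and not noise from one or two sessions. The sketch below assumes each past session is summarized as a pair of the expert score and the list of other evaluators' scores. The function names, the 10-point threshold, and the three-session minimum are illustrative choices.

```python
from statistics import mean


def calibrator_offsets(sessions):
    """Expert score minus team mean, one value per session that has team scores."""
    return [expert - mean(team) for expert, team in sessions if team]


def needs_calibrator_review(sessions, threshold=10, min_sessions=3):
    """Flag when the expert is off in the same direction every session by a large average."""
    offsets = calibrator_offsets(sessions)
    if len(offsets) < min_sessions:
        return False
    same_direction = all(o > 0 for o in offsets) or all(o < 0 for o in offsets)
    return same_direction and abs(mean(offsets)) >= threshold


sessions = [
    (90, [80, 78]),
    (88, [76, 78]),
    (92, [80, 82]),
]
print(calibrator_offsets(sessions))
print(needs_calibrator_review(sessions))
```

Requiring the same direction in every session separates a consistently lenient or strict calibrator from one who simply disagrees with the team on individual calls.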