Implementing Knowledge Gap Analysis Pipelines Using Unresolved Interaction Transcript Mining
What This Guide Covers
You are building an automated pipeline that mines unresolved interaction transcripts (conversations where the agent couldn’t find an answer, escalated to a supervisor, or left the customer’s issue open) and identifies the specific gaps in your knowledge base. When complete, the pipeline runs nightly over every interaction marked with an “unresolved” or “escalated” wrap-up code. It extracts the customer’s core questions using NLP question extraction, compares them against your existing knowledge base articles using semantic similarity search, and flags questions with no matching article as true gaps. The output is a prioritized “Knowledge Gap Report” showing exactly which articles need to be written, ranked by how often the unanswered question was asked.
Prerequisites, Roles & Licensing
- Genesys Cloud: CX 3 with Speech and Text Analytics or the Transcription add-on.
- Infrastructure:
  - Access to the Analytics API for conversation transcripts
  - A vector database (Qdrant, Pinecone, or pgvector) indexing your existing knowledge base
  - An NLP pipeline for question extraction (spaCy + custom rules, or LLM-based)
  - A reporting destination (Snowflake, BigQuery, or a simple PostgreSQL table)
The Implementation Deep-Dive
1. The Knowledge Gap Detection Architecture
[Genesys Cloud: Unresolved Interactions (wrap-up = "escalated" or "unresolved")]
        |
        v  (Analytics API: nightly batch query)
[Extract Transcripts: customer utterances only]
        |
        v
[NLP: Extract customer questions and core intent phrases]
        |
        v
[Vector Search: Compare each question against knowledge base embeddings]
        |
        |-- Match score > 0.80    → COVERED (article exists, agent may need training)
        |-- Match score 0.50-0.80 → PARTIAL (related article exists, may need expansion)
        |-- Match score < 0.50    → GAP (no relevant article, needs authoring)
        |
        v
[Aggregate: Group GAPs by topic, count frequency]
        |
        v
[Knowledge Gap Report → ranked by frequency × impact]
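The last stage of the diagram ranks by frequency × impact, but the guide does not define "impact". A minimal sketch of one possible reading, assuming cluster dicts that carry `frequency` and `coverage` keys; the `IMPACT_WEIGHTS` values and both function names are hypothetical and should be tuned to your own cost model.

```python
# Hypothetical impact weights: a true GAP needs a new article, while a
# PARTIAL match only needs an existing article to be expanded.
IMPACT_WEIGHTS = {"GAP": 1.0, "PARTIAL": 0.5}


def priority_score(frequency: int, coverage: str) -> float:
    """Frequency multiplied by an assumed impact weight for the coverage class."""
    return frequency * IMPACT_WEIGHTS.get(coverage, 0.0)


def rank_clusters(clusters: list[dict]) -> list[dict]:
    """Return clusters sorted by descending priority score."""
    return sorted(
        clusters,
        key=lambda c: priority_score(c["frequency"], c["coverage"]),
        reverse=True,
    )
```

With these weights, a GAP asked 6 times (score 6.0) outranks a PARTIAL asked 10 times (score 5.0). Whether that trade-off is right depends on how expensive authoring is relative to expanding an article.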
2. Extracting Unresolved Interaction Transcripts
import requests
from datetime import datetime, timedelta, timezone

GENESYS_API = "https://api.mypurecloud.com"  # adjust to your org's region
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"


def get_unresolved_transcripts(access_token: str, days_back: int = 1) -> list[dict]:
    """
    Retrieves transcripts from interactions marked with escalation/unresolved wrap-up codes.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(days=days_back)
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json"
    }
    # Query conversations with specific wrap-up codes indicating unresolved.
    # NOTE: the wrapUpCode dimension typically expects wrap-up code IDs rather
    # than display names; replace these placeholder values with your org's IDs.
    query = {
        "interval": f"{start.strftime(TIME_FORMAT)}/{end.strftime(TIME_FORMAT)}",
        "order": "asc",
        "orderBy": "conversationStart",
        "segmentFilters": [
            {
                "type": "or",
                "predicates": [
                    {"type": "dimension", "dimension": "wrapUpCode", "value": "Escalated"},
                    {"type": "dimension", "dimension": "wrapUpCode", "value": "Unresolved"},
                    {"type": "dimension", "dimension": "wrapUpCode", "value": "KnowledgeGap"}
                ]
            }
        ]
    }
    resp = requests.post(
        f"{GENESYS_API}/api/v2/analytics/conversations/details/query",
        headers=headers, json=query, timeout=30
    )
    resp.raise_for_status()
    # "conversations" may be missing or null when nothing matches
    conversations = resp.json().get("conversations") or []
    results = []
    for conv in conversations:
        conv_id = conv["conversationId"]
        # Queue and wrap-up code are recorded on the session segments
        segments = [
            seg
            for participant in conv.get("participants", [])
            for session in participant.get("sessions", [])
            for seg in session.get("segments", [])
        ]
        queue_id = next((s["queueId"] for s in segments if s.get("queueId")), None)
        wrap_up = next((s["wrapUpCode"] for s in segments if s.get("wrapUpCode")), None)
        # Retrieve the transcript (verify this path and the response shape
        # against the API documentation for your org before relying on them)
        transcript_resp = requests.get(
            f"{GENESYS_API}/api/v2/conversations/{conv_id}/transcripts",
            headers=headers, timeout=30
        )
        if transcript_resp.status_code != 200:
            continue  # no transcript available for this conversation
        transcript_data = transcript_resp.json()
        # Extract only customer utterances; guard against an empty transcript list
        communications = transcript_data.get("communicationTranscripts") or [{}]
        customer_utterances = [
            phrase.get("text", "")
            for phrase in communications[0].get("phrases", [])
            if phrase.get("participantPurpose") == "customer"
        ]
        results.append({
            "conversation_id": conv_id,
            "customer_text": " ".join(customer_utterances),
            "queue": queue_id,
            "wrap_up": wrap_up  # first wrap-up code found on the conversation
        })
    return results
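The details query in this section is sent once, so it returns only a single page of conversations. A nightly batch usually needs to walk every page. The sketch below assumes the query body accepts a `paging` object with `pageSize` and `pageNumber`; `iter_conversation_pages` and the injected `fetch_page` callable are hypothetical names, with the callable used so the loop can be exercised without network access.

```python
from typing import Callable, Iterator


def iter_conversation_pages(
    base_query: dict,
    fetch_page: Callable[[dict], dict],
    page_size: int = 100,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield conversations page by page until a short or empty page arrives."""
    for page_number in range(1, max_pages + 1):
        # Build a new dict so the caller's query is not mutated.
        query = {
            **base_query,
            "paging": {"pageSize": page_size, "pageNumber": page_number},
        }
        conversations = fetch_page(query).get("conversations") or []
        yield from conversations
        if len(conversations) < page_size:
            break  # last page reached
```

In the function above, `fetch_page` could be a small wrapper that posts the query with `requests.post(...)` and returns `resp.json()`. The `max_pages` cap is a safety stop, not an API limit.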
3. Question Extraction Using NLP
import re

import spacy

nlp = spacy.load("en_core_web_sm")

# Leading \b keeps a pattern from matching inside a longer word
# (for example "is there" inside "this there").
QUESTION_PATTERNS = [
    r"\bhow (?:do|can|should|would) (?:I|we|you)\b",
    r"\bwhat (?:is|are|does|do|happens)\b",
    r"\bwhere (?:is|can|do)\b",
    r"\bwhy (?:is|does|did|can't|won't)\b",
    r"\bis (?:there|it|this)\b",
    r"\bcan (?:I|you|we|someone)\b",
    r"\bdo (?:you|I|we) (?:have|know|offer|support)\b",
]


def extract_questions(customer_text: str) -> list[str]:
    """
    Extracts question-like utterances from customer transcript text.
    Uses spaCy sentence segmentation, then question-mark and pattern matching.
    """
    doc = nlp(customer_text)
    questions = []
    for sent in doc.sents:
        text = sent.text.strip()
        # Direct question detection
        if text.endswith("?"):
            questions.append(text)
            continue
        # Pattern-based question detection (customers often phrase questions as statements)
        for pattern in QUESTION_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                questions.append(text)
                break
    # Deduplicate near-identical questions
    unique = []
    for q in questions:
        if not any(similar(q, existing) > 0.85 for existing in unique):
            unique.append(q)
    return unique


def similar(a: str, b: str) -> float:
    """Simple Jaccard similarity over whitespace-separated tokens, for deduplication."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)
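To make the 0.85 deduplication threshold concrete, here is a small worked example of the Jaccard score. It repeats `similar` so the snippet runs on its own; `similar_normalized` is a hypothetical variant, not part of the original pipeline, that strips punctuation before comparing.

```python
import re


def similar(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def similar_normalized(a: str, b: str) -> float:
    """Hypothetical variant: tokenizes on letters/digits so 'password?' == 'password'."""
    set_a = set(re.findall(r"[a-z0-9']+", a.lower()))
    set_b = set(re.findall(r"[a-z0-9']+", b.lower()))
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# 5 shared tokens out of 7 distinct tokens
print(round(similar("how do I reset my password", "how can I reset my password"), 3))  # 0.714

# Trailing punctuation makes "password?" and "password" different tokens
print(similar("reset my password?", "reset my password"))             # 0.5
print(similar_normalized("reset my password?", "reset my password"))  # 1.0
```

A score of 0.714 is below the 0.85 deduplication threshold, so both phrasings are kept within one transcript, but it is above the 0.60 threshold used later for clustering, so they are counted together in the report.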
4. Knowledge Base Gap Scoring via Vector Search
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
qdrant = QdrantClient(url="http://localhost:6333")
KB_COLLECTION = "knowledge_base_articles"


def classify_knowledge_coverage(questions: list[str]) -> list[dict]:
    """
    For each extracted question, searches the knowledge base and classifies coverage.
    The thresholds assume the collection scores by cosine similarity (higher = closer).
    """
    results = []
    for question in questions:
        embedding = embedding_model.encode(question).tolist()
        # Newer qdrant-client releases may warn that search() is deprecated
        # in favor of query_points(); check the version you have installed.
        search_results = qdrant.search(
            collection_name=KB_COLLECTION,
            query_vector=embedding,
            limit=3
        )
        top = search_results[0] if search_results else None
        top_score = top.score if top else 0.0
        top_article = (top.payload or {}).get("title", "N/A") if top else "N/A"
        if top_score > 0.80:
            coverage = "COVERED"
        elif top_score >= 0.50:  # 0.50-0.80 inclusive, matching the diagram
            coverage = "PARTIAL"
        else:
            coverage = "GAP"
        results.append({
            "question": question,
            "coverage": coverage,
            "best_match_score": round(top_score, 3),
            "best_match_article": top_article,
        })
    return results
def generate_gap_report(days_back: int = 7) -> list[dict]:
    """
    Full pipeline: extract transcripts → extract questions → classify → aggregate gaps.
    """
    # get_genesys_token() is not defined in this guide; supply your own OAuth helper
    token = get_genesys_token()
    transcripts = get_unresolved_transcripts(token, days_back=days_back)
    all_gaps = []
    for transcript in transcripts:
        questions = extract_questions(transcript["customer_text"])
        classified = classify_knowledge_coverage(questions)
        for item in classified:
            # COVERED items are dropped here; only GAP and PARTIAL are reported
            if item["coverage"] in ("GAP", "PARTIAL"):
                item["conversation_id"] = transcript["conversation_id"]
                all_gaps.append(item)
    # Aggregate: group similar gaps and count frequency
    gap_clusters = cluster_gaps(all_gaps)
    # Sort by frequency (most-asked unanswered questions first)
    gap_clusters.sort(key=lambda x: x["frequency"], reverse=True)
    return gap_clusters


def cluster_gaps(gaps: list[dict]) -> list[dict]:
    """Groups similar gap questions and counts frequency."""
    clusters = []
    for gap in gaps:
        merged = False
        for cluster in clusters:
            # similar() is the Jaccard helper from section 3; the first question
            # in a cluster serves as its representative
            if similar(gap["question"], cluster["representative_question"]) > 0.60:
                cluster["frequency"] += 1
                cluster["conversation_ids"].append(gap["conversation_id"])
                merged = True
                break
        if not merged:
            clusters.append({
                "representative_question": gap["question"],
                "coverage": gap["coverage"],
                "best_match_score": gap["best_match_score"],
                "frequency": 1,
                "conversation_ids": [gap["conversation_id"]]
            })
    return clusters
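`generate_gap_report` calls `get_genesys_token()`, which the guide never defines. The following is a minimal sketch, assuming an OAuth client-credentials grant against the login host for the same region as `GENESYS_API`. The `build_token_request` helper and the `GENESYS_CLIENT_ID` / `GENESYS_CLIENT_SECRET` environment variable names are hypothetical, and it uses only the standard library so it has no extra dependencies.

```python
import base64
import json
import os
import urllib.parse
import urllib.request

# Assumption: login host for the region that serves api.mypurecloud.com
LOGIN_URL = "https://login.mypurecloud.com/oauth/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build a client-credentials token request (HTTP Basic auth, form-encoded body)."""
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        LOGIN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def get_genesys_token() -> str:
    """Fetch an access token using credentials read from environment variables."""
    request = build_token_request(
        os.environ["GENESYS_CLIENT_ID"], os.environ["GENESYS_CLIENT_SECRET"]
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

Tokens expire, so a long-running job may want to cache the token and refresh it shortly before expiry instead of requesting a new one on every run.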
Validation, Edge Cases & Troubleshooting
Edge Case 1: Customer Rambles Without Asking a Clear Question
Many unresolved interactions contain long customer monologues that describe a problem but never ask a direct question. The question extractor finds nothing.
Solution: Add a fallback. If no explicit questions are detected, extract the topic of the customer’s text using a keyword extraction model (KeyBERT or TF-IDF) and compare the topic embedding against the knowledge base. This captures “I’m having trouble with my billing statement” even though it is not phrased as a question.
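The fallback can be sketched with a small standard-library TF-IDF ranking. This is a simplified stand-in for KeyBERT or a library TF-IDF vectorizer, not a replacement for them; the stopword list is deliberately tiny, and `tokenize`, `top_keywords`, and `questions_or_topic` are hypothetical names.

```python
import math
import re
from collections import Counter

# Deliberately small stopword list; extend it for real transcripts.
STOPWORDS = {
    "the", "and", "with", "for", "that", "this", "have", "having", "has",
    "was", "are", "but", "not", "you", "your", "i'm", "it's", "again",
}


def tokenize(text: str) -> list[str]:
    """Lowercase, split into words, and drop stopwords and very short tokens."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]


def top_keywords(target: str, corpus: list[str], k: int = 3) -> list[str]:
    """Return the k highest TF-IDF terms of `target`, in order of appearance."""
    doc_terms = [set(tokenize(doc)) for doc in corpus]
    tokens = tokenize(target)
    term_counts = Counter(tokens)

    def tfidf(term: str) -> float:
        doc_freq = sum(1 for terms in doc_terms if term in terms)
        idf = math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0  # smoothed IDF
        return term_counts[term] * idf

    best = set(sorted(term_counts, key=lambda t: (-tfidf(t), t))[:k])
    return [t for t in dict.fromkeys(tokens) if t in best]


def questions_or_topic(questions: list[str], customer_text: str, corpus: list[str]) -> list[str]:
    """Use extracted questions when present, otherwise fall back to a topic phrase."""
    if questions:
        return questions
    keywords = top_keywords(customer_text, corpus)
    return [" ".join(keywords)] if keywords else []
```

The returned topic phrase can then be passed to the coverage classifier in place of a question. The corpus would typically be the night's batch of customer transcripts, so that terms common to every call score lower.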
Edge Case 2: Knowledge Article Exists But Agent Didn’t Find It
“How do I reset my password?” keeps appearing in unresolved transcripts, but you know the article exists. The content is there; the agent just didn’t find it.
Solution: When coverage = COVERED (score > 0.80), still track these questions as “agent discovery failures” in a separate report. Note that generate_gap_report keeps only GAP and PARTIAL items, so COVERED questions are dropped unless you capture them explicitly. This report triggers a training intervention (improve agent search habits) rather than a content authoring task. The gap pipeline should distinguish between “content doesn’t exist” and “content exists but wasn’t used.”
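One way to keep the two outcomes apart is to route each classified question by its coverage label. A minimal sketch, assuming items shaped like the output of `classify_knowledge_coverage`; the report names and the `split_reports` function are hypothetical.

```python
# Hypothetical report names keyed by the coverage label.
ROUTING = {
    "GAP": "content_gaps",                  # article needs to be written
    "PARTIAL": "content_expansions",        # related article needs expanding
    "COVERED": "agent_discovery_failures",  # article exists; training issue
}


def split_reports(classified: list[dict]) -> dict[str, list[dict]]:
    """Group classified questions into one list per report."""
    reports: dict[str, list[dict]] = {name: [] for name in ROUTING.values()}
    for item in classified:
        reports[ROUTING[item["coverage"]]].append(item)
    return reports
```

The authoring team would then consume `content_gaps` and `content_expansions`, while `agent_discovery_failures` goes to whoever owns agent training.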
Edge Case 3: Seasonal Spikes Create False Knowledge Gaps
During tax season, your financial services contact center gets flooded with “How do I download my 1099?” questions. The gap pipeline flags this as the #1 knowledge gap every January, even though a seasonal article already exists and simply wasn’t re-published.
Solution: Add a “seasonal tag” to knowledge articles. During gap analysis, include articles tagged with the current season/quarter even if they were archived. Implement an automated article re-activation schedule that publishes seasonal content 2 weeks before the expected spike.
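The seasonal check can be sketched with plain date arithmetic. The article fields `status`, `season_start`, and `season_end` (the latter two as `(month, day)` tuples) are hypothetical, as are the function names. The sketch assumes season boundaries never fall on February 29.

```python
from datetime import date, timedelta

LEAD_TIME = timedelta(weeks=2)  # re-activate two weeks before the expected spike


def is_in_season(article: dict, today: date) -> bool:
    """True if today lies between (season start - lead time) and season end."""
    start_month, start_day = article["season_start"]
    end_month, end_day = article["season_end"]
    # Check neighboring years so seasons that wrap past December 31 are handled.
    for year in (today.year - 1, today.year, today.year + 1):
        start = date(year, start_month, start_day)
        end = date(year, end_month, end_day)
        if end < start:
            end = date(year + 1, end_month, end_day)
        if start - LEAD_TIME <= today <= end:
            return True
    return False


def articles_for_gap_search(articles: list[dict], today: date) -> list[dict]:
    """Published articles plus archived seasonal ones whose window covers today."""
    return [
        a for a in articles
        if a["status"] == "published"
        or ("season_start" in a and is_in_season(a, today))
    ]
```

The same `is_in_season` check could drive the re-activation schedule: a daily job publishes any archived article for which it returns true.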