Implementing Knowledge Gap Analysis Pipelines Using Unresolved Interaction Transcript Mining

Implementing Knowledge Gap Analysis Pipelines Using Unresolved Interaction Transcript Mining

What This Guide Covers

You are building an automated pipeline that mines unresolved interaction transcripts - conversations where agents couldn’t find an answer, escalated to a supervisor, or where the customer’s issue remained open - and identifies the specific knowledge gaps in your knowledge base. When complete, your system will nightly process all interactions marked with “unresolved” or “escalated” wrap-up codes, extract the customer’s core questions using NLP entity/intent extraction, compare those questions against your existing knowledge base articles using semantic similarity search, identify questions with no matching article (true gaps), and generate a prioritized “Knowledge Gap Report” showing exactly which articles need to be written, ranked by frequency of the unanswered question.


Prerequisites, Roles & Licensing

  • Genesys Cloud: CX 3 with Speech and Text Analytics or the Transcription add-on.
  • Infrastructure:
    • Access to the Analytics API for conversation transcripts
    • A vector database (Qdrant, Pinecone, or pgvector) indexing your existing knowledge base
    • An NLP pipeline for question extraction (spaCy + custom rules, or LLM-based)
    • A reporting destination (Snowflake, BigQuery, or a simple PostgreSQL table)

The Implementation Deep-Dive

1. The Knowledge Gap Detection Architecture

[Genesys Cloud: Unresolved Interactions (wrap-up = "escalated" or "unresolved")]
    |
    v (Analytics API: nightly batch query)
[Extract Transcripts: customer utterances only]
    |
    v
[NLP: Extract customer questions and core intent phrases]
    |
    v
[Vector Search: Compare each question against knowledge base embeddings]
    |
    |-- Match score > 0.80 → COVERED (article exists, agent may need training)
    |-- Match score 0.50-0.80 → PARTIAL (related article exists, may need expansion)
    |-- Match score < 0.50 → GAP (no relevant article - needs authoring)
    |
    v
[Aggregate: Group GAPs by topic, count frequency]
    |
    v
[Knowledge Gap Report → ranked by frequency × impact]

2. Extracting Unresolved Interaction Transcripts

import requests
from datetime import datetime, timedelta

GENESYS_API = "https://api.mypurecloud.com"

def get_unresolved_transcripts(access_token: str, days_back: int = 1) -> list[dict]:
    """
    Retrieves transcripts from interactions marked with escalation/unresolved wrap-up codes.
    """
    end = datetime.utcnow()
    start = end - timedelta(days=days_back)
    
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json"
    }
    
    # Query conversations with specific wrap-up codes indicating unresolved
    query = {
        "interval": f"{start.strftime('%Y-%m-%dT%H:%M:%S.000Z')}/{end.strftime('%Y-%m-%dT%H:%M:%S.000Z')}",
        "order": "asc",
        "orderBy": "conversationStart",
        "segmentFilters": [
            {
                "type": "or",
                "predicates": [
                    {"type": "dimension", "dimension": "wrapUpCode", "value": "Escalated"},
                    {"type": "dimension", "dimension": "wrapUpCode", "value": "Unresolved"},
                    {"type": "dimension", "dimension": "wrapUpCode", "value": "KnowledgeGap"}
                ]
            }
        ]
    }
    
    resp = requests.post(
        f"{GENESYS_API}/api/v2/analytics/conversations/details/query",
        headers=headers, json=query
    )
    conversations = resp.json().get("conversations", [])
    
    results = []
    for conv in conversations:
        conv_id = conv["conversationId"]
        
        # Retrieve the transcript
        transcript_resp = requests.get(
            f"{GENESYS_API}/api/v2/conversations/{conv_id}/transcripts",
            headers=headers
        )
        
        if transcript_resp.status_code == 200:
            transcript_data = transcript_resp.json()
            
            # Extract only customer utterances
            customer_utterances = []
            for phrase in transcript_data.get("communicationTranscripts", [{}])[0].get("phrases", []):
                if phrase.get("participantPurpose") == "customer":
                    customer_utterances.append(phrase.get("text", ""))
            
            results.append({
                "conversation_id": conv_id,
                "customer_text": " ".join(customer_utterances),
                "queue": conv.get("participants", [{}])[0].get("sessions", [{}])[0].get("metrics", [{}])[0].get("emitDate"),
                "wrap_up": "Escalated"
            })
    
    return results

3. Question Extraction Using NLP

import spacy
import re

nlp = spacy.load("en_core_web_sm")

QUESTION_PATTERNS = [
    r"how (?:do|can|should|would) (?:I|we|you)\b",
    r"what (?:is|are|does|do|happens)\b",
    r"where (?:is|can|do)\b",
    r"why (?:is|does|did|can't|won't)\b",
    r"is (?:there|it|this)\b",
    r"can (?:I|you|we|someone)\b",
    r"do (?:you|I|we) (?:have|know|offer|support)\b",
]

def extract_questions(customer_text: str) -> list[str]:
    """
    Extracts question-like utterances from customer transcript text.
    Uses both pattern matching and sentence-level classification.
    """
    doc = nlp(customer_text)
    questions = []
    
    for sent in doc.sents:
        text = sent.text.strip()
        
        # Direct question detection
        if text.endswith("?"):
            questions.append(text)
            continue
        
        # Pattern-based question detection (customers often phrase questions as statements)
        for pattern in QUESTION_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                questions.append(text)
                break
    
    # Deduplicate near-identical questions
    unique = []
    for q in questions:
        if not any(similar(q, existing) > 0.85 for existing in unique):
            unique.append(q)
    
    return unique

def similar(a: str, b: str) -> float:
    """Simple Jaccard similarity for deduplication."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

4. Knowledge Base Gap Scoring via Vector Search

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
qdrant = QdrantClient(url="http://localhost:6333")

KB_COLLECTION = "knowledge_base_articles"

def classify_knowledge_coverage(questions: list[str]) -> list[dict]:
    """
    For each extracted question, searches the knowledge base and classifies coverage.
    """
    results = []
    
    for question in questions:
        embedding = embedding_model.encode(question).tolist()
        
        search_results = qdrant.search(
            collection_name=KB_COLLECTION,
            query_vector=embedding,
            limit=3
        )
        
        top_score = search_results[0].score if search_results else 0.0
        top_article = search_results[0].payload.get("title", "N/A") if search_results else "N/A"
        
        if top_score > 0.80:
            coverage = "COVERED"
        elif top_score > 0.50:
            coverage = "PARTIAL"
        else:
            coverage = "GAP"
        
        results.append({
            "question": question,
            "coverage": coverage,
            "best_match_score": round(top_score, 3),
            "best_match_article": top_article,
        })
    
    return results

def generate_gap_report(days_back: int = 7) -> list[dict]:
    """
    Full pipeline: extract transcripts → extract questions → classify → aggregate gaps.
    """
    token = get_genesys_token()
    transcripts = get_unresolved_transcripts(token, days_back=days_back)
    
    all_gaps = []
    for transcript in transcripts:
        questions = extract_questions(transcript["customer_text"])
        classified = classify_knowledge_coverage(questions)
        
        for item in classified:
            if item["coverage"] in ("GAP", "PARTIAL"):
                item["conversation_id"] = transcript["conversation_id"]
                all_gaps.append(item)
    
    # Aggregate: group similar gaps and count frequency
    gap_clusters = cluster_gaps(all_gaps)
    
    # Sort by frequency (most-asked unanswered questions first)
    gap_clusters.sort(key=lambda x: x["frequency"], reverse=True)
    
    return gap_clusters

def cluster_gaps(gaps: list[dict]) -> list[dict]:
    """Groups similar gap questions and counts frequency."""
    clusters = []
    
    for gap in gaps:
        merged = False
        for cluster in clusters:
            if similar(gap["question"], cluster["representative_question"]) > 0.60:
                cluster["frequency"] += 1
                cluster["conversation_ids"].append(gap["conversation_id"])
                merged = True
                break
        
        if not merged:
            clusters.append({
                "representative_question": gap["question"],
                "coverage": gap["coverage"],
                "best_match_score": gap["best_match_score"],
                "frequency": 1,
                "conversation_ids": [gap["conversation_id"]]
            })
    
    return clusters

Validation, Edge Cases & Troubleshooting

Edge Case 1: Customer Rambles Without Asking a Clear Question

Many unresolved interactions contain long customer monologues that describe a problem but never ask a direct question. The question extractor finds nothing.
Solution: Add a fallback: if no explicit questions are detected, extract the topic of the customer’s text using a keyword extraction model (KeyBERT or TF-IDF). Compare the topic embedding against the knowledge base. This captures “I’m having trouble with my billing statement” even though it’s not phrased as a question.

Edge Case 2: Knowledge Article Exists But Agent Didn’t Find It

The gap report flags “How do I reset my password?” as a GAP, but you know the article exists - the agent just didn’t search for it.
Solution: When coverage = COVERED (score > 0.80), still track these as “agent discovery failures” in a separate report. This triggers a training intervention (improve agent search habits) rather than a content authoring task. The gap pipeline should distinguish between “content doesn’t exist” and “content exists but wasn’t used.”

Edge Case 3: Seasonal Spikes Create False Knowledge Gaps

During tax season, your financial services contact center gets flooded with “How do I download my 1099?” questions. The gap pipeline flags this as the #1 knowledge gap every January, even though a seasonal article already exists but wasn’t re-published.
Solution: Add a “seasonal tag” to knowledge articles. During gap analysis, include articles tagged with the current season/quarter even if they were archived. Implement an automated article re-activation schedule that publishes seasonal content 2 weeks before the expected spike.

Official References