Designing Advanced RAG (Retrieval-Augmented Generation) Knowledge Architectures for Agent Assist

What This Guide Covers

This guide walks through building a production-grade Retrieval-Augmented Generation (RAG) system that surfaces contextually relevant knowledge articles to agents in real time, during an active conversation. It combines live transcript analysis with semantic search across your proprietary knowledge base, then generates a concise, grounded answer that cites the exact source articles. The target end state: agents receive a recommended response within 2 seconds of the customer completing a query, the response is grounded exclusively in your approved knowledge base (to minimize hallucination), and every recommendation includes a clickable source citation the agent can verify before sending.


Prerequisites, Roles & Licensing

  • Genesys Cloud: CX 3 with Knowledge and Agent Assist enabled; or CX 2 with a third-party Agent Assist integration via the Agent UI SDK
  • Permissions:
    • Knowledge > Knowledge > View
    • Conversations > Conversation > View (for transcript streaming)
  • RAG infrastructure: A vector database (Pinecone, Weaviate, Qdrant, or pgvector on RDS) + an LLM API (OpenAI GPT-4o, Anthropic Claude 3, or a self-hosted model)
  • Embedding model: OpenAI text-embedding-3-large, Cohere embed-english-v3.0, or a domain-fine-tuned sentence transformer

The Implementation Deep-Dive

1. Why Standard Knowledge Search Fails at Scale

Genesys Cloud's native Knowledge API uses keyword and BM25 search - highly effective for exact term matches, but poor at semantic similarity. When an agent types "customer says their internet keeps dropping," the keyword search returns articles containing "dropping" but misses articles that describe the same problem in different words: "intermittent connectivity issues," "network instability," and "connection resets." RAG with vector embeddings closes this gap: it retrieves articles based on meaning, not word overlap.
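To make the contrast concrete, the sketch below ranks three article titles against the same query two ways: by shared words, and by cosine similarity between vectors. The 3-dimensional vectors are hand-made stand-ins chosen for illustration, not output from a real embedding model, and the article titles are hypothetical.

```python
import math

def keyword_overlap(query: str, doc: str) -> int:
    """Count distinct lowercase words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

QUERY = "internet keeps dropping"
QUERY_VEC = [0.9, 0.1, 0.0]  # hypothetical embedding of the query

# Hypothetical article titles with hand-made toy embeddings
ARTICLES = {
    "Intermittent connectivity issues": [0.85, 0.2, 0.05],
    "Dropping a service from your plan": [0.1, 0.9, 0.1],
    "Updating billing address": [0.0, 0.2, 0.95],
}

by_keyword = max(ARTICLES, key=lambda t: keyword_overlap(QUERY, t))
by_vector = max(ARTICLES, key=lambda t: cosine(QUERY_VEC, ARTICLES[t]))

print("Keyword match:", by_keyword)
print("Vector match: ", by_vector)
```

With these toy values, word overlap favors the article that merely shares the word "dropping", while the vector comparison favors the connectivity article.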

The two RAG components:

Component     Purpose
Retrieval     Semantic search across the knowledge base - returns the top-K most relevant chunks
Generation    LLM synthesizes a concise, grounded answer using retrieved chunks as context

The critical architecture principle: the LLM only generates answers using retrieved content as context. It is instructed not to answer from its training data. This sharply reduces hallucination for domain-specific questions, although it does not remove it entirely (see Edge Case 1).


2. Knowledge Base Ingestion and Chunking

Before retrieval can work, your knowledge base articles must be chunked and embedded into a vector database:

import openai
from pinecone import Pinecone, ServerlessSpec
import hashlib
import json

client = openai.OpenAI()
pc = Pinecone(api_key="your-pinecone-key")

INDEX_NAME = "agent-assist-kb"
EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMS = 3072

def create_index_if_not_exists():
    if INDEX_NAME not in [i.name for i in pc.list_indexes()]:
        pc.create_index(
            name=INDEX_NAME,
            dimension=EMBEDDING_DIMS,
            metric="cosine",
            spec=ServerlessSpec(cloud="aws", region="us-east-1")
        )

import re

def chunk_article(article: dict, chunk_size: int = 400, overlap: int = 50) -> list[dict]:
    """
    Split a knowledge article into overlapping chunks for granular retrieval.
    Smaller chunks = more precise retrieval; larger chunks = more context per result.
    chunk_size and overlap are counted in whitespace-separated words, which only
    approximates model tokens. Around 400 is the recommended sweet spot for
    contact center QA-style articles.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    title = article.get("title", "")
    # Guard against a missing or empty "variations" list
    variations = article.get("variations") or [{}]
    content = variations[0].get("rawHtml", {}).get("content", "")

    # Strip HTML tags for embedding
    clean_content = re.sub(r'<[^>]+>', ' ', content)
    clean_content = re.sub(r'\s+', ' ', clean_content).strip()

    # Split on whitespace (words as an approximation of tokens)
    words = clean_content.split()
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk_words = words[i:i + chunk_size]
        chunk_text = f"Article: {title}\n\n{' '.join(chunk_words)}"
        chunk_id = hashlib.md5(f"{article['id']}:{i}".encode()).hexdigest()

        chunks.append({
            "id": chunk_id,
            "text": chunk_text,
            "metadata": {
                "article_id": article["id"],
                "article_title": title,
                "category_id": article.get("categoryId"),
                "chunk_index": i,
                "source_url": article.get("externalUrl", ""),
                "last_modified": article.get("dateModified", ""),
                # Stored with the vector so the query pipeline can hand the
                # chunk text to the LLM (it reads "text_preview" at query time)
                "text_preview": chunk_text
            }
        })

        # Stop once the last word is covered; a further window would only
        # repeat the overlap region of this chunk
        if i + chunk_size >= len(words):
            break

    return chunks

def embed_and_upsert_articles(articles: list[dict]):
    """Embed all articles and upsert into Pinecone."""
    index = pc.Index(INDEX_NAME)
    all_chunks = []
    
    for article in articles:
        chunks = chunk_article(article)
        all_chunks.extend(chunks)
    
    # Batch embed (OpenAI allows up to 2048 inputs per request)
    batch_size = 100
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        
        resp = client.embeddings.create(
            input=texts,
            model=EMBEDDING_MODEL
        )
        
        vectors = [
            (chunk["id"], resp.data[j].embedding, chunk["metadata"])
            for j, chunk in enumerate(batch)
        ]
        
        index.upsert(vectors=vectors)
        print(f"Upserted {i + len(batch)}/{len(all_chunks)} chunks.")
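
The overlap arithmetic is easier to see in isolation. The helper below is an illustrative sketch only (the name chunk_windows is hypothetical and not part of the ingestion code); it returns the word offsets each chunk would cover for a given article length, using the same chunk_size and overlap defaults. As a design choice of this sketch, it stops as soon as the final word is covered.

```python
def chunk_windows(n_words: int, chunk_size: int = 400, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) word offsets for overlapping chunks; end is exclusive."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this many words later
    windows = []
    for start in range(0, n_words, step):
        end = min(start + chunk_size, n_words)
        windows.append((start, end))
        if end == n_words:
            break  # the remaining words are already covered
    return windows

# A 1,000-word article yields three chunks; neighbors share 50 words.
for start, end in chunk_windows(1000):
    print(f"words {start}-{end - 1}")
```

Consecutive windows share exactly overlap words, so a sentence that straddles a chunk boundary still appears whole in at least one chunk as long as it is shorter than the overlap.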

3. Real-Time Query Pipeline

The RAG pipeline is triggered each time a customer speech segment completes (via speech analytics or Notification API transcript events):

async def rag_agent_assist(
    customer_utterance: str,
    conversation_context: str,  # Last 3-4 turns of conversation
    top_k: int = 5,
    min_score: float = 0.72
) -> dict:
    """
    Given the customer's latest utterance, retrieve relevant knowledge and generate a grounded answer.
    Returns: { answer, sources, confidence, latency_ms }
    """
    import asyncio
    import time
    start = time.time()
    
    index = pc.Index(INDEX_NAME)
    
    # Step 1: Embed the query
    query_text = f"Customer question: {customer_utterance}\n\nConversation context: {conversation_context[-500:]}"
    
    # The OpenAI and Pinecone clients used here are synchronous. Running each call
    # through asyncio.to_thread (Python 3.9+) keeps this coroutine from blocking
    # the event loop while other conversations are being served.
    embed_resp = await asyncio.to_thread(
        client.embeddings.create,
        input=[query_text],
        model=EMBEDDING_MODEL
    )
    query_vector = embed_resp.data[0].embedding
    
    # Step 2: Retrieve top-K semantically similar chunks
    results = await asyncio.to_thread(
        index.query,
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )
    
    # Filter by minimum relevance score
    relevant_chunks = [
        r for r in results.matches
        if r.score >= min_score
    ]
    
    if not relevant_chunks:
        return {
            "answer": None,
            "sources": [],
            "confidence": "LOW",
            "latency_ms": int((time.time() - start) * 1000),
            "reason": "No relevant articles found above confidence threshold."
        }
    
    # Step 3: Build context for LLM generation
    context_blocks = []
    sources = []
    
    for i, match in enumerate(relevant_chunks):
        # "text_preview" must have been stored in the vector metadata at ingestion
        # time; if it is missing, the LLM receives only the article title.
        context_blocks.append(f"[Source {i+1}] {match.metadata['article_title']}:\n{match.metadata.get('text_preview', '')}")
        sources.append({
            "articleId": match.metadata["article_id"],
            "title": match.metadata["article_title"],
            "relevanceScore": round(match.score, 3),
            "sourceUrl": match.metadata.get("source_url", "")
        })
    
    context = "\n\n---\n\n".join(context_blocks)
    
    # Step 4: Generate grounded answer (again off the event loop)
    generation_resp = await asyncio.to_thread(
        client.chat.completions.create,
        model="gpt-4o",
        temperature=0.1,  # Low temperature for factual, consistent answers
        max_tokens=300,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an agent assist system for a contact center. "
                    "Your job is to provide concise, accurate answers to agents helping customers. "
                    "ONLY use information from the provided knowledge base sources. "
                    "If the sources don't contain the answer, say so explicitly. "
                    "DO NOT add information from your training data. "
                    "Reference sources by number (e.g., 'According to Source 1...')."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Customer said: \"{customer_utterance}\"\n\n"
                    f"Knowledge Base Sources:\n{context}\n\n"
                    f"Provide a concise agent response recommendation (2-4 sentences max)."
                )
            }
        ]
    )
    
    answer = generation_resp.choices[0].message.content
    
    return {
        "answer": answer,
        "sources": sources[:3],  # Return top 3 most relevant sources
        "confidence": "HIGH" if relevant_chunks[0].score >= 0.85 else "MEDIUM",
        "latency_ms": int((time.time() - start) * 1000)
    }

The Trap - generating answers without grounding constraints: Without an explicit system prompt instructing the LLM to use only the provided sources, GPT-4o can confidently blend its training knowledge with the retrieved content. In a contact center, this produces plausible-sounding but factually incorrect answers about your specific products, pricing, and policies. The system prompt instruction "ONLY use information from the provided knowledge base sources" is the critical guardrail. Test it by asking questions about information that is not in your KB and verifying that the model declines to answer.
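
One way to automate that test is a small regression check that feeds out-of-KB questions to the pipeline and flags any answer that does not read as a refusal. This is a minimal sketch under stated assumptions: the function names, the REFUSAL_MARKERS phrases, and the generate callable (a stand-in for whatever wraps your RAG pipeline) are all hypothetical, and a phrase match is only a rough heuristic for detecting a refusal.

```python
from typing import Callable

# Hypothetical phrases that signal the model declined; tune these to your prompt.
REFUSAL_MARKERS = (
    "don't contain",
    "do not contain",
    "not in the provided sources",
    "no information",
    "cannot find",
)

def looks_like_refusal(answer: str) -> bool:
    """Heuristic: True if the answer contains any known refusal phrase."""
    text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)

def run_grounding_check(generate: Callable[[str], str], out_of_kb_questions: list[str]) -> list[str]:
    """Return the questions the model answered instead of declining."""
    return [q for q in out_of_kb_questions if not looks_like_refusal(generate(q))]

# Usage with a stub in place of the real pipeline:
def stub_generate(question: str) -> str:
    return "The sources don't contain information about that."

failures = run_grounding_check(stub_generate, ["What is the CEO's home address?"])
print("Ungrounded answers:", failures)
```

An empty list means every out-of-KB question was declined; any returned question is a candidate for prompt tightening or manual review.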


4. Latency Optimization for Real-Time Agent Assist

The full RAG pipeline (embed query + vector search + LLM generation) must complete in under 2 seconds for real-time use. Optimize each stage:

Embedding latency (~150ms): Pre-warm the embedding connection by making a dummy call at service startup. Use OpenAI’s async client (AsyncOpenAI) for non-blocking embedding calls during conversations.

Vector search latency (~50ms): Pinecone serverless typically responds in roughly 30-80ms. Use a lower top_k (3 instead of 10) during the live assist path - you can always expand for background processing. For small KBs (<100K vectors), exact rather than approximate nearest-neighbor search is affordable and gives higher recall; check whether your vector database offers an exact-search option.

LLM generation latency (~600-900ms): Use streaming to show the answer as it generates rather than waiting for completion:

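from openai import AsyncOpenAI

# Async client: the synchronous client cannot be awaited or iterated with async for
async_client = AsyncOpenAI()

# Condensed from the grounding system prompt used in the query pipeline
AGENT_ASSIST_SYSTEM_PROMPT = (
    "You are an agent assist system for a contact center. "
    "ONLY use information from the provided knowledge base sources. "
    "If the sources don't contain the answer, say so explicitly. "
    "Reference sources by number (e.g., 'According to Source 1...')."
)

async def stream_rag_answer(customer_utterance: str, context: str, websocket):
    """Stream the LLM answer token-by-token to the agent desktop WebSocket."""
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        stream=True,
        temperature=0.1,
        max_tokens=200,
        messages=[
            {"role": "system", "content": AGENT_ASSIST_SYSTEM_PROMPT},
            {"role": "user", "content": f"Customer: {customer_utterance}\n\nSources: {context}"}
        ]
    )
    
    full_answer = ""
    async for chunk in stream:
        if not chunk.choices:  # some chunks carry no choices; skip them
            continue
        delta = chunk.choices[0].delta.content or ""
        full_answer += delta
        await websocket.send_json({"type": "answer_chunk", "content": delta})
    
    await websocket.send_json({"type": "answer_complete", "full_answer": full_answer})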

With streaming, the agent sees the first words of the answer within a few hundred milliseconds while the rest generates - far better perceived responsiveness than waiting for the full response.
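
To verify the 2-second budget in practice, it helps to time each stage separately rather than only the total. The sketch below is one possible way to do that; the LatencyBudget class and its stage names are hypothetical, and the time.sleep calls stand in for the real embedding and search calls.

```python
import time
from contextlib import contextmanager

class LatencyBudget:
    """Record per-stage wall-clock time and compare the total against a budget."""

    def __init__(self, budget_ms: float = 2000.0):
        self.budget_ms = budget_ms
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures still show up in timings
            self.stages[name] = (time.perf_counter() - start) * 1000

    @property
    def total_ms(self) -> float:
        return sum(self.stages.values())

    def over_budget(self) -> bool:
        return self.total_ms > self.budget_ms

budget = LatencyBudget()
with budget.stage("embed"):
    time.sleep(0.01)   # placeholder for the embedding call
with budget.stage("search"):
    time.sleep(0.005)  # placeholder for the vector query

print({name: round(ms, 1) for name, ms in budget.stages.items()}, "over budget:", budget.over_budget())
```

Logging the per-stage numbers alongside the conversation ID makes it clear which stage to optimize when the total drifts past the budget.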


5. Knowledge Base Synchronization

The vector index must stay in sync with your Genesys Cloud Knowledge Base. When articles are created, updated, or deleted, the index must update:

import requests

KB_ID = "your-knowledge-base-id"  # Placeholder: ID of the Genesys Cloud knowledge base

def sync_article_to_vector_index(article_id: str, action: str, access_token: str, base_url: str):
    """
    Sync a single article change to the vector index.
    action: "upsert" | "delete"
    """
    index = pc.Index(INDEX_NAME)
    
    if action == "delete":
        # Delete all chunks for this article.
        # Note: support for deleting by metadata filter depends on the Pinecone
        # index type; confirm it for your index, or delete by chunk ID instead.
        index.delete(filter={"article_id": {"$eq": article_id}})
        return
    
    # Fetch updated article from Genesys Cloud
    resp = requests.get(
        f"{base_url}/api/v2/knowledge/knowledgebases/{KB_ID}/documents/{article_id}",
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=10  # avoid hanging the sync worker on a stalled request
    )
    resp.raise_for_status()
    article = resp.json()
    
    # Delete old chunks and re-embed
    index.delete(filter={"article_id": {"$eq": article_id}})
    embed_and_upsert_articles([article])

Trigger this via a Genesys Cloud Notification Service subscription to knowledge base change events, or via a nightly reconciliation job that compares article modification dates against the last sync timestamp.


Validation, Edge Cases & Troubleshooting

Edge Case 1: Hallucination Despite Grounding Instructions

Even with strict grounding instructions, LLMs occasionally introduce details not present in the retrieved sources - particularly when the retrieved chunks are partially relevant and the model “fills in the gaps.” Implement a post-generation fact check: extract key factual claims from the generated answer (using a second, lightweight LLM call) and verify each claim is supported by a substring of the retrieved chunks. If a claim is unsupported, replace the answer with the retrieved chunk text directly.
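
The verification half of that check can be sketched without any LLM call. The example below assumes the claims have already been extracted by the second LLM call; it swaps the strict substring test for a token-overlap test (so small rewordings still count as supported) and additionally requires every number in a claim to appear in the supporting chunk. The function names and the 0.8 threshold are hypothetical choices for illustration.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def claim_is_supported(claim: str, chunks: list[str], min_overlap: float = 0.8) -> bool:
    """True if some chunk contains most of the claim's tokens and all of its numbers."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return True  # nothing to verify
    claim_numbers = {t for t in claim_tokens if t.isdigit()}
    for chunk in chunks:
        chunk_tokens = _tokens(chunk)
        if not claim_numbers <= chunk_tokens:
            continue  # a number in the claim is absent from this chunk
        overlap = len(claim_tokens & chunk_tokens) / len(claim_tokens)
        if overlap >= min_overlap:
            return True
    return False

def enforce_grounding(answer: str, claims: list[str], chunks: list[str]) -> str:
    """Return the answer if every claim is supported, else the top chunk verbatim."""
    if all(claim_is_supported(c, chunks) for c in claims):
        return answer
    return chunks[0] if chunks else ""

chunks = ["Restart the modem by holding the reset button for 10 seconds."]
print(claim_is_supported("Hold the reset button for 10 seconds", chunks))
print(claim_is_supported("Hold the reset button for 30 seconds", chunks))
```

The numeric rule matters because a claim that changes only a duration or price would otherwise pass a pure overlap test.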

Edge Case 2: Stale Vectors After Knowledge Base Update

If an article is updated in Genesys Cloud Knowledge but the change notification never reaches your sync service (network issue, dropped connection, handler timeout), the vector index contains stale embeddings. Run a nightly reconciliation: compare the dateModified of every knowledge article against a sync log (for example, a DynamoDB table). Re-embed any article modified more recently than its last sync timestamp. This catches notification delivery failures.
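
The comparison at the heart of that reconciliation job is small. In the sketch below, a plain dict mapping article ID to last-sync timestamp stands in for the DynamoDB sync log, timestamps are assumed to be ISO-8601 strings in UTC, and the function names are hypothetical. It also reports IDs that exist in the sync log but no longer exist in the knowledge base, which covers missed deletions.

```python
from datetime import datetime

def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def find_stale_articles(articles: list[dict], sync_log: dict[str, str]) -> list[str]:
    """IDs of articles never synced, or modified after their last sync."""
    stale = []
    for article in articles:
        last_sync = sync_log.get(article["id"])
        if last_sync is None or parse_ts(article["dateModified"]) > parse_ts(last_sync):
            stale.append(article["id"])
    return stale

def find_orphaned_ids(articles: list[dict], sync_log: dict[str, str]) -> list[str]:
    """IDs present in the sync log but no longer in the knowledge base."""
    live_ids = {a["id"] for a in articles}
    return sorted(set(sync_log) - live_ids)

# Illustrative data only
articles = [
    {"id": "a1", "dateModified": "2024-05-02T10:00:00Z"},
    {"id": "a2", "dateModified": "2024-05-01T09:00:00Z"},
    {"id": "a3", "dateModified": "2024-05-03T00:00:00Z"},
]
sync_log = {
    "a1": "2024-05-01T00:00:00Z",
    "a2": "2024-05-01T12:00:00Z",
    "gone": "2024-04-01T00:00:00Z",
}

print("Re-embed:", find_stale_articles(articles, sync_log))
print("Delete:  ", find_orphaned_ids(articles, sync_log))
```

Stale IDs would be passed to the upsert path and orphaned IDs to the delete path of the sync function.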

Edge Case 3: Long Conversations Generating Overly Broad Queries

After 20+ turns of conversation, the “conversation context” passed to the RAG query becomes a long, unfocused block of text. The embedding of a long mixed-topic context produces a diffuse query vector that retrieves mediocre results. Use only the last 2 turns of conversation as context, not the full history. For long conversations, extract a running summary (via a lightweight LLM call every 5 turns) and use the summary + last 2 turns as the query context.
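
A minimal sketch of that context-building rule follows. The function names and the "Summary:" prefix are hypothetical; the summary itself is assumed to come from the lightweight LLM call described above and is simply passed in as a string here.

```python
def build_query_context(turns: list[str], summary: str = "", last_n: int = 2) -> str:
    """Combine an optional running summary with only the most recent turns."""
    recent = turns[-last_n:] if last_n > 0 else []
    parts = []
    if summary:
        parts.append(f"Summary: {summary}")
    parts.extend(recent)
    return "\n".join(parts)

def should_refresh_summary(turn_count: int, every: int = 5) -> bool:
    """True on every Nth turn, when the running summary should be regenerated."""
    return turn_count > 0 and turn_count % every == 0

turns = [f"turn {i}" for i in range(1, 21)]
print(build_query_context(turns, summary="Customer has recurring outages"))
```

Keeping the summary to one or two sentences preserves the long-range topic without diluting the query embedding the way a full transcript would.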

Edge Case 4: Agent Desktop Latency Perception

Even at 1.5 seconds total latency, agents perceive RAG answers as slow if the UI shows a blank panel while waiting. Implement a “thinking” indicator that appears immediately when the customer stops speaking, and use streaming to show the answer as it generates. In user testing, streaming at 500ms for first token dramatically outperforms a 1.2-second wait for the full answer in perceived responsiveness - even though total latency is similar.
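
The message ordering this implies can be sketched as follows: the indicator goes out before any model work starts, then the answer streams. The "thinking" message type, the function name, and the RecordingSocket test double are hypothetical; the answer_chunk and answer_complete message shapes mirror the streaming code in section 4. A fake token generator stands in for the LLM stream.

```python
import asyncio
from typing import AsyncIterator

async def send_with_thinking_indicator(websocket, token_stream: AsyncIterator[str]) -> str:
    """Send a 'thinking' event immediately, then stream tokens, then a completion event."""
    await websocket.send_json({"type": "thinking"})  # hypothetical message type
    parts = []
    async for token in token_stream:
        parts.append(token)
        await websocket.send_json({"type": "answer_chunk", "content": token})
    full_answer = "".join(parts)
    await websocket.send_json({"type": "answer_complete", "full_answer": full_answer})
    return full_answer

class RecordingSocket:
    """Test double that records every message instead of sending it."""
    def __init__(self):
        self.sent = []

    async def send_json(self, message: dict):
        self.sent.append(message)

async def fake_tokens():
    """Stand-in for the LLM token stream."""
    for token in ["Restart ", "the ", "modem."]:
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield token

sock = RecordingSocket()
result = asyncio.run(send_with_thinking_indicator(sock, fake_tokens()))
print([m["type"] for m in sock.sent])
```

The agent desktop can then clear the indicator on the first answer_chunk it receives.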

