Implementing Retrieval-Augmented Generation (RAG) Pipelines with Guardrails for Agent Copilot

What This Guide Covers

You are building a production-grade Retrieval-Augmented Generation (RAG) pipeline that powers an Agent Copilot sidebar in Genesys Cloud, giving agents real-time, grounded knowledge article suggestions and next-best-response recommendations during live interactions. When complete, your RAG system will retrieve the top-k most semantically relevant knowledge articles from a vector store based on the live conversation transcript, inject them as context into an LLM prompt, generate a suggested response, apply safety guardrails that catch hallucinations and policy violations, and deliver the suggestion to the agent’s desktop within 1.5 seconds of each customer utterance, which is fast enough to be useful during a live conversation.


Prerequisites, Roles & Licensing

  • Genesys Cloud: CX 2 or 3 with Digital channels or Genesys Agent Assist.
  • Infrastructure:
    • A vector database (Pinecone, Weaviate, pgvector, or Qdrant).
    • An embedding model (OpenAI text-embedding-3-small or sentence-transformers/all-MiniLM-L6-v2).
    • An LLM inference endpoint (OpenAI GPT-4o-mini, Anthropic Claude Haiku, or self-hosted Qwen via Ollama).
    • A FastAPI service acting as the RAG backend.
    • Genesys Cloud Notification API subscription for real-time conversation transcript events.

The Implementation Deep-Dive

1. Why RAG Rather than Fine-Tuning?

Approach    | Update Frequency                          | Risk of Hallucination                             | Knowledge Freshness
Fine-tuning | Requires model retraining (weeks/monthly) | Medium - model may confabulate when off-domain    | Stale between retraining cycles
RAG         | Update vector store in real time          | Low - answers are grounded in retrieved documents | Always fresh - new articles indexed immediately

For a contact center knowledge base that changes frequently (product updates, policy changes, SLA adjustments), RAG is the correct architecture. Fine-tuning is reserved for teaching the model how to respond (tone, format, brand voice), not what to know.


2. Knowledge Base Indexing Pipeline

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
qdrant = QdrantClient(url="http://localhost:6333")

COLLECTION_NAME = "contact_center_knowledge"

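def setup_collection():
    # recreate_collection drops any existing collection with this name,
    # so call this only for initial setup or a full rebuild of the index.
    # size=384 matches the output dimension of all-MiniLM-L6-v2.
    qdrant.recreate_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=384, distance=Distance.COSINE)
    )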

def index_knowledge_article(article_id: str, title: str, content: str, category: str, tags: list):
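    """
    Chunks and indexes a knowledge article into the vector store.
    Articles are split into overlapping segments of up to 400 words (chunk_text counts
    whitespace-separated words, not model tokens) for better retrieval precision.
    Note: all-MiniLM-L6-v2 truncates inputs beyond 256 word pieces by default, so lower
    max_tokens if the tail of each chunk must be reflected in its embedding.
    """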
    chunks = chunk_text(content, max_tokens=400, overlap=50)
    
    points = []
    for i, chunk in enumerate(chunks):
        embedding = embedding_model.encode(chunk).tolist()
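        # Deterministic ID: re-indexing the same article overwrites its chunks instead of duplicating them
        chunk_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{article_id}:{i}"))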
        
        points.append(PointStruct(
            id=chunk_id,
            vector=embedding,
            payload={
                "article_id": article_id,
                "title": title,
                "category": category,
                "tags": tags,
                "chunk_index": i,
                "chunk_text": chunk,
                "total_chunks": len(chunks)
            }
        ))
    
    qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
    print(f"Indexed article '{title}' → {len(chunks)} chunks")

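def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Simple word-based chunker with overlap (max_tokens counts words, not model tokens)."""
    if not 0 <= overlap < max_tokens:
        # A step of zero or less would make the loop below run forever
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    i = 0
    while i < len(words):
        chunks.append(' '.join(words[i:i + max_tokens]))
        if i + max_tokens >= len(words):
            break  # this chunk reached the end; skip a trailing chunk made only of overlap
        i += step
    return chunks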
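
To see how the overlap plays out, the following sketch computes the word-index windows that a sliding-window chunker of this kind produces. The helper name chunk_windows is hypothetical and exists only for illustration; it assumes the same defaults as chunk_text (400-word windows, 50-word overlap).

```python
def chunk_windows(n_words: int, max_tokens: int = 400, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) word-index windows for a sliding-window chunker.

    Hypothetical helper for illustration only; end is exclusive and clipped to n_words.
    """
    step = max_tokens - overlap  # each new window starts this many words after the previous one
    if step <= 0:
        raise ValueError("overlap must be smaller than max_tokens")
    return [(start, min(start + max_tokens, n_words)) for start in range(0, n_words, step)]


# A 1000-word article: windows start every 350 words and share 50 words with their neighbor
print(chunk_windows(1000))
```

For a 1000-word article this yields three windows, (0, 400), (350, 750), and (700, 1000), so a sentence that straddles a window boundary still appears whole in at least one chunk as long as it is shorter than the overlap.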

3. The RAG Retrieval and Generation Pipeline

import openai

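# Disable the client's automatic retries so a timed-out call is not retried past the latency budget
openai_client = openai.AsyncOpenAI(max_retries=0)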

async def generate_agent_suggestion(
    conversation_transcript: str,
    customer_last_utterance: str,
    agent_context: dict  # queue name, customer tier, interaction metadata
) -> dict:
    """
    Full RAG pipeline: retrieve → augment → generate → guardrail → return.
    Target latency: < 1.5 seconds end-to-end.
    """
    
    # Step 1: Embed the customer's last utterance (not the full transcript)
    # Using just the last utterance keeps retrieval focused on the immediate need
    query_embedding = embedding_model.encode(customer_last_utterance).tolist()
    
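    # Step 2: Retrieve top-5 most relevant knowledge chunks
    # Note: embedding_model.encode and qdrant.search are blocking calls; under load, run them
    # in a worker thread (asyncio.to_thread) or use AsyncQdrantClient so the event loop stays free.
    search_results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_embedding,
        limit=5,
        score_threshold=0.65  # Minimum relevance score - don't retrieve unrelated docs
    )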
    
    if not search_results:
        return {
            "suggestion": None,
            "confidence": 0.0,
            "reason": "No relevant knowledge articles found for this query.",
            "sources": []
        }
    
    # Step 3: Build the augmented prompt
    retrieved_context = "\n\n".join([
        f"[Article: {r.payload['title']}]\n{r.payload['chunk_text']}"
        for r in search_results
    ])
    
    system_prompt = f"""You are an expert contact center agent assistant for {agent_context.get('company_name', 'our company')}.
Your role is to provide brief, accurate response suggestions to agents handling customer inquiries.

RULES:
1. Base your suggestion ONLY on the provided knowledge articles - do not add information not found in the articles.
2. If the articles don't contain sufficient information to answer the customer's question, say so explicitly.
3. Keep suggestions under 3 sentences - agents need concise guidance, not essays.
4. Write in a professional, empathetic tone matching our brand voice.
5. Never suggest actions outside agent authority (e.g., issuing credits over $50 requires supervisor approval).

KNOWLEDGE ARTICLES:
{retrieved_context}

CUSTOMER TIER: {agent_context.get('customer_tier', 'standard')}
QUEUE: {agent_context.get('queue_name', 'General Support')}"""
    
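    # Keep only the last 1500 characters of the transcript to bound prompt size.
    # (The slice is done outside the f-string so no code comment leaks into the prompt.)
    recent_transcript = conversation_transcript[-1500:]

    user_prompt = f"""Recent conversation:
{recent_transcript}

Customer just said: "{customer_last_utterance}"

Provide a suggested response for the agent:"""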
    
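    # Step 4: Generate (with timeout budget - 1 second max for LLM call)
    try:
        response = await openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            max_tokens=150,
            temperature=0.3,  # Low temperature for factual, consistent responses
            timeout=1.0
        )
    except openai.APITimeoutError:
        # A late suggestion is useless to the agent, so degrade to "no suggestion"
        return {
            "suggestion": None,
            "confidence": 0.0,
            "reason": "LLM call exceeded the 1-second budget.",
            "sources": []
        }
    
    # message.content can be None (e.g. on a refusal), so fall back to an empty string
    suggestion = (response.choices[0].message.content or "").strip()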
    
    # Step 5: Apply guardrails
    guardrail_result = apply_guardrails(suggestion, agent_context)
    if not guardrail_result["passed"]:
        return {
            "suggestion": None,
            "confidence": 0.0,
            "reason": f"Guardrail blocked: {guardrail_result['reason']}",
            "sources": []
        }
    
    return {
        "suggestion": suggestion,
        "confidence": round(search_results[0].score, 3),
        "sources": [{"title": r.payload["title"], "score": round(r.score, 3)} for r in search_results[:3]],
        "reason": "Generated from knowledge base"
    }
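
One way to enforce the 1.5-second target end to end, rather than only on the LLM call, is to wrap the whole pipeline in a deadline and fall back to an empty suggestion when it is missed. The sketch below uses only asyncio; with_deadline and the stand-in coroutine fake_pipeline are hypothetical names, and the fallback payload simply mirrors the shape returned above.

```python
import asyncio
from typing import Awaitable


async def with_deadline(pipeline: Awaitable[dict], budget_seconds: float = 1.5) -> dict:
    """Await the pipeline, but give up once the latency budget is spent."""
    try:
        # wait_for cancels the pipeline coroutine if the budget runs out
        return await asyncio.wait_for(pipeline, timeout=budget_seconds)
    except asyncio.TimeoutError:
        return {
            "suggestion": None,
            "confidence": 0.0,
            "reason": f"Pipeline exceeded the {budget_seconds}s latency budget.",
            "sources": [],
        }


async def fake_pipeline(delay_seconds: float) -> dict:
    """Stand-in for generate_agent_suggestion that just sleeps (illustration only)."""
    await asyncio.sleep(delay_seconds)
    return {"suggestion": "stub", "confidence": 0.9, "reason": "stub", "sources": []}
```

In the real service the call would look like await with_deadline(generate_agent_suggestion(...)), so that retrieval, generation, and guardrails all share one budget.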

4. Guardrails - Preventing Hallucination and Policy Violations

import re

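POLICY_VIOLATION_PATTERNS = [
    # Dollar amounts of $100 or more, with or without thousands separators (e.g. $250, $1,000).
    # Note: the $50 credit limit in the system prompt is enforced only by the prompt;
    # amounts from $50 to $99 pass this pattern.
    (r'\$\s?(?:\d{1,3}(?:,\d{3})+|\d{3,})', "Suggests credit/refund of $100 or more - requires supervisor"),
    (r'\bguarantee[ds]?\b', "Absolute guarantees require manager approval"),
    # Word boundaries keep "sue" from matching inside "issue" or "pursue"
    (r'\b(?:sue[sd]?|legal action|attorneys?)\b', "Legal threat references require escalation to Legal"),
    (r'\b(?:confidential|internal only)\b', "Potential internal data exposure"),
]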

def apply_guardrails(suggestion: str, agent_context: dict) -> dict:
    """
    Applies safety guardrails to the generated suggestion.
    Returns {"passed": bool, "reason": str, "has_uncertainty": bool}.
    """
    suggestion_lower = suggestion.lower()
    
    # Explicit uncertainty indicators are good (the model is being honest):
    # don't block on them, but report them so the UI can flag the suggestion
    has_uncertainty = any(phrase in suggestion_lower for phrase in [
        "i'm not sure", "i don't have information", "you should consult",
        "based on the information available"
    ])
    
    # Check policy violation patterns
    for pattern, reason in POLICY_VIOLATION_PATTERNS:
        if re.search(pattern, suggestion, re.IGNORECASE):
            return {"passed": False, "reason": reason, "has_uncertainty": has_uncertainty}
    
    # Minimum quality check - don't send empty or near-empty suggestions
    if len(suggestion.strip()) < 20:
        return {"passed": False, "reason": "Suggestion too short to be useful",
                "has_uncertainty": has_uncertainty}
    
    return {"passed": True, "reason": "passed", "has_uncertainty": has_uncertainty}
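
The system prompt caps agent-issued credits at $50, a limit that patterns based on digit counts express poorly. A sketch of an amount-aware check follows, assuming a hypothetical MAX_AGENT_CREDIT constant and a hypothetical helper named exceeds_credit_limit; it parses every dollar figure in the suggestion and compares it numerically.

```python
import re
from decimal import Decimal

MAX_AGENT_CREDIT = Decimal("50")  # assumption: taken from rule 5 of the system prompt

# Matches figures such as "$50", "$ 75.50", and "$1,200"; group 1 is the whole-dollar part
_DOLLAR_AMOUNT = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")


def exceeds_credit_limit(suggestion: str, limit: Decimal = MAX_AGENT_CREDIT) -> bool:
    """Return True if any dollar amount in the suggestion is above the limit."""
    for match in _DOLLAR_AMOUNT.finditer(suggestion):
        whole = match.group(1).replace(",", "")  # drop thousands separators
        cents = match.group(2) or ""             # optional ".NN" part
        if Decimal(whole + cents) > limit:
            return True
    return False
```

This check is deliberately conservative: it cannot tell a credit from a price, so a suggestion quoting a $60 monthly plan would also be flagged. Whether that trade-off is acceptable depends on how often legitimate suggestions mention prices.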

Validation, Edge Cases & Troubleshooting

Edge Case 1: Retrieval Returns High-Score but Wrong-Category Articles

The similarity search returns a high-scoring article about “billing dispute resolution” when the customer is asking about a “technical connectivity issue.” Both mention “account” and “resolve” heavily, causing false retrieval.
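Solution: Add category filtering to the vector search. If the IVR/bot has already classified the interaction type (technical, billing, retention), apply a metadata filter to restrict retrieval to the relevant categories. In Pinecone-style syntax this is filter={"category": {"$in": ["technical", "general"]}}; Qdrant expresses the same condition through the query_filter argument of the search call, matching on the category payload field written at indexing time.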
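
For the Qdrant collection used in this guide, the restriction can be sketched as a filter on the category payload field. The snippet below builds the filter as a plain dictionary in the shape Qdrant's REST API uses (must / key / match.any). The intent-to-category mapping and the function name build_category_filter are assumptions for illustration; with the Python client, the equivalent condition would be built from the Filter, FieldCondition, and MatchAny models in qdrant_client.models.

```python
from typing import Optional

# Hypothetical mapping from the IVR/bot classification to the categories worth searching
CATEGORIES_BY_INTERACTION_TYPE = {
    "technical": ["technical", "general"],
    "billing": ["billing", "general"],
    "retention": ["retention", "general"],
}


def build_category_filter(interaction_type: Optional[str]) -> Optional[dict]:
    """Return a Qdrant-style payload filter, or None to search every category."""
    categories = CATEGORIES_BY_INTERACTION_TYPE.get(interaction_type or "")
    if not categories:
        return None  # unclassified interaction: fall back to unfiltered retrieval
    return {"must": [{"key": "category", "match": {"any": categories}}]}
```

Returning None for unclassified interactions keeps the copilot working when the IVR provides no signal, at the cost of the cross-category false positives described above.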

Edge Case 2: RAG Latency Exceeds 1.5 Seconds During High Traffic

Under load, the vector search + LLM call chain takes 2.5+ seconds. By the time the suggestion arrives, the agent has already responded.
Solution: Implement speculative generation: pre-generate suggestions for the 5 most common intents as soon as the interaction starts (predicted from IVR selections or opening message). Cache them. If the customer’s actual question matches a pre-generated suggestion, serve it instantly from cache.
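
A minimal sketch of the cache side of speculative generation, assuming intents are already predicted upstream and that a hypothetical generate callable wraps the RAG pipeline. SpeculativeCache, its method names, and the 10-minute TTL are illustrative choices, not part of any Genesys API.

```python
import time
from typing import Callable, Optional


class SpeculativeCache:
    """Per-conversation cache of suggestions pre-generated for predicted intents."""

    def __init__(self, ttl_seconds: float = 600.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict = {}  # (conversation_id, intent) -> (stored_at, suggestion)

    def prefill(self, conversation_id: str, intents: list, generate: Callable[[str], dict]) -> None:
        """Generate and store one suggestion per predicted intent (call at interaction start)."""
        for intent in intents[:5]:  # the text suggests the 5 most common intents
            self._entries[(conversation_id, intent)] = (self._clock(), generate(intent))

    def lookup(self, conversation_id: str, intent: str) -> Optional[dict]:
        """Return the cached suggestion for this intent, or None on a miss or expiry."""
        key = (conversation_id, intent)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, suggestion = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # expired: drop it and treat as a miss
            return None
        return suggestion
```

On a miss, the service falls through to the normal retrieve-and-generate path, so the cache only ever shortens latency. Matching the customer's actual utterance to an intent key is left open here; an intent classifier or a similarity threshold against the pre-generated queries are two possible approaches.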

Edge Case 3: Agent Receives Conflicting Suggestions from Multiple Knowledge Articles

Article A says the refund window is 30 days. Article B (an older version) says 14 days. The RAG system retrieves both and the LLM synthesizes an uncertain “between 14 and 30 days” response.
Solution: Implement knowledge article versioning with a last_updated metadata field. During retrieval, prefer more recently updated articles by blending a recency boost into the similarity score: final_score = cosine_score * 0.8 + recency_score * 0.2, where recency_score is a value between 0 and 1 derived from last_updated. Mark superseded articles as archived rather than deleting them, so their history is preserved, and exclude archived articles from production retrieval.
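
The blending formula can be sketched as a re-ranking step applied to the retrieved hits. The text leaves the computation of recency_score open, so the example assumes an exponential decay with a hypothetical 180-day half-life; the function names and the hit dictionary layout are likewise illustrative.

```python
from datetime import datetime, timedelta, timezone

HALF_LIFE_DAYS = 180.0  # assumption: an article loses half its recency weight every 180 days


def recency_score(last_updated: datetime, now: datetime) -> float:
    """Map article age to a 0..1 score: 1.0 when just updated, 0.5 after one half-life."""
    age_days = max((now - last_updated).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / HALF_LIFE_DAYS)


def final_score(cosine_score: float, last_updated: datetime, now: datetime) -> float:
    """Blend similarity and recency with the 0.8 / 0.2 weights from the text."""
    return cosine_score * 0.8 + recency_score(last_updated, now) * 0.2


def rerank(hits: list, now: datetime) -> list:
    """Sort hits (dicts with 'score' and 'last_updated') by blended score, best first."""
    return sorted(hits, key=lambda h: final_score(h["score"], h["last_updated"], now), reverse=True)


# Example: the older article has the higher cosine score but loses after the recency boost
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
hits = [
    {"title": "Refund policy (old)", "score": 0.82, "last_updated": now - timedelta(days=360)},
    {"title": "Refund policy (current)", "score": 0.80, "last_updated": now},
]
ranked = rerank(hits, now)
```

Here the current article scores 0.80 * 0.8 + 1.0 * 0.2 = 0.84, while the year-old one scores 0.82 * 0.8 + 0.25 * 0.2 = 0.706, so the fresher version ranks first. Re-ranking changes order only; keeping the superseded article out of the prompt entirely still requires the archive filter.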
