Implementing Custom Named Entity Recognition (NER) for Extracting Proprietary Product Codes

What This Guide Covers

You are building a custom Named Entity Recognition (NER) pipeline that identifies your organization’s proprietary identifiers in real-time conversation transcripts - internal product SKUs (e.g., “GTX-Pro-2000”), service plan codes (“Enterprise-Tier-3”), model numbers, internal case reference formats, and contract IDs - entities that standard off-the-shelf NER models (spaCy, AWS Comprehend) have never seen and cannot recognize. When complete, every time a customer mentions a product code in voice or chat, the agent desktop automatically populates the product details panel from your product catalog, reducing average handle time by eliminating manual product lookup.


Prerequisites, Roles & Licensing

  • Genesys Cloud: CX 2 or CX 3 with Speech and Text Analytics or a third-party NLU integration
  • NER approach: Rule-based (regex + entity dictionaries) for structured codes; ML-based (fine-tuned spaCy or a transformer) for freeform product mentions
  • Integration point: Post-call analytics pipeline, or real-time via Genesys Cloud transcript streaming API + a sidecar NER service
  • Hosting: Lambda, Cloud Run, or a container - the NER service must keep inference latency under 50 ms for real-time use
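
One way to check the sub-50 ms budget before wiring anything into Genesys Cloud is to time the extractor offline on sample transcript segments. The harness below is only a sketch under stated assumptions: `p95_latency_ms` and the stand-in `dummy_extract` are hypothetical names, not part of any Genesys or spaCy API, and you would pass your own extraction function instead.

```python
import statistics
import time
from typing import Callable

def p95_latency_ms(extract: Callable[[str], object], segments: list[str], repeats: int = 5) -> float:
    """Return the 95th-percentile latency in milliseconds of extract() over the segments."""
    timings = []
    for _ in range(repeats):
        for segment in segments:
            started = time.perf_counter()
            extract(segment)
            timings.append((time.perf_counter() - started) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    return statistics.quantiles(timings, n=20)[-1]

def dummy_extract(text: str) -> list[str]:
    """Hypothetical stand-in for the real NER call."""
    return [token for token in text.split() if "-" in token]

if __name__ == "__main__":
    sample = ["my GTX-Pro-2000 keeps rebooting", "contract CNT-2024-001234 please"]
    print(f"p95 latency: {p95_latency_ms(dummy_extract, sample):.3f} ms")
```

Measuring a percentile rather than the mean is a deliberate choice here: a real-time panel is judged by its slowest typical responses, and model cold starts on Lambda or Cloud Run tend to show up in the tail.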

The Implementation Deep-Dive

1. Entity Type Design for Contact Center Identifiers

Before building the NER system, formally define your entity types and their expected patterns. The more precisely each type is defined, the higher your precision:

Entity Type    | Example                   | Pattern
PRODUCT_SKU    | GTX-Pro-2000, XT-500-B    | [A-Z]{2,4}-[A-Za-z]+-\d{3,4}[A-Z]?
CONTRACT_ID    | CNT-2024-001234           | CNT-\d{4}-\d{6}
CASE_REF       | CASE-98765, TKT-12345     | (CASE|TKT)-\d{5,6}
SERVICE_PLAN   | Enterprise-Tier-3, SMB-Pro | (Enterprise|SMB|Starter|Pro)-[A-Za-z]+-?\d?
SERIAL_NUMBER  | SN: 4X7H-2091-KL          | [0-9A-Z]{4}-\d{4}-[A-Z]{2} (after the "SN" prefix)
EMPLOYEE_CODE  | EMP-00456                 | EMP-\d{5}

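Supplement regex patterns with a curated entity dictionary - a lookup table of all valid values. The dictionary matters for more than validation: XT-500-B, for example, has a numeric middle segment and does not match the PRODUCT_SKU pattern above, so codes like it can only be found by matching against the dictionary directly: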

# product_catalog.json - loaded from your product database.
# Keys are stored upper-cased so that lookups can normalize with str.upper().
PRODUCT_CATALOG = {
    "GTX-PRO-2000": {"name": "GTX Pro 2000 Router", "category": "Networking", "status": "Active"},
    "XT-500-B": {"name": "XT 500 Blaze Modem", "category": "Modem", "status": "Discontinued"},
    "ENTERPRISE-TIER-3": {"name": "Enterprise Support Tier 3", "category": "Service Plan"}
}

Dictionary lookup is faster and more precise than regex alone - and crucially, it validates that the extracted code is a real entity, not just a string that matches the pattern.
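
A dictionary can also be used as a matcher in its own right, which helps with catalog codes that do not fit any of the general patterns. The following is a minimal sketch, assuming the catalog is keyed by upper-cased code; `CATALOG`, `build_catalog_matcher`, and `find_catalog_codes` are illustrative names rather than part of the pipeline defined in this guide.

```python
import re

# Illustrative catalog keyed by upper-cased code (values trimmed for brevity).
CATALOG = {
    "GTX-PRO-2000": {"name": "GTX Pro 2000 Router"},
    "XT-500-B": {"name": "XT 500 Blaze Modem"},
}

def build_catalog_matcher(catalog: dict) -> re.Pattern:
    """Compile one case-insensitive alternation over every catalog code, longest first."""
    codes = sorted(catalog, key=len, reverse=True)
    alternation = "|".join(re.escape(code) for code in codes)
    # Lookarounds reject matches embedded in a longer alphanumeric or hyphenated token.
    return re.compile(
        rf"(?<![A-Za-z0-9-])(?:{alternation})(?![A-Za-z0-9-])", re.IGNORECASE
    )

def find_catalog_codes(text: str, catalog: dict) -> list[tuple[int, int, str]]:
    """Return (start, end, canonical_code) for every catalog code found in text."""
    if not catalog:
        return []  # an empty alternation would match the empty string everywhere
    matcher = build_catalog_matcher(catalog)
    return [(m.start(), m.end(), m.group(0).upper()) for m in matcher.finditer(text)]

if __name__ == "__main__":
    for start, end, code in find_catalog_codes("My xt-500-b modem is dead", CATALOG):
        print(start, end, code, CATALOG[code]["name"])
```

In a long-running service the compiled pattern would normally be built once at startup and rebuilt when the catalog changes, rather than on every call as in this sketch.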


2. Rule-Based NER with Regex + Dictionary Validation

import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class EntitySpan:
    text: str
    entity_type: str
    start: int
    end: int
    confidence: float
    metadata: dict

# Ordered by specificity - more specific patterns first
ENTITY_PATTERNS = [
    ("CONTRACT_ID", r"\bCNT-\d{4}-\d{6}\b"),
    ("CASE_REF", r"\b(CASE|TKT)-\d{5,6}\b"),
    ("PRODUCT_SKU", r"\b[A-Z]{2,4}-[A-Za-z]+-\d{3,4}[A-Z]?\b"),
    ("SERVICE_PLAN", r"\b(Enterprise|SMB|Starter|Pro)-[A-Za-z]+-?\d?\b"),
    # [:\s]* (not [:\s]?) so that "SN: 4X7H-2091-KL", with both a colon and a space, matches
    ("SERIAL_NUMBER", r"\bSN[:\s]*([0-9A-Z]{4}-\d{4}-[A-Z]{2})\b"),
    ("EMPLOYEE_CODE", r"\bEMP-\d{5}\b"),
]

def extract_entities_regex(text: str) -> list[EntitySpan]:
    """Extract entities using regex patterns."""
    entities = []
    matched_spans = []  # Track to avoid overlapping matches
    
    for entity_type, pattern in ENTITY_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            # Check for overlap with already matched spans
            overlap = any(
                match.start() < end and match.end() > start
                for start, end in matched_spans
            )
            if overlap:
                continue
            
            matched_text = match.group(0)
            
            # Dictionary validation for PRODUCT_SKU
            confidence = 0.85  # Base regex confidence
            metadata = {}
            
            if entity_type == "PRODUCT_SKU":
                normalized = matched_text.upper()
                if normalized in PRODUCT_CATALOG:
                    confidence = 0.99  # Dictionary-confirmed
                    metadata = PRODUCT_CATALOG[normalized]
                else:
                    confidence = 0.65  # Pattern match only - not in catalog
            
            entities.append(EntitySpan(
                text=matched_text,
                entity_type=entity_type,
                start=match.start(),
                end=match.end(),
                confidence=confidence,
                metadata=metadata
            ))
            matched_spans.append((match.start(), match.end()))
    
    # Sort by position in text
    return sorted(entities, key=lambda e: e.start)

3. ML-Based NER with Fine-Tuned spaCy for Fuzzy Matches

For product mentions that don’t follow rigid patterns (“the pro two thousand model”, “the enterprise three plan”), rule-based NER fails. Fine-tune a spaCy NER model on your annotated conversation data:

Step 1: Create training data from annotated transcripts

import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

def create_training_data(annotated_examples: list[dict]) -> DocBin:
    """
    annotated_examples: [
        {
            "text": "Customer has the GTX Pro 2000 router",
            "entities": [(17, 29, "PRODUCT_SKU")]
        },
        ...
    ]
    """
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    
    for example in annotated_examples:
        doc = nlp.make_doc(example["text"])
        ents = []
        
        for start_char, end_char, label in example["entities"]:
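            span = doc.char_span(start_char, end_char, label=label)
            # char_span returns None when the offsets do not align with token boundaries
            if span is not None:
                ents.append(span)
            else:
                print(f"Skipping misaligned span {(start_char, end_char)} in {example['text']!r}")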
        
        doc.ents = filter_spans(ents)
        doc_bin.add(doc)
    
    return doc_bin
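
Hand-counted character offsets are easy to get wrong by one, and a span whose offsets do not line up with token boundaries is dropped by the `span is not None` check above. One way to avoid this is to derive the offsets from the entity's surface text. The helper below is a sketch; `annotate` is a hypothetical name, and it assumes mentions are listed in the order they appear in the text.

```python
def annotate(text: str, mentions: list[tuple[str, str]]) -> dict:
    """Build a training example by locating each (surface, label) mention in text.

    Mentions must be given in order of appearance; each search starts where the
    previous mention ended, so repeated surfaces map to successive occurrences.
    """
    entities = []
    search_from = 0
    for surface, label in mentions:
        start = text.find(surface, search_from)
        if start == -1:
            raise ValueError(f"{surface!r} not found in {text!r}")
        end = start + len(surface)
        entities.append((start, end, label))
        search_from = end
    return {"text": text, "entities": entities}

if __name__ == "__main__":
    example = annotate("Customer has the GTX Pro 2000 router", [("GTX Pro 2000", "PRODUCT_SKU")])
    print(example["entities"])
```

The resulting dictionaries have the same shape as the `annotated_examples` items that `create_training_data` expects.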

Step 2: Training configuration (config.cfg)

[paths]
train = ./data/train.spacy
dev = ./data/dev.spacy

[system]
gpu_allocator = null

[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]

[components]

[components.tok2vec]
factory = "tok2vec"

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true

[training]
max_epochs = 30
# patience is counted in training steps, not epochs; 1600 is spaCy's default
patience = 1600

[training.optimizer]
@optimizers = "Adam.v1"

Step 3: Train and evaluate

python -m spacy train config.cfg --output ./models/product-ner --paths.train data/train.spacy --paths.dev data/dev.spacy --gpu-id -1
python -m spacy evaluate ./models/product-ner/model-best data/test.spacy

Step 4: Hybrid inference (regex + ML)

import spacy

nlp_ner = spacy.load("./models/product-ner/model-best")

def extract_entities_hybrid(text: str) -> list[EntitySpan]:
    """
    Run both regex and ML NER, merge results (regex takes precedence for high-confidence matches).
    """
    regex_entities = extract_entities_regex(text)
    
    # Only run ML NER on text segments not already covered by high-confidence regex matches
    high_conf_spans = {(e.start, e.end) for e in regex_entities if e.confidence >= 0.9}
    
    doc = nlp_ner(text)
    ml_entities = []
    
    for ent in doc.ents:
        overlap = any(
            ent.start_char < end and ent.end_char > start
            for start, end in high_conf_spans
        )
        
        if not overlap:
            ml_entities.append(EntitySpan(
                text=ent.text,
                entity_type=ent.label_,
                start=ent.start_char,
                end=ent.end_char,
                confidence=0.75,  # ML model base confidence
                # Normalize "GTX Pro 2000" to "GTX-PRO-2000" before the catalog lookup
                metadata=PRODUCT_CATALOG.get(ent.text.upper().replace(" ", "-"), {})
            ))
    
    return sorted(regex_entities + ml_entities, key=lambda e: e.start)

The Trap - training on synthetic data only: Fine-tuning a spaCy model on synthetic examples (“our product is the GTX-Pro-2000”) produces a model that fails on real customer speech (“I have the GTX pro two thousand” / “the G T X pro twenty hundred”). Always include real conversation transcripts in your training data - with natural speech disfluencies, mishearing variants, and informal references. Aim for 80% real / 20% synthetic in your training corpus.
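
The 80/20 guideline can be enforced when the corpus is assembled rather than checked by eye. The sketch below keeps every real example and samples only as many synthetic ones as the target ratio allows; `mix_corpus` and its parameters are hypothetical names, and the fixed seed is there only to make the split reproducible.

```python
import random

def mix_corpus(real: list, synthetic: list, real_share: float = 0.8, seed: int = 13) -> list:
    """Combine all real examples with just enough synthetic ones to hit real_share."""
    if not 0 < real_share <= 1:
        raise ValueError("real_share must be in (0, 1]")
    # n_real / (n_real + n_syn) = real_share  =>  n_syn = n_real * (1 - share) / share
    wanted = round(len(real) * (1 - real_share) / real_share)
    n_synthetic = min(len(synthetic), wanted)
    rng = random.Random(seed)
    mixed = list(real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)
    return mixed

if __name__ == "__main__":
    real_examples = [f"real-{i}" for i in range(80)]
    synthetic_examples = [f"syn-{i}" for i in range(100)]
    corpus = mix_corpus(real_examples, synthetic_examples)
    print(len(corpus), sum(item.startswith("real") for item in corpus))
```

If there are fewer synthetic examples than the ratio calls for, the function simply uses all of them, so the real share can only end up higher than requested, never lower.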


4. Real-Time Integration with Genesys Cloud

The NER service runs as a microservice receiving transcript segments from the Genesys Cloud Notification API:

from fastapi import FastAPI, WebSocket
import asyncio

app = FastAPI()

@app.websocket("/ner/stream")
async def ner_websocket_endpoint(websocket: WebSocket):
    """
    Real-time NER endpoint for agent desktop integration.
    Receives transcript segments, returns entity annotations.
    """
    await websocket.accept()
    
    try:
        while True:
            data = await websocket.receive_json()
            
            transcript_segment = data.get("text", "")
            conversation_id = data.get("conversationId")
            speaker = data.get("speaker", "customer")
            
            # Only extract from customer speech (not agent)
            if speaker != "customer":
                await websocket.send_json({"entities": [], "speaker": speaker})
                continue
            
            # Run hybrid NER
            entities = extract_entities_hybrid(transcript_segment)
            
            # Filter to high-confidence entities only for real-time
            confident_entities = [e for e in entities if e.confidence >= 0.75]
            
            # Enrich with product catalog data
            response = {
                "conversationId": conversation_id,
                "segment": transcript_segment,
                "entities": [
                    {
                        "text": e.text,
                        "type": e.entity_type,
                        "confidence": e.confidence,
                        "metadata": e.metadata,
                        "start": e.start,
                        "end": e.end
                    }
                    for e in confident_entities
                ]
            }
            
            await websocket.send_json(response)
    
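    except Exception:
        # Client disconnects land here too. Closing a socket that is already
        # closed can raise, so ignore errors from the cleanup itself.
        try:
            await websocket.close()
        except Exception:
            pass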

The agent desktop Client App subscribes to this WebSocket and populates the product panel whenever a PRODUCT_SKU entity is detected with confidence >= 0.85.


5. Post-Call Analytics: Entity Extraction at Scale

For batch post-call processing (extracting product mentions from all conversations for business intelligence):

def process_conversation_transcript(transcript: str, conversation_id: str) -> dict:
    """Extract entities from a complete call transcript for BI pipeline."""
    entities = extract_entities_hybrid(transcript)
    
    return {
        "conversationId": conversation_id,
        "entities": [
            {
                "type": e.entity_type,
                "value": e.text,
                "confidence": e.confidence,
                "productName": e.metadata.get("name"),
                "productCategory": e.metadata.get("category"),
                "productStatus": e.metadata.get("status")
            }
            for e in entities
            if e.confidence >= 0.70
        ],
        "productsMentioned": list({
            e.metadata.get("name")
            for e in entities
            if e.entity_type == "PRODUCT_SKU" and e.metadata.get("name")
        }),
        "containsDiscontinuedProduct": any(
            e.metadata.get("status") == "Discontinued"
            for e in entities
            if e.entity_type == "PRODUCT_SKU"
        )
    }

This enables BI queries like “which discontinued products are still being mentioned in 5% of support calls” - surfacing products that need an end-of-life communication campaign.
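
As a rough illustration of that query, the per-conversation records returned by process_conversation_transcript can be aggregated into a mention rate per discontinued product. This is a sketch only: `discontinued_mention_rates` is a hypothetical helper, and it assumes the records are held in memory as a list of dictionaries with the field names used above.

```python
from collections import Counter

def discontinued_mention_rates(records: list[dict]) -> dict[str, float]:
    """Share of conversations that mention each discontinued product at least once."""
    if not records:
        return {}
    calls_per_product = Counter()
    for record in records:
        names = {
            entity["productName"]
            for entity in record["entities"]
            if entity.get("productStatus") == "Discontinued" and entity.get("productName")
        }
        # A set is used so that each conversation counts once per product
        calls_per_product.update(names)
    return {name: count / len(records) for name, count in calls_per_product.items()}

if __name__ == "__main__":
    modem = {"productName": "XT 500 Blaze Modem", "productStatus": "Discontinued"}
    router = {"productName": "GTX Pro 2000 Router", "productStatus": "Active"}
    sample = [
        {"conversationId": "c1", "entities": [modem, modem]},
        {"conversationId": "c2", "entities": [router]},
        {"conversationId": "c3", "entities": []},
        {"conversationId": "c4", "entities": []},
    ]
    print(discontinued_mention_rates(sample))
```

At production volume the same aggregation would more likely run as a query in your warehouse; the logic to preserve is counting distinct conversations rather than raw mentions.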


Validation, Edge Cases & Troubleshooting

Edge Case 1: Customer Spelling Out Product Codes Letter-by-Letter

Customers frequently spell codes: “That’s G-T-X-Pro dash two thousand.” ASR transcribes this as “G T X Pro two thousand” - which doesn’t match GTX-Pro-2000. Add a pre-processing step that collapses single-letter sequences and converts number words: "G T X" → "GTX", "two thousand" → "2000". A simple regex-and-number-word-map pre-processor before NER dramatically improves recall for spelled-out codes.
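
A pre-processor along these lines might look like the sketch below. It makes several assumptions that you should check against your own ASR output: spelled letters arrive upper-cased and space-separated, at least three letters in a row indicate a spelled code, and `NUMBER_WORDS` is a tiny hypothetical map that a real deployment would replace with a fuller number-word converter.

```python
import re

# Hypothetical, deliberately tiny map; extend it or swap in a number-word library.
NUMBER_WORDS = {"two thousand": "2000", "five hundred": "500", "three": "3"}

# Three or more space-separated capital letters, e.g. "G T X"
SPELLED_LETTERS = re.compile(r"\b(?:[A-Z] ){2,}[A-Z]\b")

def normalize_spoken_code(text: str) -> str:
    """Collapse spelled-out letters and convert known number phrases to digits."""
    text = SPELLED_LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)
    # Longest phrases first so "two thousand" wins over any shorter overlapping entry
    for phrase in sorted(NUMBER_WORDS, key=len, reverse=True):
        pattern = rf"\b{re.escape(phrase)}\b"
        text = re.sub(pattern, NUMBER_WORDS[phrase], text, flags=re.IGNORECASE)
    return text

def to_catalog_key(mention: str) -> str:
    """Turn 'GTX Pro 2000' into 'GTX-PRO-2000' for a dictionary lookup."""
    return re.sub(r"[\s-]+", "-", mention.strip()).upper()

if __name__ == "__main__":
    print(normalize_spoken_code("I have the G T X Pro two thousand"))
    print(to_catalog_key("GTX Pro 2000"))
```

Note that the normalized text still uses spaces where the catalog code has hyphens, so the hyphenated regex patterns will not fire on it; the second helper shows one way to bridge that gap at lookup time. Requiring capital letters keeps ordinary words such as "a" out of the collapse, at the cost of missing lower-cased spellings.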

Edge Case 2: False Positives on Common Words Matching Patterns

The pattern [A-Z]{2,4}-[A-Za-z]+-\d{3,4} also matches ordinary phrases like “US-Based-2024” or “EU-Region-100”. Use your product catalog dictionary as the authoritative filter: if the regex match isn’t in the catalog, keep its confidence low (the extractor in section 2 assigns 0.65) and only surface it to agents with a “Did you mean?” UX rather than auto-populating the product panel. Because the real-time endpoint in section 4 drops entities below 0.75, lower that cutoff if these suggestions should reach the agent desktop.
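
The routing decision described here can be isolated in one small function so that thresholds live in a single place. The sketch below assumes the 0.85 auto-populate threshold mentioned for the agent desktop; `route_entity` and the three action names are hypothetical.

```python
AUTO_POPULATE_THRESHOLD = 0.85  # threshold stated for the agent desktop in this guide

def route_entity(entity_type: str, confidence: float, metadata: dict) -> str:
    """Decide what the agent desktop should do with one extracted entity.

    Returns "auto_populate", "suggest" (a "Did you mean?" prompt), or "ignore".
    """
    if entity_type != "PRODUCT_SKU":
        return "ignore"  # only product codes drive the product panel
    # Non-empty metadata means the code was confirmed against the catalog
    if metadata and confidence >= AUTO_POPULATE_THRESHOLD:
        return "auto_populate"
    return "suggest"

if __name__ == "__main__":
    print(route_entity("PRODUCT_SKU", 0.99, {"name": "GTX Pro 2000 Router"}))
    print(route_entity("PRODUCT_SKU", 0.65, {}))
    print(route_entity("CASE_REF", 0.85, {}))
```

Keeping catalog confirmation as a separate condition from the confidence score means a pattern-only match can never auto-populate the panel, even if someone later raises its base confidence.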

Edge Case 3: Model Drift as New Products Launch

New product codes launched after the NER model’s training cutoff are invisible to the ML model. The hybrid approach saves you here: new product codes are added to the product catalog dictionary, and the regex layer picks them up immediately without retraining. Schedule ML model retraining quarterly, incorporating transcripts that mention new products that were only caught by the dictionary layer.
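
Selecting those transcripts for the quarterly retraining set can be automated by comparing the two layers' outputs. The following sketch assumes each record carries character spans from both layers under the hypothetical keys `dictionary_spans` and `ml_spans`; the record layout and function names are illustrative, not something the pipeline above already produces.

```python
Span = tuple[int, int]  # (start_char, end_char)

def overlaps(a: Span, b: Span) -> bool:
    """True when two character spans share at least one character."""
    return a[0] < b[1] and a[1] > b[0]

def select_retraining_candidates(batch: list[dict]) -> list[str]:
    """Return IDs of conversations where a dictionary hit had no overlapping ML entity."""
    selected = []
    for record in batch:
        missed = [
            dict_span
            for dict_span in record["dictionary_spans"]
            if not any(overlaps(dict_span, ml_span) for ml_span in record["ml_spans"])
        ]
        if missed:
            selected.append(record["conversationId"])
    return selected

if __name__ == "__main__":
    batch = [
        {"conversationId": "c1", "dictionary_spans": [(10, 22)], "ml_spans": []},
        {"conversationId": "c2", "dictionary_spans": [(5, 17)], "ml_spans": [(5, 17)]},
    ]
    print(select_retraining_candidates(batch))
```

The selected conversations still need human review before they enter the training set, since a dictionary hit confirms the code but not the exact span boundaries an annotator would choose.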

Edge Case 4: Multi-Language Product Mentions

In multilingual contact centers, customers may mention product codes in a non-English sentence: “Mi router GTX-Pro-2000 está caído.” The regex/dictionary layer handles this correctly (product codes are language-agnostic). The ML spaCy model is language-specific - maintain separate models per language, or use a multilingual transformer base (spacy-transformers with xlm-roberta-base) that handles code-switching natively.

