Implementing Custom Named Entity Recognition (NER) for Extracting Proprietary Product Codes
What This Guide Covers
You are building a custom Named Entity Recognition (NER) pipeline that identifies your organization’s proprietary identifiers in real-time conversation transcripts: internal product SKUs (e.g., “GTX-Pro-2000”), service plan codes (“Enterprise-Tier-3”), model numbers, internal case reference formats, and contract IDs. Standard off-the-shelf NER models (spaCy, AWS Comprehend) have never seen these entities and cannot recognize them. When complete, every time a customer mentions a product code in voice or chat, the agent desktop automatically populates the product details panel from your product catalog, reducing average handle time by eliminating manual product lookup.
Prerequisites, Roles & Licensing
- Genesys Cloud: CX 2 or CX 3 with Speech and Text Analytics or a third-party NLU integration
- NER approach: Rule-based (regex + entity dictionaries) for structured codes; ML-based (fine-tuned spaCy or a transformer) for freeform product mentions
- Integration point: Post-call analytics pipeline, or real-time via Genesys Cloud transcript streaming API + a sidecar NER service
- Hosting: Lambda, Cloud Run, or a container - the NER service must keep inference latency under 50 ms for real-time use
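One way to check the latency budget before wiring anything into the agent desktop is a small timing harness. The sketch below is an assumption about how you might measure it, not part of any Genesys tooling: `p95_latency_ms` is a hypothetical helper that times any callable taking a text segment and reports the 95th-percentile latency in milliseconds.

```python
import statistics
import time


def p95_latency_ms(fn, inputs, warmup: int = 3) -> float:
    """Time fn(text) over inputs and return the 95th-percentile latency in ms."""
    # Warm-up calls are not measured (first calls often pay one-off costs)
    for text in inputs[:warmup]:
        fn(text)
    samples = []
    for text in inputs:
        started = time.perf_counter()
        fn(text)
        samples.append((time.perf_counter() - started) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    return statistics.quantiles(samples, n=20)[-1]
```

Run it against a representative sample of real transcript segments on the target hosting option; a p95 comfortably below 50 ms leaves room for network overhead.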
The Implementation Deep-Dive
1. Entity Type Design for Contact Center Identifiers
Before building the NER system, formally define your entity types and their expected patterns. The more precisely each type is defined, the higher your precision:
| Entity Type | Example | Pattern |
|---|---|---|
| PRODUCT_SKU | GTX-Pro-2000, XT-500-B | [A-Z]{2,4}-([A-Za-z]+-\d{3,4}[A-Z]?\|\d{3,4}-[A-Z]) |
| CONTRACT_ID | CNT-2024-001234 | CNT-\d{4}-\d{6} |
| CASE_REF | CASE-98765, TKT-12345 | (CASE\|TKT)-\d{5,6} |
| SERVICE_PLAN | Enterprise-Tier-3, SMB-Pro | (Enterprise\|SMB\|Starter\|Pro)-[A-Za-z]+-?\d? |
| SERIAL_NUMBER | SN: 4X7H-2091-KL | SN[:\s]*[0-9A-Z]{4}-\d{4}-[A-Z]{2} |
| EMPLOYEE_CODE | EMP-00456 | EMP-\d{5} |
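Before relying on a pattern, it helps to check it against its own documented examples. The following is a minimal sketch of such a check for a subset of the entity types above; `failing_examples` is a hypothetical helper name, and the patterns and examples are copied from the table.

```python
import re

# A subset of the entity types from the table, with their documented examples
PATTERNS = {
    "CONTRACT_ID": r"CNT-\d{4}-\d{6}",
    "CASE_REF": r"(CASE|TKT)-\d{5,6}",
    "EMPLOYEE_CODE": r"EMP-\d{5}",
}
EXAMPLES = {
    "CONTRACT_ID": ["CNT-2024-001234"],
    "CASE_REF": ["CASE-98765", "TKT-12345"],
    "EMPLOYEE_CODE": ["EMP-00456"],
}


def failing_examples(patterns: dict, examples: dict) -> list:
    """Return (entity_type, example) pairs where the example does not fully match."""
    return [
        (entity_type, example)
        for entity_type, values in examples.items()
        for example in values
        if re.fullmatch(patterns[entity_type], example) is None
    ]
```

Running this whenever a pattern or example changes catches the common case where the table and the regex drift apart.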
Supplement regex patterns with a curated entity dictionary - a lookup table of all valid values:
# Loaded from your product database (e.g., exported as product_catalog.json).
# Keys are upper-cased because lookups normalize candidates with .upper().
PRODUCT_CATALOG = {
    "GTX-PRO-2000": {"name": "GTX Pro 2000 Router", "category": "Networking", "status": "Active"},
    "XT-500-B": {"name": "XT 500 Blaze Modem", "category": "Modem", "status": "Discontinued"},
    "ENTERPRISE-TIER-3": {"name": "Enterprise Support Tier 3", "category": "Service Plan"},
}
Dictionary lookup is faster and more precise than regex alone - and crucially, it validates that the extracted code is a real entity, not just a string that matches the pattern.
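Because the extraction code upper-cases a matched string before looking it up, the catalog keys need the same normalization. A minimal sketch of a loader that enforces this, assuming the catalog arrives as a JSON object keyed by product code (`load_catalog` and `lookup_code` are hypothetical helper names):

```python
import json


def load_catalog(raw_json: str) -> dict:
    """Parse catalog JSON and index every entry by its upper-cased code."""
    return {code.upper(): details for code, details in json.loads(raw_json).items()}


def lookup_code(catalog: dict, candidate: str):
    """Return the catalog entry for a candidate code, or None if it is unknown."""
    return catalog.get(candidate.strip().upper())
```

Normalizing at load time means the database export can keep its display casing ("GTX-Pro-2000") while lookups stay case-insensitive.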
2. Rule-Based NER with Regex + Dictionary Validation
import re
from dataclasses import dataclass


@dataclass
class EntitySpan:
    text: str
    entity_type: str
    start: int
    end: int
    confidence: float
    metadata: dict


# Ordered by specificity - more specific patterns first
ENTITY_PATTERNS = [
    ("CONTRACT_ID", r"\bCNT-\d{4}-\d{6}\b"),
    ("CASE_REF", r"\b(CASE|TKT)-\d{5,6}\b"),
    # Two SKU shapes: letters-then-digits (GTX-Pro-2000) and digits-then-letter (XT-500-B)
    ("PRODUCT_SKU", r"\b[A-Z]{2,4}-(?:[A-Za-z]+-\d{3,4}[A-Z]?|\d{3,4}-[A-Z])\b"),
    ("SERVICE_PLAN", r"\b(Enterprise|SMB|Starter|Pro)-[A-Za-z]+-?\d?\b"),
    # [:\s]* accepts "SN4X7H-...", "SN:4X7H-..." and "SN: 4X7H-..."
    ("SERIAL_NUMBER", r"\bSN[:\s]*([0-9A-Z]{4}-\d{4}-[A-Z]{2})\b"),
    ("EMPLOYEE_CODE", r"\bEMP-\d{5}\b"),
]


def extract_entities_regex(text: str) -> list[EntitySpan]:
    """Extract entities using regex patterns."""
    entities = []
    matched_spans = []  # Track to avoid overlapping matches
    for entity_type, pattern in ENTITY_PATTERNS:
        for match in re.finditer(pattern, text, re.IGNORECASE):
            # Check for overlap with already matched spans
            overlap = any(
                match.start() < end and match.end() > start
                for start, end in matched_spans
            )
            if overlap:
                continue
            matched_text = match.group(0)
            # Dictionary validation for PRODUCT_SKU
            confidence = 0.85  # Base regex confidence
            metadata = {}
            if entity_type == "PRODUCT_SKU":
                normalized = matched_text.upper()  # catalog keys are upper-cased
                if normalized in PRODUCT_CATALOG:
                    confidence = 0.99  # Dictionary-confirmed
                    metadata = PRODUCT_CATALOG[normalized]
                else:
                    confidence = 0.65  # Pattern match only - not in catalog
            entities.append(EntitySpan(
                text=matched_text,
                entity_type=entity_type,
                start=match.start(),
                end=match.end(),
                confidence=confidence,
                metadata=metadata,
            ))
            matched_spans.append((match.start(), match.end()))
    # Sort by position in text
    return sorted(entities, key=lambda e: e.start)
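The overlap rule in the extractor is "first pattern to claim a span wins", which is why pattern order matters. A stripped-down sketch of just that rule, using plain `(start, end, label)` tuples with half-open offsets (`resolve_overlaps` is a hypothetical name, not part of the code above):

```python
def resolve_overlaps(candidates):
    """Keep candidates in priority order, dropping any that overlap a kept span.

    candidates: iterable of (start, end, label) with end exclusive,
    listed from highest to lowest priority.
    """
    kept = []
    for start, end, label in candidates:
        # Two half-open spans overlap when each one starts before the other ends
        if any(start < k_end and end > k_start for k_start, k_end, _ in kept):
            continue
        kept.append((start, end, label))
    return sorted(kept)  # back into text order
```

Adjacent spans such as `(0, 15)` and `(15, 19)` do not count as overlapping, so two codes separated by nothing but a boundary are both kept.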
3. ML-Based NER with Fine-Tuned spaCy for Fuzzy Matches
For product mentions that don’t follow rigid patterns (“the pro two thousand model”, “the enterprise three plan”), rule-based NER fails. Fine-tune a spaCy NER model on your annotated conversation data:
Step 1: Create training data from annotated transcripts
import warnings

import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans


def create_training_data(annotated_examples: list[dict]) -> DocBin:
    """
    annotated_examples: [
        {
            "text": "Customer has the GTX Pro 2000 router",
            "entities": [(17, 29, "PRODUCT_SKU")]  # character offsets, end exclusive
        },
        ...
    ]
    """
    nlp = spacy.blank("en")
    doc_bin = DocBin()
    for example in annotated_examples:
        doc = nlp.make_doc(example["text"])
        ents = []
        for start_char, end_char, label in example["entities"]:
            span = doc.char_span(start_char, end_char, label=label)
            if span is None:
                # char_span returns None when offsets do not align with token boundaries
                warnings.warn(
                    f"Skipping misaligned span {start_char}-{end_char} in: {example['text']!r}"
                )
                continue
            ents.append(span)
        doc.ents = filter_spans(ents)
        doc_bin.add(doc)
    return doc_bin
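Annotation offsets that are off by one are easy to miss, because a span that does not align with token boundaries simply yields no entity. A cheap pre-check, sketched here with the standard library only, flags spans that are out of range or that begin or end on whitespace. It is a heuristic under that assumption, not a full tokenizer-boundary check, and `misaligned_annotations` is a hypothetical helper name.

```python
def misaligned_annotations(examples):
    """Return (text, start, end, label) for spans that look mis-annotated.

    A span is flagged when it is empty, falls outside the text,
    or starts/ends on a whitespace character.
    """
    problems = []
    for example in examples:
        text = example["text"]
        for start, end, label in example["entities"]:
            in_range = 0 <= start < end <= len(text)
            span = text[start:end] if in_range else ""
            if not in_range or span != span.strip():
                problems.append((text, start, end, label))
    return problems
```

For the sentence "Customer has the GTX Pro 2000 router", the span `(17, 29)` covers exactly "GTX Pro 2000", while `(16, 29)` starts on the preceding space and is reported.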
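Step 2: Training configuration (config.cfg). The listing below is partial; if spaCy reports missing settings, complete it with python -m spacy init fill-config before training.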
[paths]
train = ./data/train.spacy
dev = ./data/dev.spacy
[system]
gpu_allocator = null
[nlp]
lang = "en"
pipeline = ["tok2vec", "ner"]
[components]
[components.tok2vec]
factory = "tok2vec"
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
[training]
max_epochs = 30
patience = 5
[training.optimizer]
@optimizers = "Adam.v1"
Step 3: Train and evaluate
python -m spacy train config.cfg --output ./models/product-ner --paths.train data/train.spacy --paths.dev data/dev.spacy --gpu-id -1
python -m spacy evaluate ./models/product-ner/model-best data/test.spacy
Step 4: Hybrid inference (regex + ML)
import spacy

nlp_ner = spacy.load("./models/product-ner/model-best")


def extract_entities_hybrid(text: str) -> list[EntitySpan]:
    """
    Run both regex and ML NER, merge results (regex takes precedence for high-confidence matches).
    """
    regex_entities = extract_entities_regex(text)
    # The ML model runs on the full text; its entities are dropped wherever
    # they overlap a high-confidence regex match
    high_conf_spans = {(e.start, e.end) for e in regex_entities if e.confidence >= 0.9}
    doc = nlp_ner(text)
    ml_entities = []
    for ent in doc.ents:
        overlap = any(
            ent.start_char < end and ent.end_char > start
            for start, end in high_conf_spans
        )
        if not overlap:
            # Normalize e.g. "GTX Pro 2000" to the catalog key form "GTX-PRO-2000"
            catalog_key = "-".join(ent.text.upper().split())
            ml_entities.append(EntitySpan(
                text=ent.text,
                entity_type=ent.label_,
                start=ent.start_char,
                end=ent.end_char,
                confidence=0.75,  # ML model base confidence
                metadata=PRODUCT_CATALOG.get(catalog_key, {}),
            ))
    return sorted(regex_entities + ml_entities, key=lambda e: e.start)
The Trap - training on synthetic data only: Fine-tuning a spaCy model on synthetic examples (“our product is the GTX-Pro-2000”) produces a model that fails on real customer speech (“I have the GTX pro two thousand” / “the G T X pro twenty hundred”). Always include real conversation transcripts in your training data, with their natural speech disfluencies, misheard variants, and informal references. Aim for 80% real / 20% synthetic in your training corpus.
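The 80/20 target can be enforced when the corpus is assembled. The sketch below keeps every real example and samples only as many synthetic ones as the ratio allows; `mix_corpus` and its parameters are hypothetical names, and the examples can be any objects (for instance the annotated dicts used for training).

```python
import random


def mix_corpus(real, synthetic, real_share: float = 0.8, seed: int = 0):
    """Combine all real examples with just enough synthetic ones to hit real_share."""
    if not 0 < real_share <= 1:
        raise ValueError("real_share must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    wanted = round(len(real) * (1 - real_share) / real_share)
    n_synthetic = min(len(synthetic), wanted)
    mixed = list(real) + rng.sample(list(synthetic), n_synthetic)
    rng.shuffle(mixed)
    return mixed
```

With 80 real and 100 synthetic examples this yields a 100-example corpus containing 20 synthetic items; the surplus synthetic data is simply left out.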
4. Real-Time Integration with Genesys Cloud
The NER service runs as a microservice receiving transcript segments from the Genesys Cloud Notification API:
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


@app.websocket("/ner/stream")
async def ner_websocket_endpoint(websocket: WebSocket):
    """
    Real-time NER endpoint for agent desktop integration.
    Receives transcript segments, returns entity annotations.
    """
    await websocket.accept()
    try:
        while True:
            data = await websocket.receive_json()
            transcript_segment = data.get("text", "")
            conversation_id = data.get("conversationId")
            speaker = data.get("speaker", "customer")
            # Only extract from customer speech (not agent)
            if speaker != "customer":
                await websocket.send_json({"entities": [], "speaker": speaker})
                continue
            # Run hybrid NER
            entities = extract_entities_hybrid(transcript_segment)
            # Filter to high-confidence entities only for real-time
            confident_entities = [e for e in entities if e.confidence >= 0.75]
            # Enrich with product catalog data
            response = {
                "conversationId": conversation_id,
                "segment": transcript_segment,
                "entities": [
                    {
                        "text": e.text,
                        "type": e.entity_type,
                        "confidence": e.confidence,
                        "metadata": e.metadata,
                        "start": e.start,
                        "end": e.end,
                    }
                    for e in confident_entities
                ],
            }
            await websocket.send_json(response)
    except WebSocketDisconnect:
        # Client went away; the socket is already closed
        pass
    except Exception:
        await websocket.close(code=1011)  # 1011 = internal error
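The agent desktop Client App subscribes to this WebSocket and populates the product panel whenever a PRODUCT_SKU entity is detected with confidence >= 0.85. With the confidence values assigned above, only dictionary-confirmed SKUs (0.99) clear that bar; pattern-only matches (0.65) and ML-only matches (0.75) do not.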
5. Post-Call Analytics: Entity Extraction at Scale
For batch post-call processing (extracting product mentions from all conversations for business intelligence):
def process_conversation_transcript(transcript: str, conversation_id: str) -> dict:
    """Extract entities from a complete call transcript for BI pipeline."""
    entities = extract_entities_hybrid(transcript)
    return {
        "conversationId": conversation_id,
        "entities": [
            {
                "type": e.entity_type,
                "value": e.text,
                "confidence": e.confidence,
                "productName": e.metadata.get("name"),
                "productCategory": e.metadata.get("category"),
                "productStatus": e.metadata.get("status"),
            }
            for e in entities
            if e.confidence >= 0.70
        ],
        "productsMentioned": list({
            e.metadata.get("name")
            for e in entities
            if e.entity_type == "PRODUCT_SKU" and e.metadata.get("name")
        }),
        "containsDiscontinuedProduct": any(
            e.metadata.get("status") == "Discontinued"
            for e in entities
            if e.entity_type == "PRODUCT_SKU"
        ),
    }
This enables BI queries like “which discontinued products are still mentioned in more than 5% of support calls?” - surfacing products that need an end-of-life communication campaign.
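As an illustration of that kind of query, the sketch below aggregates a list of per-conversation results (in the shape returned by `process_conversation_transcript`) into the share of calls that mention each discontinued product. `discontinued_mention_share` is a hypothetical helper; in production this aggregation would more likely live in your BI tool or warehouse.

```python
from collections import Counter


def discontinued_mention_share(results: list) -> dict:
    """Map each discontinued product name to the fraction of calls mentioning it."""
    calls_per_product = Counter()
    for result in results:
        # A set, so a product mentioned several times in one call counts once
        names = {
            entity["productName"]
            for entity in result["entities"]
            if entity.get("productStatus") == "Discontinued" and entity.get("productName")
        }
        calls_per_product.update(names)
    total = len(results)
    if total == 0:
        return {}
    return {name: count / total for name, count in calls_per_product.items()}
```

Filtering the returned mapping for values above 0.05 gives the products from the example query.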
Validation, Edge Cases & Troubleshooting
Edge Case 1: Customer Spelling Out Product Codes Letter-by-Letter
Customers frequently spell codes: “That’s G-T-X-Pro dash two thousand.” ASR transcribes this as “G T X Pro two thousand” - which doesn’t match GTX-Pro-2000. Add a pre-processing step that collapses single-letter sequences: "G T X" → "GTX", "two thousand" → "2000". A simple regex-and-number-word-map pre-processor before NER dramatically improves recall for spelled-out codes.
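A minimal sketch of such a pre-processor follows. It assumes the ASR emits spelled-out letters as single upper-case characters separated by spaces, and it only handles the "digit word + hundred/thousand" form (so "twenty hundred" is not covered). All names here are hypothetical. The second function goes one step further than the text and matches the normalized words against catalog codes while ignoring separators, since "GTX Pro 2000" still lacks the hyphens of "GTX-PRO-2000".

```python
import re

DIGITS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
          "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
SCALES = {"hundred": "00", "thousand": "000"}

# Runs of single upper-case letters: "G T X". A capitalized "A" or "I" directly
# before a spelled code would be absorbed too; acceptable for a sketch.
_LETTERS = re.compile(r"\b(?:[A-Z] )+[A-Z]\b")
_NUMBER = re.compile(
    r"\b(%s) (%s)\b" % ("|".join(DIGITS), "|".join(SCALES)), re.IGNORECASE
)


def normalize_spelled_code(text: str) -> str:
    """Collapse 'G T X' to 'GTX' and 'two thousand' to '2000'."""
    text = _LETTERS.sub(lambda m: m.group(0).replace(" ", ""), text)
    return _NUMBER.sub(
        lambda m: DIGITS[m.group(1).lower()] + SCALES[m.group(2).lower()], text
    )


def _squash(value: str) -> str:
    """Drop everything except letters and digits, then upper-case."""
    return re.sub(r"[^A-Za-z0-9]", "", value).upper()


def find_catalog_code(text: str, catalog_codes, max_tokens: int = 4):
    """Return the first catalog code found in text, ignoring separators and case."""
    index = {_squash(code): code for code in catalog_codes}
    tokens = normalize_spelled_code(text).split()
    for size in range(max_tokens, 0, -1):  # prefer longer windows
        for i in range(len(tokens) - size + 1):
            key = _squash("".join(tokens[i:i + size]))
            if key in index:
                return index[key]
    return None
```

The window scan is linear in transcript length for a fixed `max_tokens`, so it is cheap enough to run on every customer segment before the regex layer.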
Edge Case 2: False Positives on Common Words Matching Patterns
The PRODUCT_SKU pattern also matches ordinary phrases such as “US-Based-2024” or “EU-Region-100”. Use your product catalog dictionary as the authoritative filter: if a regex match isn’t in the catalog, downgrade its confidence (the extractor above assigns 0.65) and surface it to agents only as a “Did you mean?” suggestion rather than auto-populating the product panel. Because the real-time endpoint drops entities below 0.75, these suggestions need their own, lower threshold.
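The routing decision can be made explicit on the consumer side. This is a sketch only: `panel_action` and its return values are hypothetical, and the 0.85 threshold is the agent-desktop threshold mentioned earlier.

```python
def panel_action(entity_type: str, confidence: float, in_catalog: bool) -> str:
    """Decide what the agent desktop does with a detected entity.

    Returns one of: "auto_populate", "suggest", "ignore".
    """
    if entity_type != "PRODUCT_SKU":
        return "ignore"  # only SKUs drive the product panel
    if in_catalog and confidence >= 0.85:
        return "auto_populate"
    # Pattern-only or low-confidence match: ask instead of assuming
    return "suggest"
```

Keeping this logic in one function makes it easy to tune the threshold or add a third tier without touching the extraction service.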
Edge Case 3: Model Drift as New Products Launch
New product codes launched after the NER model’s training cutoff are invisible to the ML model. The hybrid approach saves you here: once new codes are added to the product catalog dictionary, the regex + dictionary layer recognizes them immediately, with no retraining. Schedule ML model retraining quarterly, incorporating transcripts that mention new products and were caught only by the dictionary layer.
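Selecting those transcripts can be automated if you log, per transcript, which codes each layer found. The sketch below assumes such a log exists with the hypothetical fields `text`, `dictionary_codes`, and `ml_codes`; neither the field names nor `retraining_candidates` come from the pipeline above.

```python
def retraining_candidates(records) -> list:
    """Return transcripts where the dictionary layer found codes the ML model missed."""
    candidates = []
    for record in records:
        missed = set(record["dictionary_codes"]) - set(record["ml_codes"])
        if missed:
            candidates.append({"text": record["text"], "missed_codes": sorted(missed)})
    return candidates
```

The resulting list is a natural annotation queue for the next quarterly retraining run.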
Edge Case 4: Multi-Language Product Mentions
In multilingual contact centers, customers may mention product codes in a non-English sentence: “Mi router GTX-Pro-2000 está caído.” The regex/dictionary layer handles this correctly (product codes are language-agnostic). The ML spaCy model is language-specific - maintain separate models per language, or use a multilingual transformer base (spacy-transformers with xlm-roberta-base) that handles code-switching natively.