Implementing Cross-Channel Customer Identification using Probabilistic Identity Matching
What This Guide Covers
You are building a probabilistic identity resolution system that links disparate customer records across all Genesys Cloud interaction channels (voice via ANI, web chat via session cookie or email, email via sender address, WhatsApp and SMS via phone number) into a single unified Customer Identity, even when customers don’t provide consistent identifying information across channels. When complete, your system will automatically detect that the voice caller with ANI +15555551234 is the same person who previously chatted with email john.doe@example.com, using a probabilistic matching algorithm that computes match confidence scores rather than requiring exact field matches.
Prerequisites, Roles & Licensing
- Genesys Cloud: CX 2 or 3 with Omnichannel.
- Permissions required:
  - Analytics > Conversation Detail > View
- Infrastructure:
  - A customer identity store (DynamoDB or PostgreSQL).
  - A probabilistic matching service (Splink, or a custom Jaro-Winkler/phonetic distance implementation).
  - An event-driven pipeline that updates identity records on every new interaction.
The Implementation Deep-Dive
1. The Multi-Channel Identity Problem
Customers interact across channels without consistent identifiers:
| Interaction | Identifier Available | Resolution Challenge |
|---|---|---|
| Voice call (Monday) | ANI: +15555551234 | Phone number only |
| Web chat (Tuesday) | Email: john.doe@example.com | No phone number |
| Email inquiry (Wednesday) | Email: j.doe@example.com | Different email variant |
| WhatsApp (Friday) | Phone: +15555551234 | Matches Monday phone |
Without identity resolution, these are 4 separate customers in your analytics system. A churn prediction model built on this data sees “4 customers with low contact history” instead of “1 customer with 4 contacts this week who is likely frustrated.”
2. The Probabilistic Matching Model
Probabilistic matching assigns weights to each field comparison:
from dataclasses import dataclass
from typing import Optional

import jellyfish  # Jaro-Winkler similarity
import phonenumbers


@dataclass
class CustomerRecord:
    record_id: str
    email: Optional[str] = None
    email_normalized: Optional[str] = None
    phone: Optional[str] = None
    phone_normalized: Optional[str] = None
    full_name: Optional[str] = None
    ip_address: Optional[str] = None
    cookie_id: Optional[str] = None


def normalize_phone(raw_phone: str, default_region: str = "US") -> Optional[str]:
    """Normalizes a phone number to E.164 format, or returns None if it is invalid."""
    try:
        parsed = phonenumbers.parse(raw_phone, default_region)
    except phonenumbers.NumberParseException:
        # Unparseable input is treated as "no phone available"
        return None
    if phonenumbers.is_valid_number(parsed):
        return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    return None


def normalize_email(email: str) -> Optional[str]:
    """Normalizes an email for fuzzy comparison, or returns None if it has no '@'."""
    email = email.lower().strip()
    if '@' not in email:
        return None
    # Handle + aliases (john+work@example.com → john@example.com)
    local, domain = email.rsplit('@', 1)
    local = local.split('+')[0]
    return f"{local}@{domain}"


def compute_match_score(record_a: CustomerRecord, record_b: CustomerRecord) -> float:
    """
    Computes a probabilistic match score between two customer records.
    Returns a score from 0.0 (no match) to 1.0 (exact match).
    Only fields present on both records contribute to the score.
    """
    total_weight = 0.0
    scored_weight = 0.0

    # Phone comparison (high weight - very discriminating)
    if record_a.phone_normalized and record_b.phone_normalized:
        total_weight += 0.45
        if record_a.phone_normalized == record_b.phone_normalized:
            scored_weight += 0.45  # Exact match
        # (No partial credit for phone - numbers are either right or wrong)

    # Email comparison (high weight)
    if record_a.email_normalized and record_b.email_normalized:
        total_weight += 0.40
        if record_a.email_normalized == record_b.email_normalized:
            scored_weight += 0.40  # Exact normalized match
        else:
            # Jaro-Winkler similarity gives partial credit for near-identical
            # addresses (e.g. a one-character typo). Shorter variants such as
            # j.doe vs john.doe can score below the 0.92 cutoff.
            similarity = jellyfish.jaro_winkler_similarity(
                record_a.email_normalized, record_b.email_normalized
            )
            if similarity > 0.92:
                scored_weight += 0.40 * similarity  # Partial credit for near-matches

    # Name comparison (lower weight - names can be nicknames/maiden names)
    if record_a.full_name and record_b.full_name:
        total_weight += 0.10
        name_similarity = jellyfish.jaro_winkler_similarity(
            record_a.full_name.lower(), record_b.full_name.lower()
        )
        if name_similarity > 0.85:
            scored_weight += 0.10 * name_similarity

    # Cookie/session ID (exact match only)
    if record_a.cookie_id and record_b.cookie_id:
        total_weight += 0.05
        if record_a.cookie_id == record_b.cookie_id:
            scored_weight += 0.05

    # No field was present on both records, so there is nothing to compare
    if total_weight == 0:
        return 0.0
    return scored_weight / total_weight
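Because the score is divided by the weight of only those fields present on both records, the outcome depends heavily on which fields overlap. The following minimal sketch reproduces just that weighting arithmetic with precomputed per-field credit. The helper name `weighted_score` and its input shape are illustrative assumptions; the weights are copied from `compute_match_score`.

```python
from typing import Dict, Optional

# Field weights copied from compute_match_score above.
WEIGHTS: Dict[str, float] = {"phone": 0.45, "email": 0.40, "name": 0.10, "cookie": 0.05}


def weighted_score(credits: Dict[str, Optional[float]]) -> float:
    """Hypothetical helper: combine per-field credit (0.0 to 1.0) into one score.

    A credit of None means the field is missing on at least one record, so its
    weight is left out of the denominator instead of counting against the pair.
    """
    total_weight = 0.0
    scored_weight = 0.0
    for field, credit in credits.items():
        if credit is None:
            continue
        total_weight += WEIGHTS[field]
        scored_weight += WEIGHTS[field] * credit
    return scored_weight / total_weight if total_weight else 0.0


# Voice call vs. WhatsApp: only the phone number is present on both sides.
phone_only = weighted_score({"phone": 1.0, "email": None, "name": None, "cookie": None})

# Same phone number, but the two records carry clearly different emails.
phone_vs_email = weighted_score({"phone": 1.0, "email": 0.0, "name": None, "cookie": None})

print(round(phone_only, 3), round(phone_vs_email, 3))
```

Under these weights, a shared phone number alone scores 1.0, while a shared phone number combined with a conflicting email scores 0.45 / 0.85, roughly 0.53, which is well below a 0.85 match threshold.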
3. The Identity Resolution Pipeline
import uuid
from typing import Optional

import boto3

DYNAMODB = boto3.resource('dynamodb')
IDENTITY_TABLE = DYNAMODB.Table('customer-identity-graph')
MATCH_THRESHOLD = 0.85  # Scores at or above this = same customer


def resolve_or_create_identity(new_record: CustomerRecord) -> str:
    """
    For each new interaction, finds the existing customer identity or creates a new one.
    Returns the resolved customerId.
    """
    # 1. Search for candidates by exact phone or email lookup (fast path).
    #    The index key attributes use the same camelCase names as the stored items.
    candidates = []
    if new_record.phone_normalized:
        result = IDENTITY_TABLE.query(
            IndexName='phone-index',
            KeyConditionExpression='phoneNormalized = :phone',
            ExpressionAttributeValues={':phone': new_record.phone_normalized}
        )
        candidates.extend(result.get('Items', []))
    if new_record.email_normalized:
        result = IDENTITY_TABLE.query(
            IndexName='email-index',
            KeyConditionExpression='emailNormalized = :email',
            ExpressionAttributeValues={':email': new_record.email_normalized}
        )
        candidates.extend(result.get('Items', []))

    # 2. Score all candidates and keep the best one at or above the threshold
    best_match_id: Optional[str] = None
    best_score = 0.0
    for candidate_item in candidates:
        candidate = CustomerRecord(
            record_id=candidate_item['customerId'],
            phone_normalized=candidate_item.get('phoneNormalized'),
            email_normalized=candidate_item.get('emailNormalized'),
            full_name=candidate_item.get('fullName'),
            cookie_id=candidate_item.get('cookieId')
        )
        score = compute_match_score(new_record, candidate)
        if score > best_score and score >= MATCH_THRESHOLD:
            best_score = score
            best_match_id = candidate_item['customerId']

    if best_match_id:
        # 3a. Merge new identifiers into the existing identity
        update_existing_identity(best_match_id, new_record)
        return best_match_id
    else:
        # 3b. Create a new identity
        new_customer_id = str(uuid.uuid4())
        create_new_identity(new_customer_id, new_record)
        return new_customer_id


def update_existing_identity(customer_id: str, new_record: CustomerRecord):
    """Enriches an existing identity with new identifiers from this interaction."""
    # Note: SET overwrites any value already stored for the attribute.
    updates = []
    values = {}
    if new_record.phone_normalized:
        updates.append('phoneNormalized = :phone')
        values[':phone'] = new_record.phone_normalized
    if new_record.email_normalized:
        updates.append('emailNormalized = :email')
        values[':email'] = new_record.email_normalized
    if updates:
        IDENTITY_TABLE.update_item(
            Key={'customerId': customer_id},
            UpdateExpression='SET ' + ', '.join(updates),
            ExpressionAttributeValues=values
        )


def create_new_identity(customer_id: str, new_record: CustomerRecord):
    """Creates a new identity item. Minimal sketch: stores only the fields read back in step 2."""
    optional_fields = {
        'phoneNormalized': new_record.phone_normalized,
        'emailNormalized': new_record.email_normalized,
        'fullName': new_record.full_name,
        'cookieId': new_record.cookie_id,
    }
    # Omit empty attributes so index key attributes are never written as None
    item = {'customerId': customer_id}
    item.update({name: value for name, value in optional_fields.items() if value})
    IDENTITY_TABLE.put_item(Item=item)
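Because the phone and email lookups run independently, the same identity item can land in `candidates` twice when both identifiers match. Scoring it twice is harmless but wasteful. A minimal sketch of deduplicating by `customerId` before the scoring loop follows; the helper name `dedupe_candidates` is an assumption and not part of the pipeline above.

```python
from typing import Any, Dict, Iterable, List


def dedupe_candidates(items: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep the first item seen for each customerId."""
    seen = set()
    unique = []
    for item in items:
        customer_id = item["customerId"]
        if customer_id in seen:
            continue
        seen.add(customer_id)
        unique.append(item)
    return unique


# Illustrative data: identity c-1 is returned by both the phone and email lookups.
identity_c1 = {
    "customerId": "c-1",
    "phoneNormalized": "+15555551234",
    "emailNormalized": "john.doe@example.com",
}
identity_c2 = {"customerId": "c-2", "emailNormalized": "john.doe@example.com"}

candidates = [identity_c1, identity_c1, identity_c2]
unique_candidates = dedupe_candidates(candidates)
print([item["customerId"] for item in unique_candidates])
```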
4. Tagging Genesys Conversations with the Resolved Customer ID
When a new interaction begins and a customer identifier is available (ANI for voice, email for chat), run identity resolution and set the resolved customerId as a Participant Data attribute via the Genesys Conversations API:
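import requests


def tag_conversation_with_identity(conversation_id: str, customer_record: CustomerRecord, access_token: str) -> str:
    """Tags a live Genesys Cloud conversation with the resolved customer identity."""
    resolved_id = resolve_or_create_identity(customer_record)
    # Assumes customer_record.record_id holds the Genesys participant ID of the
    # customer on this conversation. The host below is region-specific, and this
    # path targets voice calls.
    response = requests.patch(
        f"https://api.mypurecloud.com/api/v2/conversations/calls/{conversation_id}/participants/{customer_record.record_id}/attributes",
        headers={"Authorization": f"Bearer {access_token}"},
        json={"attributes": {"resolvedCustomerId": resolved_id}},
        timeout=10
    )
    # Surface authorization or routing failures instead of silently dropping the tag
    response.raise_for_status()
    return resolved_id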
Validation, Edge Cases & Troubleshooting
Edge Case 1: False Positive Merges (Two Different People with Same Phone)
A shared phone number (e.g., a business front desk) can cause two completely different people to be merged into the same identity record.
Solution: Add a sanity check: if a single identity has more than 3 distinct full names associated with it, flag it as a “shared identifier” and stop auto-merging further records into it. Route it to a manual review queue.
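The sanity check can be sketched as a small predicate over the names observed for one identity. This assumes the identity store keeps a history of every full name seen for the identity (the schema earlier in this guide stores only a single `fullName`), and the helper name `is_shared_identifier` is hypothetical.

```python
from typing import Iterable

MAX_DISTINCT_NAMES = 3  # Limit taken from the solution described above


def is_shared_identifier(full_names: Iterable[str]) -> bool:
    """Hypothetical check: True when an identity has accumulated more than
    MAX_DISTINCT_NAMES distinct names (compared case-insensitively)."""
    distinct = {name.strip().lower() for name in full_names if name and name.strip()}
    return len(distinct) > MAX_DISTINCT_NAMES


# Illustrative data: a front-desk number used by four different callers.
front_desk_names = ["Ana Ruiz", "Ben Cole", "Chen Li", "Dana Fox", "ana ruiz"]
household_names = ["John Doe", "john doe", "Jane Doe"]

print(is_shared_identifier(front_desk_names), is_shared_identifier(household_names))
```

When the check returns true, the pipeline would stop auto-merging into that identity and push the record to the manual review queue.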
Edge Case 2: Match Score Just Below the Threshold
A legitimate customer scores 0.82 (just below the 0.85 threshold) because their email is slightly different. The pipeline creates a new identity record for them, which produces a duplicate.
Solution: Implement a “pending merge” queue for scores between 0.75 and 0.85. Route these to an agent or an automatic follow-up: “We noticed you may have previously contacted us under a different email. Would you like us to link your records?” A merge confirmed by the customer is more accurate than an automatic one and is easier to reconcile with GDPR consent requirements.
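The three outcomes can be expressed as a simple triage on the match score. This is a minimal sketch: the function name `triage_match` and the returned labels are assumptions, while the two thresholds come from the text above.

```python
MATCH_THRESHOLD = 0.85   # At or above: merge automatically
REVIEW_THRESHOLD = 0.75  # At or above, but below MATCH_THRESHOLD: pending merge


def triage_match(score: float) -> str:
    """Hypothetical helper: map a match score to the action the pipeline takes."""
    if score >= MATCH_THRESHOLD:
        return "merge"
    if score >= REVIEW_THRESHOLD:
        return "pending_merge"
    return "new_identity"


for example_score in (0.91, 0.82, 0.40):
    print(example_score, triage_match(example_score))
```

A score of 0.82 lands in the pending-merge band, so the record waits for confirmation before the pipeline decides whether a separate identity is needed.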
Edge Case 3: GDPR Right to Erasure Breaking the Identity Graph
A customer requests deletion. You must delete their personal data from the identity graph. But their resolved customerId may link to dozens of interaction records in your analytics data warehouse.
Solution: Implement a “tombstone” pattern: delete PII from the identity record but retain the customerId as an anonymous identifier. Replace identifiable fields with "[REDACTED]". All linked analytics records remain associated with the anonymous customerId, preserving aggregate metrics without exposing personal data.
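A minimal sketch of the tombstone step, operating on an identity item represented as a plain dictionary. The list of fields treated as PII and the helper name `tombstone_identity` are assumptions based on the attribute names used earlier in this guide; adjust the list to match your own schema.

```python
from typing import Any, Dict

# Assumed PII attributes, based on the item fields used earlier in this guide.
PII_FIELDS = ("phoneNormalized", "emailNormalized", "fullName", "cookieId")
REDACTED = "[REDACTED]"


def tombstone_identity(item: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the item with PII replaced,
    keeping customerId so linked analytics records still join."""
    tombstoned = dict(item)
    for field in PII_FIELDS:
        if field in tombstoned:
            tombstoned[field] = REDACTED
    return tombstoned


identity = {
    "customerId": "c-1",
    "phoneNormalized": "+15555551234",
    "emailNormalized": "john.doe@example.com",
    "fullName": "John Doe",
}
print(tombstone_identity(identity))
```

The original dictionary is left unchanged, so the caller decides when to write the tombstoned copy back to the identity store.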