Implementing Cross-Channel Customer Identification using Probabilistic Identity Matching

What This Guide Covers

You are building a probabilistic identity resolution system that links disparate customer records across all Genesys Cloud interaction channels (voice by ANI, web chat by session cookie or email, email by sender address, WhatsApp by phone number, and SMS by phone number) into a single unified Customer Identity, even when customers don’t provide consistent identifying information across channels. When complete, your system will automatically detect that the voice caller with ANI +15555551234 is the same person who previously chatted with email john.doe@example.com, using a probabilistic matching algorithm that computes match confidence scores rather than requiring exact field matches.


Prerequisites, Roles & Licensing

  • Genesys Cloud: CX 2 or 3 with Omnichannel.
  • Permissions required:
    • Analytics > Conversation Detail > View
  • Infrastructure:
    • A customer identity store (DynamoDB or PostgreSQL).
    • A probabilistic matching service (Splink, or a custom Jaro-Winkler/phonetic distance implementation).
    • An event-driven pipeline that updates identity records on every new interaction.

The Implementation Deep-Dive

1. The Multi-Channel Identity Problem

Customers interact across channels without consistent identifiers:

Interaction               | Identifier Available        | Resolution Challenge
Voice call (Monday)       | ANI: +15555551234           | Phone number only
Web chat (Tuesday)        | Email: john.doe@example.com | No phone number
Email inquiry (Wednesday) | Email: j.doe@example.com    | Different email variant
WhatsApp (Friday)         | Phone: +15555551234         | Matches Monday phone

Without identity resolution, these are 4 separate customers in your analytics system. A churn prediction model built on this data sees “4 customers with low contact history” instead of “1 customer with 4 contacts this week who is likely frustrated.”


2. The Probabilistic Matching Model

Probabilistic matching assigns weights to each field comparison:

from dataclasses import dataclass
from typing import Optional
import jellyfish  # Jaro-Winkler distance
import phonenumbers

@dataclass
class CustomerRecord:
    record_id: str
    email: Optional[str] = None
    email_normalized: Optional[str] = None
    phone: Optional[str] = None
    phone_normalized: Optional[str] = None
    full_name: Optional[str] = None
    ip_address: Optional[str] = None
    cookie_id: Optional[str] = None

def normalize_phone(raw_phone: str, default_region: str = "US") -> Optional[str]:
    """Normalizes a phone number to E.164 format."""
    try:
        parsed = phonenumbers.parse(raw_phone, default_region)
        if phonenumbers.is_valid_number(parsed):
            return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
    except phonenumbers.NumberParseException:
        pass  # Unparseable input falls through to return None
    return None

def normalize_email(email: str) -> str:
    """Normalizes an email for fuzzy comparison."""
    email = email.lower().strip()
    # Handle + aliases (john+work@example.com → john@example.com)
    local, domain = email.rsplit('@', 1)
    local = local.split('+')[0]
    return f"{local}@{domain}"

def compute_match_score(record_a: CustomerRecord, record_b: CustomerRecord) -> float:
    """
    Computes a probabilistic match score between two customer records.
    Returns a score from 0.0 (no match) to 1.0 (exact match).
    """
    total_weight = 0.0
    scored_weight = 0.0
    
    # Phone comparison (high weight - very discriminating)
    if record_a.phone_normalized and record_b.phone_normalized:
        total_weight += 0.45
        if record_a.phone_normalized == record_b.phone_normalized:
            scored_weight += 0.45  # Exact match
        # (No partial credit for phone - numbers are either right or wrong)
    
    # Email comparison (high weight)
    if record_a.email_normalized and record_b.email_normalized:
        total_weight += 0.40
        if record_a.email_normalized == record_b.email_normalized:
            scored_weight += 0.40  # Exact normalized match
        else:
            # Jaro-Winkler similarity catches near-identical addresses (e.g. typos);
            # abbreviated variants such as j.doe vs john.doe may score below the cutoff
            similarity = jellyfish.jaro_winkler_similarity(
                record_a.email_normalized, record_b.email_normalized
            )
            if similarity > 0.92:
                scored_weight += 0.40 * similarity  # Partial credit for near-matches
    
    # Name comparison (low weight - names can be nicknames/maiden names)
    if record_a.full_name and record_b.full_name:
        total_weight += 0.10
        name_similarity = jellyfish.jaro_winkler_similarity(
            record_a.full_name.lower(), record_b.full_name.lower()
        )
        if name_similarity > 0.85:
            scored_weight += 0.10 * name_similarity
    
    # Cookie/session ID (exact match only)
    if record_a.cookie_id and record_b.cookie_id:
        total_weight += 0.05
        if record_a.cookie_id == record_b.cookie_id:
            scored_weight += 0.05
    
    if total_weight == 0:
        return 0.0
    
    return scored_weight / total_weight
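
Because the score is divided by the weight of only those fields present on both records, adding a field can lower the score as well as raise it. The following simplified sketch reproduces just that weighting arithmetic. The WEIGHTS table mirrors the weights above; weighted_score and its agreements argument are hypothetical names used for illustration, and the partial-credit cutoffs are deliberately left out.

```python
from typing import Dict

# Field weights copied from compute_match_score above.
WEIGHTS: Dict[str, float] = {"phone": 0.45, "email": 0.40, "name": 0.10, "cookie": 0.05}

def weighted_score(agreements: Dict[str, float]) -> float:
    """agreements maps each field present on BOTH records to a similarity in [0, 1]."""
    total_weight = sum(WEIGHTS[field] for field in agreements)
    if total_weight == 0:
        return 0.0
    scored_weight = sum(WEIGHTS[field] * sim for field, sim in agreements.items())
    return scored_weight / total_weight

# Only the phone is comparable, and it matches: 0.45 / 0.45
phone_only = weighted_score({"phone": 1.0})

# Phone matches but both records carry different emails: 0.45 / (0.45 + 0.40)
phone_vs_email = weighted_score({"phone": 1.0, "email": 0.0})

print(round(phone_only, 3), round(phone_vs_email, 3))
```

A phone-only comparison scores 1.0, while the same phone match combined with a conflicting email scores roughly 0.53, well under a 0.85 threshold. That behavior is worth keeping in mind when choosing weights and thresholds.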

3. The Identity Resolution Pipeline

import boto3
from typing import Optional
import uuid

DYNAMODB = boto3.resource('dynamodb')
IDENTITY_TABLE = DYNAMODB.Table('customer-identity-graph')

MATCH_THRESHOLD = 0.85  # Scores at or above this = same customer

def resolve_or_create_identity(new_record: CustomerRecord) -> str:
    """
    For each new interaction, finds the existing customer identity or creates a new one.
    Returns the resolved customerId.
    """
    # 1. Search for candidates by exact phone or email lookup (fast path)
    candidates = []
    
    if new_record.phone_normalized:
        result = IDENTITY_TABLE.query(
            IndexName='phone-index',
            # Key attribute names must match the stored item attributes (camelCase)
            KeyConditionExpression='phoneNormalized = :phone',
            ExpressionAttributeValues={':phone': new_record.phone_normalized}
        )
        candidates.extend(result.get('Items', []))
    
    if new_record.email_normalized:
        result = IDENTITY_TABLE.query(
            IndexName='email-index',
            KeyConditionExpression='emailNormalized = :email',
            ExpressionAttributeValues={':email': new_record.email_normalized}
        )
        candidates.extend(result.get('Items', []))
    
    # 2. Score all candidates
    best_match_id: Optional[str] = None
    best_score = 0.0
    
    for candidate_item in candidates:
        candidate = CustomerRecord(
            record_id=candidate_item['customerId'],
            phone_normalized=candidate_item.get('phoneNormalized'),
            email_normalized=candidate_item.get('emailNormalized'),
            full_name=candidate_item.get('fullName'),
            cookie_id=candidate_item.get('cookieId')
        )
        
        score = compute_match_score(new_record, candidate)
        
        if score > best_score and score >= MATCH_THRESHOLD:
            best_score = score
            best_match_id = candidate_item['customerId']
    
    if best_match_id:
        # 3a. Merge new identifiers into the existing identity
        update_existing_identity(best_match_id, new_record)
        return best_match_id
    else:
        # 3b. Create a new identity
        new_customer_id = str(uuid.uuid4())
        create_new_identity(new_customer_id, new_record)
        return new_customer_id

def update_existing_identity(customer_id: str, new_record: CustomerRecord):
    """Enriches an existing identity with new identifiers from this interaction."""
    updates = []
    values = {}
    
    if new_record.phone_normalized:
        updates.append('phoneNormalized = :phone')
        values[':phone'] = new_record.phone_normalized
    
    if new_record.email_normalized:
        updates.append('emailNormalized = :email')
        values[':email'] = new_record.email_normalized
    
    if updates:
        IDENTITY_TABLE.update_item(
            Key={'customerId': customer_id},
            UpdateExpression='SET ' + ', '.join(updates),
            ExpressionAttributeValues=values
        )
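
The pipeline calls create_new_identity, which is not defined in the listing above. One possible shape is sketched below, assuming the same camelCase attribute names used elsewhere in this section. NewIdentityRecord is a stand-in for CustomerRecord, build_identity_item is a hypothetical helper, and the table is passed in explicitly (the pipeline would supply IDENTITY_TABLE) so the function can be exercised without AWS access.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class NewIdentityRecord:
    """Stand-in for CustomerRecord, limited to the fields persisted here."""
    phone_normalized: Optional[str] = None
    email_normalized: Optional[str] = None
    full_name: Optional[str] = None
    cookie_id: Optional[str] = None

def build_identity_item(customer_id: str, record: NewIdentityRecord) -> Dict[str, Any]:
    """Builds the item to store, omitting identifiers that are missing or empty."""
    optional_fields = {
        "phoneNormalized": record.phone_normalized,
        "emailNormalized": record.email_normalized,
        "fullName": record.full_name,
        "cookieId": record.cookie_id,
    }
    item: Dict[str, Any] = {"customerId": customer_id}
    item.update({name: value for name, value in optional_fields.items() if value})
    return item

def create_new_identity(customer_id: str, record: NewIdentityRecord, table: Any) -> None:
    """Writes a new identity; the condition refuses to overwrite an existing customerId."""
    table.put_item(
        Item=build_identity_item(customer_id, record),
        ConditionExpression="attribute_not_exists(customerId)",
    )
```

Omitting absent identifiers keeps the item out of any index whose key attribute it lacks, so a phone-only identity never appears in email lookups.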

4. Tagging Genesys Conversations with the Resolved Customer ID

When a new interaction begins and a customer identifier is available (ANI for voice, email for chat), run identity resolution and set the resolved customerId as a Participant Data attribute via the Genesys Conversations API:

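import requests

def tag_conversation_with_identity(conversation_id: str, participant_id: str, customer_record: CustomerRecord, access_token: str) -> str:
    """Tags a live Genesys Cloud voice conversation with the resolved customer identity."""
    resolved_id = resolve_or_create_identity(customer_record)
    
    # participant_id is the Genesys participant on the conversation,
    # not the record ID from the identity store.
    # Replace api.mypurecloud.com with the API host for your org's region.
    response = requests.patch(
        f"https://api.mypurecloud.com/api/v2/conversations/calls/{conversation_id}/participants/{participant_id}/attributes",
        headers={"Authorization": f"Bearer {access_token}"},
        json={"attributes": {"resolvedCustomerId": resolved_id}},
        timeout=10,
    )
    response.raise_for_status()  # Surface a failed tagging call instead of ignoring it
    
    return resolved_id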
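
Before resolution can run, the identifier that arrives with the interaction has to be placed in the right field of the record. A minimal sketch of that mapping follows, based on the channel list at the top of this guide. The channel keys, CHANNEL_IDENTIFIER_FIELD, and identifier_fields are hypothetical names, and web chat is assumed to have captured an email rather than only a session cookie.

```python
from typing import Dict

# Identifier each channel typically supplies, following the channel list in this guide.
# Assumption: web chat captured an email; a cookie-only chat would need its own branch.
CHANNEL_IDENTIFIER_FIELD: Dict[str, str] = {
    "voice": "phone",
    "whatsapp": "phone",
    "sms": "phone",
    "email": "email",
    "webchat": "email",
}

def identifier_fields(channel: str, raw_identifier: str) -> Dict[str, str]:
    """Returns {"phone": ...} or {"email": ...} for a known channel."""
    field = CHANNEL_IDENTIFIER_FIELD.get(channel.strip().lower())
    if field is None:
        raise ValueError(f"Unsupported channel: {channel!r}")
    return {field: raw_identifier.strip()}

fields = identifier_fields("Voice", " +15555551234 ")
print(fields)
```

The returned dictionary can be unpacked into the record constructor, after which the normalization helpers from section 2 fill in the normalized fields.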

Validation, Edge Cases & Troubleshooting

Edge Case 1: False Positive Merges (Two Different People with Same Phone)

A shared phone number (e.g., a business front desk) can cause two completely different people to be merged into the same identity record.
Solution: Add a sanity check: if a single identity has more than 3 distinct full names associated with it, flag it as a “shared identifier” and stop auto-merging further records into it. Route it to a manual review queue.
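
The sanity check can be sketched as a small predicate, assuming each identity keeps a list of the full names observed across its interactions (a hypothetical attribute not shown in the schema above). The function name is_shared_identifier is also illustrative.

```python
from typing import Iterable

MAX_DISTINCT_NAMES = 3  # Limit taken from the rule described above

def is_shared_identifier(observed_names: Iterable[str],
                         max_distinct: int = MAX_DISTINCT_NAMES) -> bool:
    """True when more than max_distinct different names are attached to one identity."""
    # Lowercase and collapse whitespace so "John  Doe" and "john doe" count once.
    distinct = {
        " ".join(name.lower().split())
        for name in observed_names
        if name and name.strip()
    }
    return len(distinct) > max_distinct

front_desk = ["Ana Ruiz", "Ben Cho", "Cara Lee", "Dev Patel"]
print(is_shared_identifier(front_desk))
```

Names are compared exactly after case and whitespace cleanup, so nicknames count as distinct names. A fuzzy comparison would reduce false flags at the cost of missing some shared numbers.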

Edge Case 2: Match Score Just Below the Threshold

A legitimate customer has a score of 0.82 (just below the 0.85 threshold) because their email is slightly different. They get a new identity record created, creating a duplicate.
Solution: Implement a “pending merge” queue for scores between 0.75 and 0.85. Route these to an agent or an automated follow-up: “We noticed you may have previously contacted us under a different email. Would you like us to link your records?” A merge confirmed by the customer is more accurate than an automatic one and is easier to justify under GDPR than a silent merge.
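
The three outcomes can be expressed as a small triage step placed in front of the merge logic. This is a sketch only: MergeDecision and triage_match are hypothetical names, and treating 0.75 as an inclusive lower bound is an assumption.

```python
from enum import Enum

class MergeDecision(Enum):
    AUTO_MERGE = "auto_merge"
    PENDING_REVIEW = "pending_review"
    NEW_IDENTITY = "new_identity"

def triage_match(score: float,
                 merge_threshold: float = 0.85,
                 review_threshold: float = 0.75) -> MergeDecision:
    """Maps a match score onto merge, review queue, or new identity."""
    if score >= merge_threshold:
        return MergeDecision.AUTO_MERGE
    if score >= review_threshold:
        return MergeDecision.PENDING_REVIEW
    return MergeDecision.NEW_IDENTITY

decision = triage_match(0.82)
print(decision.value)
```

With this in place, the 0.82 case above lands in the review queue instead of silently creating a duplicate identity.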

Edge Case 3: GDPR Right to Erasure Breaking the Identity Graph

A customer requests deletion. You must delete their personal data from the identity graph. But their resolved customerId may link to dozens of interaction records in your analytics data warehouse.
Solution: Implement a “tombstone” pattern: delete PII from the identity record but retain the customerId as an anonymous identifier. Replace identifiable fields with "[REDACTED]". All linked analytics records remain associated with the anonymous customerId, preserving aggregate metrics without exposing personal data.
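
A minimal sketch of the tombstone step is shown below, assuming the camelCase attribute names used in section 3. PII_FIELDS, the erased marker, and tombstone_identity are hypothetical names. The function only builds the redacted item; writing it back to the store is left to the caller.

```python
from typing import Any, Dict

# Attributes treated as personal data in this sketch; extend to match your schema.
PII_FIELDS = ("phoneNormalized", "emailNormalized", "fullName", "cookieId")
REDACTED = "[REDACTED]"

def tombstone_identity(item: Dict[str, Any]) -> Dict[str, Any]:
    """Returns a copy of the identity item with PII replaced and customerId retained."""
    tombstone = dict(item)  # Copy so the caller's item is left untouched
    for field in PII_FIELDS:
        if field in tombstone:
            tombstone[field] = REDACTED
    tombstone["erased"] = True  # Hypothetical marker so merges can skip this identity
    return tombstone

original = {
    "customerId": "c-123",
    "phoneNormalized": "+15555551234",
    "emailNormalized": "john.doe@example.com",
}
print(tombstone_identity(original))
```

The matching pipeline should skip identities carrying the marker so that new interactions are never merged into an erased record.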

Official References