Implementing Contact List Hygiene Pipelines for Deduplication and Number Validation

Implementing Contact List Hygiene Pipelines for Deduplication and Number Validation

What This Guide Covers

You are building an automated contact list hygiene pipeline that processes raw contact data before it is uploaded to Genesys Cloud or NICE CXone outbound dialing campaigns. When complete, your pipeline will automatically deduplicate records by phone number and email, validate phone numbers for dialability (active, not disconnected, mobile vs. landline classification), remove numbers on your DNC list, and produce a clean, compliant contact list with a detailed hygiene report-reducing wasted dial attempts, improving contact rates, and preventing TCPA violations from invalid or prohibited numbers.


Prerequisites, Roles & Licensing

  • Genesys Cloud: Any CX tier with Outbound Dialing, or NICE CXone with outbound dialing.
  • Infrastructure:
    • Python with phonenumbers, pandas, and dedupe libraries.
    • Optional: Twilio Lookup API or Numverify API for number validation (active/inactive, carrier, line type).
    • Access to your DNC store (from the companion TCPA compliance guide).

The Implementation Deep-Dive

1. The Dirty Data Problem

A typical outbound campaign list, compiled from a CRM export and a marketing lead database, may contain:

  • 15% duplicate phone numbers (same number listed under multiple contacts).
  • 8% invalid phone numbers (disconnected, malformed, or test data like 555-1234).
  • 5% misclassified numbers (landline dialed by a mobile-only campaign, or vice versa).
  • 3% numbers on the DNC list that weren’t scrubbed before export.
  • 2% international numbers in a US-only campaign (wrong country code).

Collectively, 33% of the list is wasted dial attempts. For a 100,000-record campaign at $0.02/call, that’s $660 in wasted telephony cost per campaign.


2. The Full Hygiene Pipeline

from dataclasses import dataclass
import pandas as pd
import phonenumbers
import hashlib
from typing import Optional

@dataclass
class HygieneReport:
    total_records_in: int
    after_deduplication: int
    after_phone_validation: int
    after_dnc_scrub: int
    total_records_out: int
    duplicates_removed: int
    invalid_phones_removed: int
    dnc_suppressed: int
    landlines_flagged: int
    international_removed: int

def run_hygiene_pipeline(
    input_df: pd.DataFrame,
    phone_column: str = "phone",
    email_column: str = "email",
    campaign_country: str = "US",
    campaign_line_type: str = "any"  # "mobile_only", "landline_only", or "any"
) -> tuple[pd.DataFrame, HygieneReport]:
    """
    Full contact list hygiene pipeline.
    Returns (clean_df, hygiene_report).
    """
    total_in = len(input_df)
    
    # Stage 1: Normalize phone numbers
    input_df = normalize_phone_column(input_df, phone_column, campaign_country)
    
    # Stage 2: Remove invalid phone numbers
    valid_mask = input_df['phone_e164'].notna()
    invalid_count = (~valid_mask).sum()
    input_df = input_df[valid_mask].copy()
    
    # Stage 3: Remove international numbers (if US-only campaign)
    if campaign_country == "US":
        us_mask = input_df['phone_e164'].str.startswith('+1')
        intl_count = (~us_mask).sum()
        input_df = input_df[us_mask].copy()
    else:
        intl_count = 0
    
    # Stage 4: Deduplicate by phone number (keep first occurrence - CRM priority)
    after_norm = len(input_df)
    input_df = input_df.drop_duplicates(subset=['phone_e164'], keep='first')
    dedup_count = after_norm - len(input_df)
    
    # Stage 5: DNC scrub (calls the centralized DNC check from companion guide)
    from dnc_registry import is_dnc
    dnc_mask = input_df['phone_e164'].apply(lambda p: not is_dnc(p)[0])
    dnc_count = (~dnc_mask).sum()
    input_df = input_df[dnc_mask].copy()
    
    # Stage 6: Line type classification (optional - requires Twilio Lookup API)
    landline_count = 0
    if campaign_line_type == "mobile_only":
        line_types = input_df['phone_e164'].apply(lookup_line_type)
        mobile_mask = line_types == "mobile"
        landline_count = (~mobile_mask).sum()
        input_df = input_df[mobile_mask].copy()
    
    return input_df, HygieneReport(
        total_records_in=total_in,
        after_deduplication=after_norm - dedup_count,
        after_phone_validation=after_norm - dedup_count - invalid_count,
        after_dnc_scrub=len(input_df) + dnc_count,
        total_records_out=len(input_df),
        duplicates_removed=dedup_count,
        invalid_phones_removed=invalid_count,
        dnc_suppressed=dnc_count,
        landlines_flagged=landline_count,
        international_removed=intl_count
    )

3. Phone Normalization

def normalize_phone_column(df: pd.DataFrame, phone_col: str, default_country: str) -> pd.DataFrame:
    """Normalizes all phone numbers in the DataFrame to E.164 format."""
    
    def normalize_single(raw: str) -> Optional[str]:
        if pd.isna(raw) or str(raw).strip() == '':
            return None
        
        raw = str(raw).strip()
        
        # Remove common formatting artifacts
        raw = raw.replace(' ', '').replace('-', '').replace('(', '').replace(')', '').replace('.', '')
        
        try:
            parsed = phonenumbers.parse(raw, default_country)
            if phonenumbers.is_valid_number(parsed):
                return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
        except phonenumbers.NumberParseException:
            pass
        
        return None  # Invalid phone
    
    df = df.copy()
    df['phone_e164'] = df[phone_col].apply(normalize_single)
    df['phone_original'] = df[phone_col]  # Preserve original for audit
    
    return df

4. Line Type Lookup (Twilio Carrier API)

import requests

TWILIO_ACCOUNT_SID = "your_account_sid"
TWILIO_AUTH_TOKEN = "your_auth_token"

def lookup_line_type(phone_e164: str) -> str:
    """
    Uses Twilio Lookup API to determine line type: 'mobile', 'landline', or 'voip'.
    Caches results to avoid duplicate API calls.
    """
    try:
        resp = requests.get(
            f"https://lookups.twilio.com/v2/PhoneNumbers/{phone_e164}",
            params={"Fields": "line_type_intelligence"},
            auth=(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN),
            timeout=5
        )
        
        if resp.ok:
            data = resp.json()
            line_type = data.get("line_type_intelligence", {}).get("type", "unknown")
            return line_type.lower()
    except:
        pass
    
    return "unknown"

5. Generating the Hygiene Report

def generate_hygiene_report_markdown(report: HygieneReport, campaign_name: str) -> str:
    rejection_rate = (1 - report.total_records_out / max(report.total_records_in, 1)) * 100
    
    return f"""
# Contact List Hygiene Report: {campaign_name}
**Generated:** {datetime.utcnow().strftime('%Y-%m-%d %H:%M UTC')}

## Summary
| Stage | Records |
| :--- | ---: |
| **Input (raw)** | {report.total_records_in:,} |
| After deduplication | {report.after_deduplication:,} |
| After phone validation | {report.after_phone_validation:,} |
| After DNC scrub | {report.after_dnc_scrub:,} |
| **Output (clean)** | **{report.total_records_out:,}** |

## Removed Records
- Duplicates removed: **{report.duplicates_removed:,}**
- Invalid phones: **{report.invalid_phones_removed:,}**
- DNC suppressed: **{report.dnc_suppressed:,}**
- International (out-of-scope): **{report.international_removed:,}**
- Landlines (mobile-only campaign): **{report.landlines_flagged:,}**

**Overall rejection rate: {rejection_rate:.1f}%**
"""

Validation, Edge Cases & Troubleshooting

Edge Case 1: Twilio Lookup API Cost at Scale

Twilio’s Lookup API (Line Type Intelligence) costs $0.005 per lookup. For a 500,000-record list, that’s $2,500 per campaign scrub-potentially exceeding your telephony costs.
Solution: Cache line type results in Redis or DynamoDB with a 90-day TTL. Most numbers don’t change line type frequently. After the initial bulk lookup, subsequent campaigns reuse cached results. Only look up new/unseen numbers.

Edge Case 2: VoIP Numbers Misclassified as Mobile

Many VoIP numbers (Google Voice, WhatsApp Business numbers) are classified as “mobile” by carrier lookup APIs because they are assigned to mobile ranges. If you’re running a mobile-only campaign because TCPA restrictions apply differently to mobile calls, dialing a VoIP number with an autodialer still has TCPA exposure.
Solution: Classify VoIP numbers as requiring manual dialing (not predictive/automatic), even if they appear in the mobile range. Most carrier APIs return a type of voip separately from mobile.

Edge Case 3: Email-Based Deduplication Not Matching Phone-Based Duplicates

If Contact A has phone +15555551234, email: john@example.com and Contact B has phone +15555559876, email: john@example.com, deduplication by phone only leaves both in the list. They’re the same person with two phones.
Solution: Run a two-pass deduplication: first by phone E.164 (exact), then by normalized email. For email duplicates, keep the record with the more recently updated CRM timestamp.

Official References