Implementing Contact List Hygiene Pipelines for Deduplication and Number Validation
What This Guide Covers
You are building an automated contact list hygiene pipeline that processes raw contact data before it is uploaded to Genesys Cloud or NICE CXone outbound dialing campaigns. When complete, your pipeline will automatically deduplicate records by phone number and email, validate phone numbers for dialability (active, not disconnected, mobile vs. landline classification), remove numbers on your DNC list, and produce a clean, compliant contact list with a detailed hygiene report-reducing wasted dial attempts, improving contact rates, and preventing TCPA violations from invalid or prohibited numbers.
Prerequisites, Roles & Licensing
- Genesys Cloud: Any CX tier with Outbound Dialing, or NICE CXone with outbound dialing.
- Infrastructure:
- Python with
phonenumbers,pandas, anddedupelibraries. - Optional: Twilio Lookup API or Numverify API for number validation (active/inactive, carrier, line type).
- Access to your DNC store (from the companion TCPA compliance guide).
- Python with
The Implementation Deep-Dive
1. The Dirty Data Problem
A typical outbound campaign list, compiled from a CRM export and a marketing lead database, may contain:
- 15% duplicate phone numbers (same number listed under multiple contacts).
- 8% invalid phone numbers (disconnected, malformed, or test data like
555-1234). - 5% misclassified numbers (landline dialed by a mobile-only campaign, or vice versa).
- 3% numbers on the DNC list that weren’t scrubbed before export.
- 2% international numbers in a US-only campaign (wrong country code).
Collectively, 33% of the list is wasted dial attempts. For a 100,000-record campaign at $0.02/call, that’s $660 in wasted telephony cost per campaign.
2. The Full Hygiene Pipeline
from dataclasses import dataclass
import pandas as pd
import phonenumbers
import hashlib
from typing import Optional
@dataclass
class HygieneReport:
total_records_in: int
after_deduplication: int
after_phone_validation: int
after_dnc_scrub: int
total_records_out: int
duplicates_removed: int
invalid_phones_removed: int
dnc_suppressed: int
landlines_flagged: int
international_removed: int
def run_hygiene_pipeline(
input_df: pd.DataFrame,
phone_column: str = "phone",
email_column: str = "email",
campaign_country: str = "US",
campaign_line_type: str = "any" # "mobile_only", "landline_only", or "any"
) -> tuple[pd.DataFrame, HygieneReport]:
"""
Full contact list hygiene pipeline.
Returns (clean_df, hygiene_report).
"""
total_in = len(input_df)
# Stage 1: Normalize phone numbers
input_df = normalize_phone_column(input_df, phone_column, campaign_country)
# Stage 2: Remove invalid phone numbers
valid_mask = input_df['phone_e164'].notna()
invalid_count = (~valid_mask).sum()
input_df = input_df[valid_mask].copy()
# Stage 3: Remove international numbers (if US-only campaign)
if campaign_country == "US":
us_mask = input_df['phone_e164'].str.startswith('+1')
intl_count = (~us_mask).sum()
input_df = input_df[us_mask].copy()
else:
intl_count = 0
# Stage 4: Deduplicate by phone number (keep first occurrence - CRM priority)
after_norm = len(input_df)
input_df = input_df.drop_duplicates(subset=['phone_e164'], keep='first')
dedup_count = after_norm - len(input_df)
# Stage 5: DNC scrub (calls the centralized DNC check from companion guide)
from dnc_registry import is_dnc
dnc_mask = input_df['phone_e164'].apply(lambda p: not is_dnc(p)[0])
dnc_count = (~dnc_mask).sum()
input_df = input_df[dnc_mask].copy()
# Stage 6: Line type classification (optional - requires Twilio Lookup API)
landline_count = 0
if campaign_line_type == "mobile_only":
line_types = input_df['phone_e164'].apply(lookup_line_type)
mobile_mask = line_types == "mobile"
landline_count = (~mobile_mask).sum()
input_df = input_df[mobile_mask].copy()
return input_df, HygieneReport(
total_records_in=total_in,
after_deduplication=after_norm - dedup_count,
after_phone_validation=after_norm - dedup_count - invalid_count,
after_dnc_scrub=len(input_df) + dnc_count,
total_records_out=len(input_df),
duplicates_removed=dedup_count,
invalid_phones_removed=invalid_count,
dnc_suppressed=dnc_count,
landlines_flagged=landline_count,
international_removed=intl_count
)
3. Phone Normalization
def normalize_phone_column(df: pd.DataFrame, phone_col: str, default_country: str) -> pd.DataFrame:
"""Normalizes all phone numbers in the DataFrame to E.164 format."""
def normalize_single(raw: str) -> Optional[str]:
if pd.isna(raw) or str(raw).strip() == '':
return None
raw = str(raw).strip()
# Remove common formatting artifacts
raw = raw.replace(' ', '').replace('-', '').replace('(', '').replace(')', '').replace('.', '')
try:
parsed = phonenumbers.parse(raw, default_country)
if phonenumbers.is_valid_number(parsed):
return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
except phonenumbers.NumberParseException:
pass
return None # Invalid phone
df = df.copy()
df['phone_e164'] = df[phone_col].apply(normalize_single)
df['phone_original'] = df[phone_col] # Preserve original for audit
return df
4. Line Type Lookup (Twilio Carrier API)
import requests
TWILIO_ACCOUNT_SID = "your_account_sid"
TWILIO_AUTH_TOKEN = "your_auth_token"
def lookup_line_type(phone_e164: str) -> str:
"""
Uses Twilio Lookup API to determine line type: 'mobile', 'landline', or 'voip'.
Caches results to avoid duplicate API calls.
"""
try:
resp = requests.get(
f"https://lookups.twilio.com/v2/PhoneNumbers/{phone_e164}",
params={"Fields": "line_type_intelligence"},
auth=(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN),
timeout=5
)
if resp.ok:
data = resp.json()
line_type = data.get("line_type_intelligence", {}).get("type", "unknown")
return line_type.lower()
except:
pass
return "unknown"
5. Generating the Hygiene Report
def generate_hygiene_report_markdown(report: HygieneReport, campaign_name: str) -> str:
rejection_rate = (1 - report.total_records_out / max(report.total_records_in, 1)) * 100
return f"""
# Contact List Hygiene Report: {campaign_name}
**Generated:** {datetime.utcnow().strftime('%Y-%m-%d %H:%M UTC')}
## Summary
| Stage | Records |
| :--- | ---: |
| **Input (raw)** | {report.total_records_in:,} |
| After deduplication | {report.after_deduplication:,} |
| After phone validation | {report.after_phone_validation:,} |
| After DNC scrub | {report.after_dnc_scrub:,} |
| **Output (clean)** | **{report.total_records_out:,}** |
## Removed Records
- Duplicates removed: **{report.duplicates_removed:,}**
- Invalid phones: **{report.invalid_phones_removed:,}**
- DNC suppressed: **{report.dnc_suppressed:,}**
- International (out-of-scope): **{report.international_removed:,}**
- Landlines (mobile-only campaign): **{report.landlines_flagged:,}**
**Overall rejection rate: {rejection_rate:.1f}%**
"""
Validation, Edge Cases & Troubleshooting
Edge Case 1: Twilio Lookup API Cost at Scale
Twilio’s Lookup API (Line Type Intelligence) costs $0.005 per lookup. For a 500,000-record list, that’s $2,500 per campaign scrub-potentially exceeding your telephony costs.
Solution: Cache line type results in Redis or DynamoDB with a 90-day TTL. Most numbers don’t change line type frequently. After the initial bulk lookup, subsequent campaigns reuse cached results. Only look up new/unseen numbers.
Edge Case 2: VoIP Numbers Misclassified as Mobile
Many VoIP numbers (Google Voice, WhatsApp Business numbers) are classified as “mobile” by carrier lookup APIs because they are assigned to mobile ranges. If you’re running a mobile-only campaign because TCPA restrictions apply differently to mobile calls, dialing a VoIP number with an autodialer still has TCPA exposure.
Solution: Classify VoIP numbers as requiring manual dialing (not predictive/automatic), even if they appear in the mobile range. Most carrier APIs return a type of voip separately from mobile.
Edge Case 3: Email-Based Deduplication Not Matching Phone-Based Duplicates
If Contact A has phone +15555551234, email: john@example.com and Contact B has phone +15555559876, email: john@example.com, deduplication by phone only leaves both in the list. They’re the same person with two phones.
Solution: Run a two-pass deduplication: first by phone E.164 (exact), then by normalized email. For email duplicates, keep the record with the more recently updated CRM timestamp.