Designing Robust Webhooks and Idempotency for Third-Party Messaging Platform Webhooks

What This Guide Covers

You are designing the server-side architecture for a Genesys Cloud Open Messaging integration that receives inbound webhooks from third-party messaging platforms (WhatsApp via Meta, LINE, Viber, or custom enterprise SMS gateways). When complete, your integration middleware will handle webhook delivery guarantees correctly, processing each message exactly once even when the external platform retries delivery after a timeout. It will also survive burst traffic spikes, carrier retries, and duplicate deliveries without creating phantom conversations or double responses to customers.


Prerequisites, Roles & Licensing

  • Genesys Cloud: Any CX tier with Open Messaging configured.
  • Permissions required:
    • Conversations > Message > Create (for the inbound webhook handler’s service account)
    • Integrations > Integration > Edit (for Open Messaging integration config)
  • Infrastructure:
    • A serverless or containerized middleware layer (AWS Lambda + API Gateway, or a Node.js service behind a load balancer).
    • A distributed cache or database for idempotency key storage (Redis or DynamoDB).

The Implementation Deep-Dive

1. The Webhook Reliability Problem

Third-party messaging platforms (especially WhatsApp via Meta, and enterprise SMS gateways) have a critical, often-misunderstood behavior: they retry webhook deliveries aggressively if your endpoint does not respond with HTTP 200 within the platform's timeout, typically somewhere between 5 and 20 seconds.

If your middleware takes 8 seconds to process a webhook (e.g., because it’s doing a CRM lookup, creating a Genesys Cloud conversation, and logging to a database synchronously), the platform assumes the delivery failed and retries. Now you have two identical webhook events processing concurrently. Without idempotency protection, this creates:

  1. Duplicate conversations in Genesys Cloud - the customer appears to have sent the same message twice.
  2. Double responses - the agent types one reply, but it is sent to the customer twice.
  3. Metrics inflation - your “Inbound Messages” counter is artificially doubled.

2. The Correct Pattern: Acknowledge First, Process Second

The cardinal rule of webhook receivers is: acknowledge receipt immediately, then process asynchronously.

# AWS Lambda (or equivalent) handling an inbound WhatsApp webhook
import json
import boto3

SQS = boto3.client('sqs')
# FIFO queue names must end in ".fifo"; MessageDeduplicationId only works on FIFO queues
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456/inbound-messages.fifo"

def lambda_handler(event, context):
    """
    Webhook receiver: ONLY validates and enqueues.
    Returns HTTP 200 in under 500ms to prevent retries.
    """
    # Step 1: Validate the webhook signature (HMAC-SHA256) against the raw body,
    # before parsing anything (validate_webhook_signature is defined in section 4)
    # This is the ONLY synchronous operation allowed here
    if not validate_webhook_signature(event):
        return {'statusCode': 403, 'body': 'Invalid signature'}

    # 'body' can be present but None, so fall back explicitly
    body = json.loads(event.get('body') or '{}')

    # Step 2: Extract the idempotency key
    # WhatsApp provides a unique message ID for every message
    # "or [{}]" guards against keys that are present but hold an empty list
    entry = (body.get('entry') or [{}])[0]
    change = (entry.get('changes') or [{}])[0]
    messages = change.get('value', {}).get('messages') or [{}]
    message_id = messages[0].get('id', '')

    if not message_id:
        return {'statusCode': 200, 'body': 'No message to process'}

    # Step 3: Enqueue for async processing with deduplication ID
    SQS.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(body),
        # SQS FIFO with MessageDeduplicationId = native idempotency
        MessageGroupId="inbound-webhooks",
        MessageDeduplicationId=message_id
    )

    # Step 4: Return 200 IMMEDIATELY - before any further processing
    return {'statusCode': 200, 'body': 'Acknowledged'}

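Why SQS FIFO? When MessageDeduplicationId is set to the platform’s unique message ID, an SQS FIFO queue accepts a repeated send but does not deliver it again if it arrives within the 5-minute deduplication interval. If WhatsApp retries the same webhook 3 times, only one message is delivered to the consumer. Retries that arrive after that interval are not caught by the queue, which is why the async processor keeps its own idempotency record.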
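
The handler above reads only the first entry, change, and message, but a single webhook delivery may batch several of each. The sketch below walks the whole payload and returns every message ID it finds. The name extract_message_ids is hypothetical, and the payload shape is assumed to follow the entry, changes, value, messages path used above.

```python
from typing import Any


def extract_message_ids(body: dict[str, Any]) -> list[str]:
    """Collect every message ID in a WhatsApp-style webhook payload."""
    ids: list[str] = []
    # "or []" tolerates keys that are missing or explicitly null
    for entry in body.get("entry") or []:
        for change in entry.get("changes") or []:
            value = change.get("value") or {}
            for message in value.get("messages") or []:
                message_id = message.get("id")
                if message_id:
                    ids.append(message_id)
    return ids


# Illustrative payload with two messages in one delivery (IDs are made up)
sample = {
    "entry": [
        {
            "changes": [
                {"value": {"messages": [{"id": "wamid.A"}, {"id": "wamid.B"}]}},
                {"value": {"statuses": [{"id": "wamid.C"}]}},  # no messages key
            ]
        }
    ]
}
print(extract_message_ids(sample))
```

If you adopt something like this, enqueue one SQS message per ID so that each one gets its own MessageDeduplicationId.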


3. The Async Processor: Genesys Cloud Conversation Creation

A second Lambda function consumes the SQS queue and does the heavy work.

import json
import redis

REDIS_CLIENT = redis.Redis(host='your-redis-cluster', port=6379, decode_responses=True)
IDEMPOTENCY_TTL_SECONDS = 3600  # 1 hour

def process_inbound_message(event, context):
    """Processes queued inbound messages exactly once."""

    for record in event.get('Records', []):
        body = json.loads(record['body'])

        # Extract platform message ID
        # (extract_message_id and create_or_find_genesys_conversation are
        # application-specific helpers that are not shown in this guide)
        message_id = extract_message_id(body)

        # Redis-based idempotency check (belt-and-suspenders over SQS FIFO)
        # SET with NX = only if Not eXists. It checks and claims the key in one
        # atomic step, so no separate EXISTS call is needed.
        # Note: if the worker is killed mid-processing, this key blocks retries
        # until the TTL expires.
        redis_key = f"processed:webhook:{message_id}"
        acquired = REDIS_CLIENT.set(redis_key, "processing", nx=True, ex=IDEMPOTENCY_TTL_SECONDS)
        if not acquired:
            print(f"[SKIP] Duplicate message {message_id} - already processed or in progress.")
            continue

        try:
            # Do the actual work: create or find the Genesys conversation
            create_or_find_genesys_conversation(body)

            # Mark as fully processed
            REDIS_CLIENT.set(redis_key, "done", ex=IDEMPOTENCY_TTL_SECONDS)

        except Exception:
            # Remove the lock so the message can be retried
            REDIS_CLIENT.delete(redis_key)
            # Re-raise so SQS redelivers the batch; after maxReceiveCount
            # failed receives the message moves to the DLQ
            raise
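
The claim, complete, and release steps can be exercised without a Redis cluster. The sketch below is one way to do that, not the guide's production code: InMemoryStore is a hypothetical stand-in that imitates only the set (with nx and ex) and delete calls used above, and process_once is an illustrative name for the claim logic.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for the subset of redis-py used here."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, nx=False, ex=None):
        now = time.monotonic()
        current = self._data.get(key)
        if current is not None and current[1] is not None and current[1] <= now:
            current = None  # the stored entry has expired
        if nx and current is not None:
            return None  # redis-py returns None when NX prevents the write
        expires_at = now + ex if ex is not None else None
        self._data[key] = (value, expires_at)
        return True

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


def process_once(store, message_id, handler, ttl=3600):
    """Run handler only if this call claims message_id; return True if it ran."""
    key = f"processed:webhook:{message_id}"
    if not store.set(key, "processing", nx=True, ex=ttl):
        return False  # duplicate, or another worker holds the claim
    try:
        handler(message_id)
    except Exception:
        store.delete(key)  # release the claim so a retry can take it
        raise
    store.set(key, "done", ex=ttl)
    return True


store = InMemoryStore()
handled = []
print(process_once(store, "wamid.A", handled.append))  # first delivery
print(process_once(store, "wamid.A", handled.append))  # duplicate delivery
```

Because the function takes the store as a parameter, the same logic should run against a real redis.Redis client in production and against the stand-in in unit tests.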

4. Webhook Signature Validation (Security)

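Most messaging platforms sign their webhook payloads with an HMAC-SHA256 signature, computed over the raw request body using a shared secret (for WhatsApp, the Meta app secret). Header names and signature encodings differ between platforms, so check each platform's documentation before reusing the WhatsApp example below.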

Never skip signature validation. Without it, any attacker who discovers your webhook URL can flood Genesys Cloud with fake inbound conversations.

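import base64
import hmac
import hashlib

def validate_webhook_signature(event: dict, shared_secret: str = "YOUR_WEBHOOK_SECRET") -> bool:
    """Validates WhatsApp webhook signature (X-Hub-Signature-256 header)."""
    # "YOUR_WEBHOOK_SECRET" is a placeholder; load the real secret from a secrets store

    # Header casing depends on the API Gateway type, so match case-insensitively
    headers = {k.lower(): v for k, v in (event.get('headers') or {}).items()}
    signature_header = headers.get('x-hub-signature-256', '')

    if not signature_header.startswith("sha256="):
        return False

    expected_signature = signature_header[7:]  # Strip "sha256="

    # Sign the bytes exactly as received; API Gateway may base64-encode the body
    raw_body = event.get('body') or ''
    if event.get('isBase64Encoded'):
        raw_body = base64.b64decode(raw_body)
    else:
        raw_body = raw_body.encode('utf-8')

    computed_signature = hmac.new(
        shared_secret.encode('utf-8'),
        raw_body,
        hashlib.sha256
    ).hexdigest()

    # Use constant-time comparison to prevent timing attacks
    return hmac.compare_digest(expected_signature, computed_signature)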
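
To check signature handling locally, you can sign a test payload yourself and confirm that tampering is detected. This is a minimal sketch: sign and signature_is_valid are hypothetical helper names, "test-secret" is a made-up value, and the header format is assumed to be the sha256=<hex digest> form described above.

```python
import hashlib
import hmac


def sign(raw_body: bytes, secret: str) -> str:
    """Build the header value a sender would attach: 'sha256=<hex digest>'."""
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def signature_is_valid(raw_body: bytes, header_value: str, secret: str) -> bool:
    """Check a 'sha256=<hex>' header against the raw request bytes."""
    if not header_value.startswith("sha256="):
        return False
    # Compare as bytes so a non-ASCII header cannot raise TypeError
    return hmac.compare_digest(
        header_value.encode("utf-8"),
        sign(raw_body, secret).encode("utf-8"),
    )


payload = b'{"entry": []}'
header = sign(payload, "test-secret")
print(signature_is_valid(payload, header, "test-secret"))         # untouched body
print(signature_is_valid(payload + b" ", header, "test-secret"))  # altered body
```

The second call fails because even a single added byte changes the digest, which is why the signature must be computed over the raw body rather than over re-serialized JSON.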

5. The Dead Letter Queue (DLQ) and Alerting

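If a message fails processing (e.g., the Genesys Cloud API is temporarily unavailable), SQS makes it visible again once the visibility timeout expires, and the processor receives it again. After a configured number of receives (the queue's maxReceiveCount, e.g., 5), the message moves to a Dead Letter Queue. For a FIFO source queue, the DLQ must also be a FIFO queue.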

DLQ Monitoring:

  1. Create a CloudWatch alarm on the ApproximateNumberOfMessagesVisible metric of the DLQ.
  2. If the DLQ depth exceeds 10 messages, trigger a PagerDuty alert immediately.
  3. Every message in the DLQ represents a customer message that was never delivered to an agent. This is a critical, customer-facing failure requiring urgent intervention.
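
The alarm from steps 1 and 2 could be created with boto3 roughly as follows. This is a sketch under assumptions: the alarm name, the one-minute period, and the SNS topic that forwards to PagerDuty are all illustrative choices, and build_dlq_alarm_params and create_dlq_alarm are hypothetical helper names.

```python
def build_dlq_alarm_params(dlq_name: str, sns_topic_arn: str, threshold: int = 10) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm on DLQ depth."""
    return {
        "AlarmName": f"{dlq_name}-depth",  # illustrative naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 60,               # assumption: evaluate once per minute
        "EvaluationPeriods": 1,
        "Threshold": threshold,     # fires when depth is strictly greater
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # assumption: topic wired to PagerDuty
    }


def create_dlq_alarm(dlq_name: str, sns_topic_arn: str) -> None:
    """Create or update the alarm; requires AWS credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_dlq_alarm_params(dlq_name, sns_topic_arn))
```

Keeping the parameter builder separate from the AWS call makes the alarm definition easy to unit test and review.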

DLQ Recovery:
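Once the underlying issue is resolved, use SQS dead-letter queue redrive to move the DLQ messages back to the main queue for reprocessing. (The redrive policy is the setting that sends failed messages to the DLQ in the first place; redrive is the reverse operation.) The Redis idempotency layer ensures that messages already successfully processed (but re-queued due to a race condition) are silently skipped, provided the redrive happens before their keys expire (IDEMPOTENCY_TTL_SECONDS, 1 hour in the example above).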
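
A redrive can be started from the console or through the SQS StartMessageMoveTask API. The sketch below wraps the boto3 call; start_dlq_redrive is a hypothetical helper name, the client is passed in so the function can be tested with a stub, and you should confirm that redrive is available for your queue type and region before relying on it.

```python
from typing import Any, Optional


def start_dlq_redrive(
    sqs_client: Any,
    dlq_arn: str,
    max_per_second: Optional[int] = None,
) -> str:
    """Start moving messages from the DLQ back to their source queue."""
    params: dict[str, Any] = {"SourceArn": dlq_arn}
    if max_per_second is not None:
        # Throttle the move so the processor is not flooded
        params["MaxNumberOfMessagesPerSecond"] = max_per_second
    # With no DestinationArn, messages return to the queue they came from
    response = sqs_client.start_message_move_task(**params)
    return response["TaskHandle"]
```

In production you would pass boto3.client("sqs") as sqs_client. Throttling the move is worth considering when the DLQ is deep, because every redriven message triggers the full Genesys Cloud processing path again.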


Validation, Edge Cases & Troubleshooting

Edge Case 1: The Platform Changes Message ID Format

If a third-party platform modifies its webhook payload structure during an API version upgrade, the field path used to extract the message_id might break, so every message arrives with an empty message_id. The receiver in section 2 then answers 200 with “No message to process” and silently drops every message. A processor that does not guard against an empty ID fails differently: the Redis key becomes processed:webhook: for every message, they all collide, and only the first message is ever processed.
Solution: If message_id extraction returns an empty string, generate a deterministic fallback ID by hashing the full payload body: hashlib.sha256(raw_body).hexdigest(). This is not perfect (two genuinely identical messages would still be deduplicated), but it prevents the catastrophic “all messages map to the same key” failure.
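
The fallback can be sketched as a small key-building function. The name idempotency_key and the body-sha256 key segment are illustrative assumptions; the processed:webhook: prefix matches the processor shown earlier.

```python
import hashlib


def idempotency_key(message_id: str, raw_body: bytes) -> str:
    """Return the Redis key for a webhook, falling back to a body hash."""
    if message_id:
        return f"processed:webhook:{message_id}"
    # No usable ID: derive a deterministic key from the exact payload bytes
    digest = hashlib.sha256(raw_body).hexdigest()
    return f"processed:webhook:body-sha256:{digest}"


print(idempotency_key("wamid.A", b"{}"))
print(idempotency_key("", b'{"text": "Hello"}'))
```

Giving the fallback its own key segment keeps hashed keys from ever colliding with real platform IDs and makes fallback usage easy to spot in Redis.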

Edge Case 2: Redis Unavailability

If your Redis cluster goes down, the REDIS_CLIENT.set() call raises an exception, causing every message to fail. Your DLQ fills up quickly.
Solution: Wrap the Redis operations in a try/except. If Redis is unavailable, log a critical alert and fall through to process the message without idempotency protection. This risks the rare duplicate, but it is far better than refusing to process any messages at all. The SQS FIFO MessageDeduplicationId still provides the first layer of protection within its 5-minute deduplication interval.
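
A fail-open claim might look like the sketch below. The helper name try_claim is hypothetical; it assumes redis-py, whose errors derive from redis.exceptions.RedisError, and it defines a placeholder exception only so the example still runs where redis-py is not installed.

```python
import logging

try:
    from redis.exceptions import RedisError
except ImportError:  # lets the sketch run where redis-py is not installed
    class RedisError(Exception):
        """Placeholder for redis.exceptions.RedisError."""

logger = logging.getLogger(__name__)


def try_claim(client, key: str, ttl_seconds: int = 3600) -> bool:
    """Return True if the caller should go ahead and process the message."""
    try:
        # True when the key was newly set, None when it already existed
        return bool(client.set(key, "processing", nx=True, ex=ttl_seconds))
    except RedisError:
        # Fail open: a possible duplicate is preferable to a dropped message
        logger.critical("Redis unavailable; processing %s without idempotency check", key)
        return True
```

The later "done" update and the delete on failure would need the same protection, otherwise an outage that begins mid-processing still sends the message to the DLQ.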

Edge Case 3: Messages Arriving Out of Order

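If a customer sends two messages in rapid succession (“Hello” then “I need help with billing”), and your async processor handles them concurrently, the second message might create a Genesys conversation before the first, resulting in an out-of-order transcript. This can happen on a standard queue, or whenever the two messages land in different FIFO message groups.
Solution: Use SQS FIFO with MessageGroupId set to the customer’s phone number. FIFO groups guarantee ordered, sequential processing per customer, ensuring the first message always creates the conversation before the second message appends to it. The static "inbound-webhooks" value used in section 2 also preserves order, but it places every customer in a single group, so all traffic is processed one batch at a time. A per-customer group ID keeps the ordering while letting different customers be processed in parallel.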
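
The per-customer group ID can be derived when the receiver builds its send_message arguments. In this sketch, build_send_params is a hypothetical helper, the sender is assumed to be in the message object's from field (as in WhatsApp payloads), and the wa- prefix and unknown-sender fallback are illustrative choices.

```python
import json
from typing import Any


def build_send_params(
    queue_url: str,
    body: dict[str, Any],
    message: dict[str, Any],
) -> dict[str, Any]:
    """Build SQS send_message arguments with one FIFO group per sender."""
    sender = message.get("from") or "unknown-sender"
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(body),
        # Same sender -> same group -> strictly ordered processing
        "MessageGroupId": f"wa-{sender}",
        "MessageDeduplicationId": message["id"],
    }


message = {"id": "wamid.A", "from": "15551230000"}  # made-up sample values
params = build_send_params(
    "https://sqs.us-east-1.amazonaws.com/123456/inbound-messages.fifo",
    {"entry": []},
    message,
)
print(params["MessageGroupId"])
```

The result can be passed straight to SQS.send_message(**params). If phone numbers should not appear in queue metadata, a hash of the number works equally well as a group ID.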

Official References