Handling Webhook 5xx Failures with a Dead Letter Queue Pattern

Handling Webhook 5xx Failures with a Dead Letter Queue Pattern

What You Will Build

  • You will build a resilient webhook receiver that captures HTTP 5xx errors from Genesys Cloud or NICE CXone and stores failed payloads in a dead letter queue for asynchronous retry.
  • You will use the Genesys Cloud REST API for webhook configuration and a Node.js Express server with AWS SQS for the dead letter implementation.
  • You will implement the solution in JavaScript (Node.js) using the express framework and aws-sdk.

Prerequisites

  • OAuth Client Type: Machine-to-Machine (M2M) client application registered in Genesys Cloud or NICE CXone.
  • Required Scopes:
    • Genesys Cloud: webhook:write, webhook:read, integration:write (if using integration webhooks).
    • NICE CXone: webhooks:write, webhooks:read.
  • SDK/API Version: Genesys Cloud API v2, NICE CXone API v1.
  • Language/Runtime: Node.js 18+ with npm.
  • External Dependencies:
    • express: Web server framework.
    • aws-sdk: For interacting with Amazon SQS (used here as the dead letter store).
    • uuid: For generating unique identifiers for failed messages.
    • dotenv: For managing environment variables securely.

Authentication Setup

Before implementing the webhook receiver, you must authenticate to the platform to configure the webhook endpoint. This tutorial uses Genesys Cloud for the API configuration examples, but the logic applies identically to NICE CXone.

Generating an Access Token

You will use the OAuth 2.0 Client Credentials flow to obtain an access token. This token grants your application permission to create and modify webhook configurations.

// auth.js
const axios = require('axios');
require('dotenv').config();

const GENESYS_CLOUD_ENVIRONMENT = process.env.GENESYS_CLOUD_ENVIRONMENT || 'us-east-1'; // e.g., us-east-1, eu-west-1
const CLIENT_ID = process.env.CLIENT_ID;
const CLIENT_SECRET = process.env.CLIENT_SECRET;

const getAccessToken = async () => {
  const url = `https://${GENESYS_CLOUD_ENVIRONMENT}.pure.cloudapps.oculab.com/oauth/token`;
  
  const payload = new URLSearchParams({
    grant_type: 'client_credentials',
    client_id: CLIENT_ID,
    client_secret: CLIENT_SECRET
  });

  try {
    const response = await axios.post(url, payload, {
      headers: {
        'Content-Type': 'application/x-www-form-urlencoded'
      }
    });

    if (response.status !== 200) {
      throw new Error(`Failed to authenticate: ${response.statusText}`);
    }

    return response.data.access_token;
  } catch (error) {
    console.error('Authentication Error:', error.message);
    throw error;
  }
};

module.exports = { getAccessToken };

Configuring the Webhook

You will create a webhook that points to your Express application. The critical setting here is the retry_count and retry_interval. While the platform handles initial retries for transient errors, a 5xx response from your server often indicates a processing failure rather than a network glitch. You will set the platform retry count to a low value (e.g., 1) to fail fast and move the payload to your dead letter queue.

// setup-webhook.js
const axios = require('axios');
const { getAccessToken } = require('./auth');

const createWebhook = async (targetUrl) => {
  const token = await getAccessToken();
  const environment = process.env.GENESYS_CLOUD_ENVIRONMENT || 'us-east-1';
  const url = `https://${environment}.pure.cloudapps.oculab.com/api/v2/integration/webhooks`;

  const webhookConfig = {
    name: "DLQ-Enabled Webhook",
    target_url: targetUrl,
    target_url_type: "https",
    event_type: "conversation:updated", // Example event type
    retry_count: 1, // Fail fast after one attempt
    retry_interval: 1000, // 1 second between retries
    headers: {
      "Content-Type": "application/json"
    }
  };

  try {
    const response = await axios.post(url, webhookConfig, {
      headers: {
        'Authorization': `Bearer ${token}`,
        'Content-Type': 'application/json',
        'Accept': 'application/json'
      }
    });

    console.log('Webhook Created:', response.data.id);
    return response.data;
  } catch (error) {
    console.error('Failed to create webhook:', error.response?.data || error.message);
    throw error;
  }
};

// Usage:
// createWebhook('https://your-api.example.com/webhooks/genesys');

OAuth Scope Required: webhook:write or integration:write depending on the specific webhook type.

Implementation

Step 1: Setting Up the Express Server and SQS Client

You will initialize the Express server and the AWS SQS client. The SQS client will manage two queues: a standard processing queue (optional, if you want to buffer) and a Dead Letter Queue (DLQ) for failed payloads. For this tutorial, you will send failed payloads directly to the DLQ.

// server.js
const express = require('express');
const AWS = require('aws-sdk');
const { v4: uuidv4 } = require('uuid');
require('dotenv').config();

const app = express();
app.use(express.json());

// Configure AWS SDK
AWS.config.update({
  region: process.env.AWS_REGION || 'us-east-1',
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});

const sqs = new AWS.SQS();
const DLQ_URL = process.env.SQS_DLQ_URL; // The URL of your Dead Letter Queue

if (!DLQ_URL) {
  throw new Error('SQS_DLQ_URL environment variable is not set');
}

// Helper to send message to DLQ
const sendToDLQ = async (payload, originalHeaders, errorMessage) => {
  const messageId = uuidv4();
  const dlqPayload = {
    messageId,
    timestamp: new Date().toISOString(),
    originalPayload: payload,
    originalHeaders: originalHeaders,
    errorReason: errorMessage,
    retryCount: 0
  };

  const params = {
    MessageBody: JSON.stringify(dlqPayload),
    QueueUrl: DLQ_URL
  };

  try {
    await sqs.sendMessage(params).promise();
    console.log(`Message sent to DLQ with ID: ${messageId}`);
  } catch (error) {
    console.error('Failed to send to DLQ:', error);
    throw error; // Critical failure, might need logging to external service
  }
};

module.exports = { app, sendToDLQ };

Step 2: Implementing the Webhook Receiver with Error Handling

The core logic resides in the webhook endpoint. When Genesys Cloud or NICE CXone sends an event, your server processes it. If the processing logic throws an error or returns a 5xx status, you must catch that error and push the payload to the DLQ.

Crucially, you must respond to the platform with a 200 OK if you have successfully queued the message for later processing, or a 500 if the DLQ itself failed. If you return 500, the platform will retry. If you return 200, the platform assumes success. This pattern is known as “at-least-once delivery with manual acknowledgment.”

// routes/webhook.js
const { Router } = require('express');
const { sendToDLQ } = require('../server');

const router = Router();

// Simulated business logic that might fail
const processBusinessLogic = async (payload) => {
  // Example: Updating a CRM record
  // If this call fails, it throws an error
  if (!payload.id) {
    throw new Error('Missing required field: id');
  }
  
  // Simulate a random failure to demonstrate DLQ usage
  if (Math.random() < 0.3) {
    throw new Error('Simulated downstream service failure');
  }

  return { success: true };
};

router.post('/genesys', async (req, res) => {
  const payload = req.body;
  const headers = req.headers;

  try {
    // 1. Attempt to process the webhook payload
    await processBusinessLogic(payload);

    // 2. If successful, return 200 OK
    res.status(200).json({ status: 'accepted' });
  } catch (error) {
    console.error('Webhook processing failed:', error.message);

    try {
      // 3. Send to Dead Letter Queue for retry
      await sendToDLQ(payload, headers, error.message);
      
      // 4. Return 200 OK to the platform to prevent infinite retries
      // The platform assumes the message is handled because we returned 200
      res.status(200).json({ status: 'queued_for_retry' });
    } catch (dlqError) {
      // 5. If DLQ fails, return 500 so the platform retries
      // This is a critical failure path
      console.error('DLQ send failed:', dlqError.message);
      res.status(500).json({ error: 'Internal Server Error' });
    }
  }
});

module.exports = router;

Why return 200 on DLQ success?
Genesys Cloud and NICE CXone interpret a 2xx response as a successful delivery. If you return 5xx, the platform will retry the delivery according to the retry_count and retry_interval settings. If your business logic is fundamentally broken (e.g., invalid data format), retrying will never succeed. By sending the payload to a DLQ and returning 200, you acknowledge receipt and move the problematic payload to a separate pipeline for investigation and manual or automated retry.

Step 3: Building the Retry Worker

The Dead Letter Queue is only useful if you have a consumer that retrieves messages and attempts to process them again. You will create a simple Node.js script that polls the SQS DLQ, processes the message, and deletes it from the queue if successful. If processing fails again, you can implement exponential backoff or move it to a “poison pill” archive.

// worker.js
const AWS = require('aws-sdk');
const { processBusinessLogic } = require('./logic'); // Import your actual business logic
require('dotenv').config();

AWS.config.update({
  region: process.env.AWS_REGION || 'us-east-1',
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});

const sqs = new AWS.SQS();
const DLQ_URL = process.env.SQS_DLQ_URL;

const processMessage = async (message) => {
  const body = JSON.parse(message.Body);
  const originalPayload = body.originalPayload;

  try {
    // Attempt to process the original payload again
    await processBusinessLogic(originalPayload);
    
    // If successful, delete the message from the queue
    await sqs.deleteMessage({
      QueueUrl: DLQ_URL,
      ReceiptHandle: message.ReceiptHandle
    }).promise();

    console.log(`Successfully processed message ${body.messageId}`);
  } catch (error) {
    console.error(`Failed to process message ${body.messageId}:`, error.message);
    
    // Optional: Implement retry logic with exponential backoff
    // For simplicity, we leave it in the queue. 
    // SQS will make the message invisible for the VisibilityTimeout period,
    // then make it visible again for another worker to pick up.
    
    // If you want to move it to a permanent archive after N failures,
    // you would check body.retryCount and send to an Archive Queue.
  }
};

const pollQueue = async () => {
  const params = {
    QueueUrl: DLQ_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20 // Long polling
  };

  try {
    const data = await sqs.receiveMessage(params).promise();
    const messages = data.Messages || [];

    if (messages.length === 0) {
      console.log('No messages in DLQ');
      return;
    }

    console.log(`Processing ${messages.length} messages...`);
    
    // Process messages in parallel
    await Promise.all(messages.map(processMessage));
  } catch (error) {
    console.error('Error polling queue:', error);
  }
};

// Start polling
console.log('Worker started. Polling DLQ...');
setInterval(pollQueue, 5000); // Check every 5 seconds

Complete Working Example

Below is the full directory structure and code for a minimal viable product.

Directory Structure

/webhook-dlq
  /node_modules
  .env
  package.json
  auth.js
  server.js
  routes/
    webhook.js
  worker.js
  logic.js

.env

GENESYS_CLOUD_ENVIRONMENT=us-east-1
CLIENT_ID=your-client-id
CLIENT_SECRET=your-client-secret
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
SQS_DLQ_URL=https://sqs.us-east-1.amazonaws.com/123456789012/MyDeadLetterQueue
PORT=3000

package.json

{
  "name": "webhook-dlq",
  "version": "1.0.0",
  "dependencies": {
    "aws-sdk": "^2.1300.0",
    "axios": "^1.4.0",
    "dotenv": "^16.3.1",
    "express": "^4.18.2",
    "uuid": "^9.0.0"
  }
}

logic.js

// logic.js
const processBusinessLogic = async (payload) => {
  // Your actual business logic here
  if (!payload.id) throw new Error('Missing ID');
  console.log('Processing:', payload.id);
  return true;
};

module.exports = { processBusinessLogic };

server.js

// server.js
const express = require('express');
const AWS = require('aws-sdk');
const { v4: uuidv4 } = require('uuid');
const webhookRouter = require('./routes/webhook');
require('dotenv').config();

const app = express();
app.use(express.json());
app.use('/webhooks', webhookRouter);

AWS.config.update({
  region: process.env.AWS_REGION || 'us-east-1',
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});

const sqs = new AWS.SQS();
const DLQ_URL = process.env.SQS_DLQ_URL;

const sendToDLQ = async (payload, headers, errorMessage) => {
  const messageId = uuidv4();
  const dlqPayload = {
    messageId,
    timestamp: new Date().toISOString(),
    originalPayload: payload,
    originalHeaders: headers,
    errorReason: errorMessage,
    retryCount: 0
  };

  const params = {
    MessageBody: JSON.stringify(dlqPayload),
    QueueUrl: DLQ_URL
  };

  await sqs.sendMessage(params).promise();
};

// Export for use in routes if needed, or attach to app
app.locals.sendToDLQ = sendToDLQ;

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

routes/webhook.js

// routes/webhook.js
const { Router } = require('express');
const { processBusinessLogic } = require('../logic');

const router = Router();

router.post('/genesys', async (req, res) => {
  const payload = req.body;
  const headers = req.headers;
  const sendToDLQ = req.app.locals.sendToDLQ;

  try {
    await processBusinessLogic(payload);
    res.status(200).json({ status: 'accepted' });
  } catch (error) {
    console.error('Processing failed:', error.message);
    try {
      await sendToDLQ(payload, headers, error.message);
      res.status(200).json({ status: 'queued_for_retry' });
    } catch (dlqError) {
      console.error('DLQ failed:', dlqError.message);
      res.status(500).json({ error: 'Internal Server Error' });
    }
  }
});

module.exports = router;

worker.js

// worker.js
const AWS = require('aws-sdk');
const { processBusinessLogic } = require('./logic');
require('dotenv').config();

AWS.config.update({
  region: process.env.AWS_REGION || 'us-east-1',
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});

const sqs = new AWS.SQS();
const DLQ_URL = process.env.SQS_DLQ_URL;

const processMessage = async (message) => {
  const body = JSON.parse(message.Body);
  try {
    await processBusinessLogic(body.originalPayload);
    await sqs.deleteMessage({
      QueueUrl: DLQ_URL,
      ReceiptHandle: message.ReceiptHandle
    }).promise();
  } catch (error) {
    console.error(`Worker failed for ${body.messageId}:`, error.message);
  }
};

const pollQueue = async () => {
  try {
    const data = await sqs.receiveMessage({
      QueueUrl: DLQ_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20
    }).promise();
    
    const messages = data.Messages || [];
    if (messages.length > 0) {
      await Promise.all(messages.map(processMessage));
    }
  } catch (error) {
    console.error('Poll error:', error);
  }
};

setInterval(pollQueue, 5000);

Common Errors & Debugging

Error: 429 Too Many Requests

What causes it: You are sending messages to AWS SQS or processing webhooks faster than the service allows.
How to fix it: Implement exponential backoff in your retry logic. In the worker, if you receive a 429 from AWS, wait for a calculated delay before retrying.
Code Fix:

const retryWithBackoff = async (fn, retries = 3) => {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (error.code === 'Throttling' || error.statusCode === 429) {
        const delay = Math.pow(2, i) * 1000; // Exponential backoff
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries reached');
};

Error: 400 Bad Request from Genesys Cloud

What causes it: The webhook configuration is invalid, or the OAuth token used to create the webhook lacks the required scope.
How to fix it: Verify the event_type is valid for your platform version. Check the OAuth scopes associated with your client ID. Ensure the target_url is publicly accessible and returns HTTPS.
Debugging Step: Use the getAccessToken function to manually call the webhook creation endpoint with curl or Postman to isolate SDK issues.

Error: SQS Message Visibility Timeout Expired

What causes it: The worker takes longer than the VisibilityTimeout of the SQS queue to process a message.
How to fix it: Increase the VisibilityTimeout in the SQS queue settings, or extend the visibility timeout programmatically during long-running tasks using changeMessageVisibility.
Code Fix:

await sqs.changeMessageVisibility({
  QueueUrl: DLQ_URL,
  ReceiptHandle: message.ReceiptHandle,
  VisibilityTimeout: 300 // 5 minutes
}).promise();

Official References