Intercepting and Optimizing LLM Context Windows for NICE Cognigy.AI Workflows in Node.js

What You Will Build

  • The middleware intercepts incoming conversation history arrays from Cognigy.AI webhook triggers, calculates token consumption, truncates older dialogue turns while preserving system instructions and critical entities, compresses the truncated segment via a summarization endpoint, and forwards the optimized payload to an LLM gateway.
  • This implementation uses the NICE Cognigy.AI v2 REST API for conversation state management and a standard OpenAI-compatible LLM gateway for inference and summarization.
  • The code is written in Node.js using Express, async/await, and the tiktoken tokenizer library.

Prerequisites

  • OAuth client type: Confidential client configured in Cognigy.AI Studio. Required scopes: cognigy:bot:read, cognigy:conversation:read, cognigy:llm:write.
  • SDK/API version: Cognigy.AI v2 REST API.
  • Language/runtime requirements: Node.js 18 or higher, npm or pnpm package manager.
  • External dependencies: express, axios, tiktoken, dotenv.

Authentication Setup

Cognigy.AI uses standard OAuth 2.0 client credentials flow for server-to-server communication. The middleware requires a valid access token to read conversation history and write LLM integration results. Token caching prevents unnecessary authentication calls and reduces latency.

// auth.js
const axios = require('axios');
require('dotenv').config();

class CognigyAuthManager {
  constructor() {
    this.token = null;
    this.expiresAt = 0;
    this.baseUrl = process.env.COGNIGY_API_BASE;
    this.clientId = process.env.COGNIGY_CLIENT_ID;
    this.clientSecret = process.env.COGNIGY_CLIENT_SECRET;
    this.grantType = 'client_credentials';
    this.scopes = 'cognigy:bot:read cognigy:conversation:read cognigy:llm:write';
  }

  async getToken() {
    const now = Date.now();
    if (this.token && now < this.expiresAt - 60000) {
      return this.token;
    }

    try {
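      // Send the credentials as a form-encoded body so they match the declared
      // Content-Type and stay out of the URL query string and access logs
      const body = new URLSearchParams({
        client_id: this.clientId,
        client_secret: this.clientSecret,
        grant_type: this.grantType,
        scope: this.scopes
      });
      const response = await axios.post(`${this.baseUrl}/api/v2/oauth/token`, body, {
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
      });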

      this.token = response.data.access_token;
      this.expiresAt = now + (response.data.expires_in * 1000);
      return this.token;
    } catch (error) {
      if (error.response) {
        throw new Error(`OAuth authentication failed: ${error.response.status} ${error.response.data?.error_description ?? 'no error description returned'}`);
      }
      throw error;
    }
  }
}

module.exports = new CognigyAuthManager();

The request sends client credentials to the /api/v2/oauth/token endpoint. The response returns a JWT access token and an expiration window. The manager caches the token and refreshes it sixty seconds before expiration to prevent mid-request authentication failures.
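
The caching rule can be checked in isolation from the network. The following is a minimal sketch, not the manager above: createTokenCache, its fetchToken argument, and the injectable now clock are hypothetical names introduced here so the sixty-second buffer can be exercised with a fake clock.

```javascript
// Hypothetical helper: a token cache with an injectable fetcher and clock.
// fetchToken must resolve to { access_token, expires_in } (expires_in in seconds).
function createTokenCache(fetchToken, { bufferMs = 60000, now = Date.now } = {}) {
  let token = null;
  let expiresAt = 0;

  return async function getToken() {
    const current = now();
    // Reuse the cached token while it is more than bufferMs away from expiry
    if (token && current < expiresAt - bufferMs) {
      return token;
    }
    const { access_token, expires_in } = await fetchToken();
    token = access_token;
    expiresAt = current + expires_in * 1000;
    return token;
  };
}

// Demo with a fake fetcher and a controllable clock (all values in milliseconds)
async function demoTokenCache() {
  let clock = 0;
  let calls = 0;
  const getToken = createTokenCache(
    async () => ({ access_token: `token-${++calls}`, expires_in: 3600 }),
    { now: () => clock }
  );

  const first = await getToken();   // no cached token yet, so it fetches
  clock = 3000000;                  // 50 minutes in, still outside the buffer
  const second = await getToken();  // served from the cache
  clock = 3541000;                  // 59 seconds before expiry, inside the buffer
  const third = await getToken();   // refreshed early
  return { first, second, third, calls };
}
```

Injecting the clock keeps the refresh rule testable without waiting for a real token to age.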

Implementation

Step 1: Middleware Setup and Conversation History Interception

The Express middleware catches POST requests from Cognigy.AI workflow triggers. It extracts the messages array, validates the structure, and prepares it for token analysis. Cognigy.AI passes conversation history as an array of objects containing role, content, and optional metadata.

// middleware.js
const express = require('express');
const router = express.Router();
const { processContextWindow } = require('./contextProcessor');
const { forwardToLLMGateway } = require('./llmGateway');

router.post('/cognigy/llm-trigger', async (req, res) => {
  try {
    const { botId, conversationId, messages, metadata } = req.body;

    if (!messages || !Array.isArray(messages)) {
      return res.status(400).json({ error: 'Missing or invalid messages array' });
    }

    // Intercept and process the conversation history
    const optimizedPayload = await processContextWindow(messages, metadata);

    // Forward optimized payload to LLM gateway
    const llmResponse = await forwardToLLMGateway(optimizedPayload);

    // Return structured response to Cognigy.AI
    return res.json({
      conversationId,
      botId,
      response: llmResponse.choices[0].message.content,
      metadata: {
        originalTokens: optimizedPayload.originalTokenCount,
        optimizedTokens: optimizedPayload.optimizedTokenCount,
        compressionApplied: optimizedPayload.compressionApplied
      }
    });
  } catch (error) {
    console.error('LLM Trigger Middleware Error:', error.message);
    return res.status(500).json({ error: 'Internal processing failure', details: error.message });
  }
});

module.exports = router;

The middleware expects a payload matching the Cognigy.AI LLM node webhook schema. It validates the messages array, delegates token optimization to a processor module, forwards the result to the LLM gateway, and returns a structured JSON response. The response includes metadata tracking token reduction for observability.
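
The Array.isArray check above does not look inside the array, so a message with missing or non-string content would most likely fail later during token counting and surface as a 500 instead of a 400. A minimal sketch of a stricter check follows; validateMessages is a hypothetical helper, and the rules it enforces are assumptions based on the role/content shape described above rather than a published Cognigy.AI schema.

```javascript
// Hypothetical validator for the incoming messages array.
// Returns { valid: true } or { valid: false, error: '...' }.
function validateMessages(messages) {
  if (!Array.isArray(messages)) {
    return { valid: false, error: 'messages must be an array' };
  }
  for (let i = 0; i < messages.length; i++) {
    const msg = messages[i];
    if (msg === null || typeof msg !== 'object') {
      return { valid: false, error: `messages[${i}] must be an object` };
    }
    if (typeof msg.role !== 'string' || msg.role.length === 0) {
      return { valid: false, error: `messages[${i}].role must be a non-empty string` };
    }
    if (typeof msg.content !== 'string') {
      return { valid: false, error: `messages[${i}].content must be a string` };
    }
  }
  return { valid: true };
}

// Example: the second message has no content field
const check = validateMessages([
  { role: 'system', content: 'You are a billing assistant' },
  { role: 'user' }
]);
```

In the route handler, a failed check could be returned with res.status(400) before any token counting runs.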

Step 2: Token Counting and Strategic Truncation

Token counting uses the tiktoken library with the cl100k_base encoding, which matches GPT-3.5 and GPT-4 tokenization behavior. The processor preserves system prompts and messages flagged as critical entities. It truncates older user and assistant turns until the total token count falls below the configured threshold.

// contextProcessor.js
const { Tiktoken } = require('tiktoken/lite');
const cl100k_base = require('tiktoken/encodings/cl100k_base.json');
const encoder = new Tiktoken(
  cl100k_base.bpe_ranks,
  cl100k_base.special_tokens,
  cl100k_base.pat_str
);

const MAX_TOKENS = 4000;
const SUMMARY_THRESHOLD = 2000;

function countTokens(text) {
  const tokens = encoder.encode(text);
  return tokens.length;
}

function isCriticalMessage(msg) {
  return msg.role === 'system' || (msg.metadata && msg.metadata.critical === true);
}

async function processContextWindow(messages, metadata = {}) {
  const originalTokenCount = messages.reduce((sum, msg) => sum + countTokens(msg.content), 0);
  
  if (originalTokenCount <= MAX_TOKENS) {
    return {
      messages,
      originalTokenCount,
      optimizedTokenCount: originalTokenCount,
      compressionApplied: false
    };
  }

  // Separate critical and non-critical messages
  const criticalMessages = messages.filter(isCriticalMessage);
  const regularMessages = messages.filter(msg => !isCriticalMessage(msg));

  // Critical messages are always kept, so they consume the budget first
  let currentTokens = criticalMessages.reduce((sum, msg) => sum + countTokens(msg.content), 0);

  // Walk regular messages from newest to oldest and stop at the first one
  // that does not fit, so the kept turns form one contiguous recent window
  let keptCount = 0;
  for (let i = regularMessages.length - 1; i >= 0; i--) {
    const msgTokens = countTokens(regularMessages[i].content);
    if (currentTokens + msgTokens > MAX_TOKENS) {
      break;
    }
    currentTokens += msgTokens;
    keptCount++;
  }

  // Identify messages that were dropped for summarization (everything older than the kept window)
  const droppedMessages = regularMessages.slice(0, regularMessages.length - keptCount);
  const droppedSet = new Set(droppedMessages);

  // Rebuild the kept list in original order so system prompts stay ahead of the dialogue
  const truncatedMessages = messages.filter(msg => !droppedSet.has(msg));

  if (droppedMessages.length > 0) {
    const summary = await compressContext(droppedMessages);
    // Place the summary just before the first kept regular turn
    const firstRegular = truncatedMessages.findIndex(msg => !isCriticalMessage(msg));
    const insertAt = firstRegular === -1 ? truncatedMessages.length : firstRegular;
    truncatedMessages.splice(insertAt, 0, {
      role: 'assistant',
      content: `Previous conversation summary: ${summary}`,
      metadata: { generated: 'context-compressor' }
    });
  }

  const optimizedTokenCount = truncatedMessages.reduce((sum, msg) => sum + countTokens(msg.content), 0);

  return {
    messages: truncatedMessages,
    originalTokenCount,
    optimizedTokenCount,
    compressionApplied: droppedMessages.length > 0
  };
}

// compressContext is defined later in this file (Step 3). Function declarations
// are hoisted, so processContextWindow can call it from here.

module.exports = { processContextWindow };

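The processor sums the encoder.encode() length of each message's content to get the total token count. System prompts and messages flagged as critical go into a protected set that is never truncated. The remaining messages are walked from newest to oldest and kept until the next one would exceed MAX_TOKENS; everything older is dropped and queued for compression. The function returns the optimized array alongside token metrics. Two caveats apply. The count covers message content only, and chat APIs add a few tokens of framing per message, so keep MAX_TOKENS comfortably below the model's actual context limit. The generated summary is added after the budget check, so optimizedTokenCount can exceed MAX_TOKENS by up to the summary's length.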
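
A small worked example makes the selection rule concrete. This is a sketch under stated assumptions: selectWindow and countWords are hypothetical names, and the word-based counter is a stand-in used only so the arithmetic can be followed by hand (the processor itself counts tiktoken tokens).

```javascript
// Stand-in counter for illustration only: one "token" per whitespace-separated word.
const countWords = (text) => text.split(/\s+/).filter(Boolean).length;

// Hypothetical helper: keep critical messages plus the newest regular turns that fit.
function selectWindow(messages, maxTokens, countTokens = countWords) {
  const isCritical = (m) => m.role === 'system' || m.metadata?.critical === true;

  // Critical messages always count against the budget
  let used = messages.filter(isCritical).reduce((sum, m) => sum + countTokens(m.content), 0);

  // Walk regular messages newest to oldest and stop at the first that does not fit
  const regular = messages.filter((m) => !isCritical(m));
  let keptCount = 0;
  for (let i = regular.length - 1; i >= 0; i--) {
    const cost = countTokens(regular[i].content);
    if (used + cost > maxTokens) break;
    used += cost;
    keptCount++;
  }

  const dropped = regular.slice(0, regular.length - keptCount);
  const droppedSet = new Set(dropped);
  // Filtering the original array preserves chronological order
  const kept = messages.filter((m) => !droppedSet.has(m));
  return { kept, dropped, used };
}

const history = [
  { role: 'system', content: 'You are a billing assistant' },                            // 5 words
  { role: 'user', content: 'My invoice from March looks wrong' },                         // 6 words
  { role: 'assistant', content: 'I can check that for you' },                             // 6 words
  { role: 'user', content: 'The order number is 48213', metadata: { critical: true } },   // 5 words
  { role: 'assistant', content: 'Thanks, I found the order' },                            // 5 words
  { role: 'user', content: 'Please refund the duplicate charge' }                         // 5 words
];

// Budget of 20: the two critical messages use 10, the two newest regular turns use 10 more
const result = selectWindow(history, 20);
```

With a budget of 20, the two oldest regular turns are dropped and would be sent to summarization, while the system prompt and the flagged order number survive in their original positions.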

Step 3: Context Compression via Summarization API

The compression step sends dropped conversation turns to a summarization endpoint. The implementation uses exponential backoff for 429 rate limit responses. It constructs a concise summary that preserves intent and key data points without retaining conversational filler.

// contextProcessor.js (continued)
const axios = require('axios');

const SUMMARY_API_URL = process.env.SUMMARY_API_URL || 'https://api.openai.com/v1/chat/completions';
const SUMMARY_API_KEY = process.env.OPENAI_API_KEY;

async function compressContext(droppedMessages) {
  const systemPrompt = {
    role: 'system',
    content: 'Summarize the following conversation history concisely. Preserve user intent, key entities, and decision points. Output only the summary text.'
  };

  const userPrompt = {
    role: 'user',
    content: droppedMessages.map(m => `${m.role.toUpperCase()}: ${m.content}`).join('\n')
  };

  const maxRetries = 3;
  let attempt = 0;

  while (attempt < maxRetries) {
    try {
      const response = await axios.post(SUMMARY_API_URL, {
        model: 'gpt-3.5-turbo',
        messages: [systemPrompt, userPrompt],
        max_tokens: 500,
        temperature: 0.2
      }, {
        headers: {
          'Authorization': `Bearer ${SUMMARY_API_KEY}`,
          'Content-Type': 'application/json'
        },
        timeout: 10000
      });

      return response.data.choices[0].message.content.trim();
    } catch (error) {
      if (error.response && error.response.status === 429) {
        attempt++;
        if (attempt >= maxRetries) break; // Out of retries: skip the final wait
        const delay = Math.pow(2, attempt) * 1000;
        console.warn(`Summarization API rate limited. Retrying in ${delay}ms...`);
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      }
      throw new Error(`Summarization failed: ${error.message}`);
    }
  }

  throw new Error('Summarization API exhausted retry attempts');
}

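The summarization request sends a system instruction and the concatenated conversation turns to the /v1/chat/completions endpoint. It uses a low temperature of 0.2 to keep summaries consistent between runs; a low temperature reduces variation but does not make the output fully deterministic. The retry loop handles 429 responses with exponential backoff and throws an error after three failed attempts so that a persistently throttled endpoint cannot stall the request indefinitely. Any other failure is rethrown immediately without a retry.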
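
The backoff pattern can be factored into a reusable wrapper. The sketch below is one way to do that, not the tutorial's own implementation: retryOn429 and its sleep option are hypothetical names, and the injectable sleep exists so the retry schedule can be observed without real waiting. It assumes errors carry an axios-style error.response.status.

```javascript
// Hypothetical helper: retry an async operation on HTTP 429 with exponential backoff.
// maxRetries is the total number of attempts, matching the loop in compressContext.
async function retryOn429(operation, { maxRetries = 3, baseDelayMs = 1000, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      const rateLimited = Boolean(error.response && error.response.status === 429);
      // Rethrow anything that is not a rate limit, or the last allowed attempt
      if (!rateLimited || attempt >= maxRetries) throw error;
      await wait(baseDelayMs * 2 ** attempt); // 2s, then 4s with the defaults
    }
  }
}

// Demo: an operation that is rate limited twice, then succeeds
async function demoRetry() {
  const delays = [];
  let calls = 0;
  const flaky = async () => {
    calls++;
    if (calls < 3) {
      const err = new Error('rate limited');
      err.response = { status: 429 };
      throw err;
    }
    return 'summary text';
  };
  const value = await retryOn429(flaky, { sleep: async (ms) => { delays.push(ms); } });
  return { value, calls, delays };
}
```

The same wrapper could be placed around the gateway call in Step 4, which currently has no retry handling of its own.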

Step 4: Payload Injection and LLM Gateway Forwarding

The final step injects the optimized payload into the LLM gateway request. It preserves the original request structure while replacing the messages array with the compressed version. It forwards the request and returns the full completion in a single response; the example does not implement streaming.

// llmGateway.js
const axios = require('axios');

const LLM_GATEWAY_URL = process.env.LLM_GATEWAY_URL || 'https://api.openai.com/v1/chat/completions';
const LLM_GATEWAY_KEY = process.env.OPENAI_API_KEY;

async function forwardToLLMGateway(optimizedPayload) {
  const requestPayload = {
    model: process.env.LLM_MODEL || 'gpt-4',
    messages: optimizedPayload.messages,
    temperature: 0.7,
    max_tokens: 1000
  };

  try {
    const response = await axios.post(LLM_GATEWAY_URL, requestPayload, {
      headers: {
        'Authorization': `Bearer ${LLM_GATEWAY_KEY}`,
        'Content-Type': 'application/json'
      },
      timeout: 30000
    });

    return response.data;
  } catch (error) {
    if (error.response) {
      throw new Error(`LLM Gateway Error: ${error.response.status} ${error.response.data.error?.message || 'Unknown'}`);
    }
    throw error;
  }
}

module.exports = { forwardToLLMGateway };

The gateway module constructs a standard chat completion request. It passes the optimized messages array directly to the LLM provider. It captures the response and returns it to the middleware. Error handling captures HTTP status codes and extracts provider-specific error messages for debugging.
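
The middleware in Step 1 reads llmResponse.choices[0].message.content directly, which throws a TypeError if the gateway returns an empty choices array or an unexpected body. A small guard gives a clearer failure. This is a sketch; extractCompletionText is a hypothetical helper, and the response shape it checks is the OpenAI-compatible one the tutorial already assumes.

```javascript
// Hypothetical guard around the response shape read by the middleware.
function extractCompletionText(data) {
  const content = data?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('LLM gateway response did not contain choices[0].message.content');
  }
  return content;
}

// Example with a well-formed response body
const text = extractCompletionText({
  choices: [{ message: { role: 'assistant', content: 'Your refund has been issued.' } }]
});
```

Calling it with an empty choices array throws the descriptive error above instead of a generic property-access TypeError.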

Complete Working Example

The following file combines authentication, middleware, context processing, and gateway forwarding into a single runnable Express server. Set the environment variables it reads to your Cognigy.AI and LLM gateway credentials.

// server.js
require('dotenv').config();
const express = require('express');
const axios = require('axios');
const { Tiktoken } = require('tiktoken/lite');
const cl100k_base = require('tiktoken/encodings/cl100k_base.json');
const encoder = new Tiktoken(
  cl100k_base.bpe_ranks,
  cl100k_base.special_tokens,
  cl100k_base.pat_str
);

const app = express();
app.use(express.json({ limit: '10mb' }));

// Configuration
const COGNIGY_API_BASE = process.env.COGNIGY_API_BASE;
const COGNIGY_CLIENT_ID = process.env.COGNIGY_CLIENT_ID;
const COGNIGY_CLIENT_SECRET = process.env.COGNIGY_CLIENT_SECRET;
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;
const MAX_TOKENS = 4000;

// Token Cache
let authToken = null;
let tokenExpiresAt = 0;

async function getCognigyToken() {
  const now = Date.now();
  if (authToken && now < tokenExpiresAt - 60000) return authToken;

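  // Send credentials in a form-encoded body rather than the query string
  const res = await axios.post(`${COGNIGY_API_BASE}/api/v2/oauth/token`, new URLSearchParams({
    client_id: COGNIGY_CLIENT_ID,
    client_secret: COGNIGY_CLIENT_SECRET,
    grant_type: 'client_credentials',
    scope: 'cognigy:bot:read cognigy:conversation:read cognigy:llm:write'
  }), {
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
  });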

  authToken = res.data.access_token;
  tokenExpiresAt = now + (res.data.expires_in * 1000);
  return authToken;
}

function countTokens(text) {
  return encoder.encode(text).length;
}

async function compressContext(droppedMessages) {
  const systemPrompt = { role: 'system', content: 'Summarize the following conversation history concisely. Preserve user intent, key entities, and decision points.' };
  const userPrompt = { role: 'user', content: droppedMessages.map(m => `${m.role.toUpperCase()}: ${m.content}`).join('\n') };

  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      const res = await axios.post('https://api.openai.com/v1/chat/completions', {
        model: 'gpt-3.5-turbo', messages: [systemPrompt, userPrompt], max_tokens: 500, temperature: 0.2
      }, { headers: { 'Authorization': `Bearer ${OPENAI_API_KEY}` }, timeout: 10000 });
      return res.data.choices[0].message.content.trim();
    } catch (err) {
      if (err.response?.status === 429) {
        await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
        continue;
      }
      throw err;
    }
  }
  throw new Error('Summarization retry limit exceeded');
}

app.post('/cognigy/llm-trigger', async (req, res) => {
  try {
    const { botId, conversationId, messages, metadata } = req.body;
    if (!Array.isArray(messages)) return res.status(400).json({ error: 'Invalid messages array' });

    const originalTokens = messages.reduce((sum, m) => sum + countTokens(m.content), 0);
    if (originalTokens <= MAX_TOKENS) {
      const llmRes = await axios.post('https://api.openai.com/v1/chat/completions', {
        model: 'gpt-4', messages, temperature: 0.7, max_tokens: 1000
      }, { headers: { 'Authorization': `Bearer ${OPENAI_API_KEY}` } });
      return res.json({ conversationId, response: llmRes.data.choices[0].message.content, metadata: { originalTokens, optimizedTokens: originalTokens, compressionApplied: false } });
    }

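    const isCritical = m => m.role === 'system' || m.metadata?.critical === true;
    const critical = messages.filter(isCritical);
    const regular = messages.filter(m => !isCritical(m));
    let currentTokens = critical.reduce((s, m) => s + countTokens(m.content), 0);

    // Keep the newest regular turns that fit; stop at the first one that does not
    let keptCount = 0;
    for (let i = regular.length - 1; i >= 0; i--) {
      const msgTokens = countTokens(regular[i].content);
      if (currentTokens + msgTokens > MAX_TOKENS) break;
      currentTokens += msgTokens;
      keptCount++;
    }
    const dropped = regular.slice(0, regular.length - keptCount);
    const droppedSet = new Set(dropped);

    // Preserve original order so system prompts stay ahead of the dialogue
    const optimizedMessages = messages.filter(m => !droppedSet.has(m));
    let compressionApplied = false;
    if (dropped.length > 0) {
      const summary = await compressContext(dropped);
      // Place the summary just before the first kept regular turn
      const firstRegular = optimizedMessages.findIndex(m => !isCritical(m));
      const insertAt = firstRegular === -1 ? optimizedMessages.length : firstRegular;
      optimizedMessages.splice(insertAt, 0, { role: 'assistant', content: `Previous conversation summary: ${summary}`, metadata: { generated: 'compressor' } });
      compressionApplied = true;
    }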

    const optimizedTokens = optimizedMessages.reduce((s, m) => s + countTokens(m.content), 0);
    const llmRes = await axios.post('https://api.openai.com/v1/chat/completions', {
      model: 'gpt-4', messages: optimizedMessages, temperature: 0.7, max_tokens: 1000
    }, { headers: { 'Authorization': `Bearer ${OPENAI_API_KEY}` } });

    return res.json({
      conversationId,
      response: llmRes.data.choices[0].message.content,
      metadata: { originalTokens, optimizedTokens, compressionApplied }
    });
  } catch (error) {
    console.error('Processing error:', error.message);
    return res.status(500).json({ error: 'Internal failure', details: error.message });
  }
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => console.log(`Cognigy.AI LLM Context Middleware running on port ${PORT}`));

The server initializes the tokenizer and exposes a single endpoint. It calculates token usage, applies truncation rules, calls the summarization API with retry logic, and forwards the optimized payload to the LLM gateway. It returns structured responses with token metrics for monitoring. The getCognigyToken helper caches OAuth tokens for calls back to the Cognigy.AI API; the trigger endpoint itself does not call it.

Common Errors & Debugging

Error: 401 Unauthorized

  • What causes it: The OAuth token has expired, the client credentials are incorrect, or the requested scopes are not granted to the application.
  • How to fix it: Verify the COGNIGY_CLIENT_ID and COGNIGY_CLIENT_SECRET environment variables. Check the Cognigy.AI application configuration to ensure cognigy:bot:read, cognigy:conversation:read, and cognigy:llm:write scopes are assigned. Implement token refresh logic with a safety buffer.
  • Code showing the fix:
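// Add scope validation before API calls. A JWT is base64url-encoded, so
// inspect the decoded payload rather than searching the raw token string.
// Assumption: scopes arrive in a space-delimited `scope` claim.
const payload = JSON.parse(Buffer.from(token.split('.')[1], 'base64url').toString('utf8'));
if (!String(payload.scope || '').split(' ').includes('cognigy:llm:write')) {
  throw new Error('Missing required scope: cognigy:llm:write');
}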

Error: 429 Too Many Requests

  • What causes it: The LLM gateway or summarization endpoint has exceeded rate limits. Concurrent workflow triggers can cascade into rapid throttling.
  • How to fix it: Implement exponential backoff with jitter. Queue incoming requests using a rate limiter middleware. Monitor token consumption to reduce unnecessary summarization calls.
  • Code showing the fix:
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
const jitter = (base) => base + Math.random() * (base / 2);

if (error.response?.status === 429) {
  // Retry-After may be missing or non-numeric; fall back to 2 seconds
  const retryAfter = Number(error.response.headers['retry-after']) || 2;
  await delay(jitter(retryAfter * 1000));
  // Retry logic continues
}
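
The Retry-After header can carry either a delay in seconds or an HTTP date, so a numeric read only covers the first form. The sketch below handles both; retryAfterMs, its fallbackMs default, and the injectable now value are hypothetical and chosen for illustration.

```javascript
// Hypothetical helper: convert a Retry-After header value into milliseconds.
function retryAfterMs(headerValue, { fallbackMs = 2000, now = Date.now() } = {}) {
  if (headerValue === undefined || headerValue === null || headerValue === '') {
    return fallbackMs;
  }
  const raw = String(headerValue).trim();
  // Form 1: delay in whole seconds, for example "5"
  if (/^\d+$/.test(raw)) {
    return Number(raw) * 1000;
  }
  // Form 2: HTTP date, for example "Wed, 21 Oct 2015 07:28:00 GMT"
  const target = Date.parse(raw);
  if (Number.isNaN(target)) {
    return fallbackMs;
  }
  // A date in the past means the caller may retry immediately
  return Math.max(0, target - now);
}

// Examples
const fromSeconds = retryAfterMs('5');
const fromDate = retryAfterMs('Wed, 21 Oct 2015 07:28:00 GMT', {
  now: Date.parse('Wed, 21 Oct 2015 07:27:30 GMT')
});
const fromMissing = retryAfterMs(undefined);
```

The result can be passed through the jitter function above before sleeping, so concurrent workflow triggers do not all retry at the same instant.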

Error: 400 Bad Request (Invalid Message Role)

  • What causes it: The LLM gateway rejects the payload because the messages array contains unsupported roles, empty content fields, or malformed JSON structure.
  • How to fix it: Validate each message object before forwarding. Strip empty strings. Ensure roles match system, user, or assistant. Remove metadata fields that the target gateway does not support.
  • Code showing the fix:
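// Keep only supported roles and non-empty string content; drop extra fields such as metadata
const ALLOWED_ROLES = new Set(['system', 'user', 'assistant']);
const sanitizedMessages = optimizedMessages
  .filter(m => ALLOWED_ROLES.has(m.role) && typeof m.content === 'string' && m.content.trim().length > 0)
  .map(m => ({ role: m.role, content: m.content.trim(), ...(m.name ? { name: m.name } : {}) }));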

Official References