Extracting NICE Cognigy Bot Conversation Transcripts for Sentiment Analysis

What You Will Build

This tutorial builds a Python script that retrieves completed conversation logs from a NICE Cognigy tenant and flattens them into a structured dataset for downstream sentiment analysis pipelines. The implementation uses the Cognigy.AI REST API v1, handles page-based pagination (page and pageSize parameters), and retries rate-limited requests by honoring the Retry-After header, with capped exponential backoff as a fallback. The code targets Python 3.10+ (the type hints use the X | None union syntax, which fails at definition time on 3.9) and uses the requests library.

Prerequisites

  • A NICE Cognigy tenant with an API Key generated in Studio Settings
  • API Key permissions: Conversations: Read and Logs: Read
  • Python 3.10 or newer (the code uses the X | None annotation syntax)
  • External dependencies: requests, python-dotenv
  • Basic understanding of HTTP status codes and REST pagination patterns

Authentication Setup

Cognigy.AI does not use standard OAuth 2.0 client credentials flows. Instead, it uses tenant-scoped API keys that function as bearer tokens. The key must be passed in the Authorization header with the Bearer prefix. The API validates the key against the tenant and enforces role-based permissions. If the key lacks the Conversations: Read scope, the API returns a 403 Forbidden response.

The following configuration loads the tenant URL and API key from environment variables. This approach prevents credential leakage in version control.

import os
from dotenv import load_dotenv
from requests import Session

load_dotenv()

COGNIGY_TENANT_URL = os.getenv("COGNIGY_TENANT_URL", "https://your-tenant.cognigy.ai")
COGNIGY_API_KEY = os.getenv("COGNIGY_API_KEY")

if not COGNIGY_API_KEY:
    raise ValueError("COGNIGY_API_KEY environment variable is required")

def create_authenticated_session() -> Session:
    session = Session()
    session.headers.update({
        "Authorization": f"Bearer {COGNIGY_API_KEY}",
        "Accept": "application/json",
        "Content-Type": "application/json",
        "User-Agent": "Cognigy-Transcript-Extractor/1.0"
    })
    return session

The Session object attaches the authentication header to every request it sends, and Cognigy validates that header on each call. If the key expires or is revoked, the API returns 401 Unauthorized. Reusing one session also avoids rebuilding headers per request and keeps TCP connections alive through connection pooling, which helps during high-volume pagination.

Implementation

Step 1: Initialize the API Client with Rate Limit Handling

Cognigy enforces tenant-level rate limits. When you exceed the threshold, the API returns HTTP 429 Too Many Requests with a Retry-After header indicating how many seconds to wait. A naive script fails on the first 429. A more robust script parses Retry-After, falls back to exponential backoff when the header is missing or unparseable, and gives up after a bounded number of attempts rather than waiting indefinitely.

The following class wraps requests.Session with that retry logic. It catches 429 responses, extracts the retry delay, caps the fallback backoff at 60 seconds, and stops after max_retries attempts.

import time
import logging
from requests import Session, Response
from requests.exceptions import RequestException

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

class CognigyAPIClient:
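    def __init__(self, tenant_url: str, api_key: str):
        self.base_url = tenant_url.rstrip("/")
        self.session = create_authenticated_session()
        # Use the key passed to the constructor rather than the module-level default
        self.session.headers["Authorization"] = f"Bearer {api_key}"
        self.max_retries = 5
        self.base_delay = 2.0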

    def _handle_rate_limit(self, response: Response, attempt: int) -> float:
        retry_after = response.headers.get("Retry-After")
        if retry_after:
            try:
                return float(retry_after)
            except ValueError:
                pass
        return min(self.base_delay * (2 ** attempt), 60.0)

    def get(self, endpoint: str, params: dict | None = None) -> dict:
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        attempt = 0

        while attempt < self.max_retries:
            try:
                response = self.session.get(url, params=params, timeout=30)
                
                if response.status_code == 429:
                    delay = self._handle_rate_limit(response, attempt)
                    logger.warning("Rate limited. Retrying in %.2f seconds (attempt %d)", delay, attempt + 1)
                    time.sleep(delay)
                    attempt += 1
                    continue
                
                response.raise_for_status()
                return response.json()
                
            except RequestException as exc:
                logger.error("Request failed: %s", exc)
                raise

        raise RuntimeError("Max retries exceeded due to rate limiting")

The _handle_rate_limit method prioritizes the Retry-After header because the server knows when the quota window resets. If the header is missing or is not a plain number of seconds, the script falls back to exponential backoff capped at sixty seconds. Note that the fallback has no random jitter, so several scripts that hit the limit at the same moment will retry in lockstep; add jitter if multiple extractors poll the same tenant.
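
The delay calculation can be sketched on its own, without the HTTP client. Under the HTTP specification, Retry-After may carry either a number of seconds or an HTTP-date, and the client above only understands the numeric form. The helper below is a minimal sketch: compute_retry_delay is a hypothetical name, and the random jitter on the fallback path is an addition that the original client does not have.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def compute_retry_delay(
    retry_after: Optional[str],
    attempt: int,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    now: Optional[datetime] = None,
) -> float:
    """Return seconds to wait before the next attempt (hypothetical helper)."""
    if retry_after:
        # Form 1: delta-seconds, e.g. "Retry-After: 5"
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass
        # Form 2: HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
        try:
            target = parsedate_to_datetime(retry_after)
            if target.tzinfo is None:
                target = target.replace(tzinfo=timezone.utc)
            current = now or datetime.now(timezone.utc)
            return max(0.0, (target - current).total_seconds())
        except (TypeError, ValueError):
            # Unparseable header: fall through to the backoff below
            pass
    # Fallback: capped exponential backoff with full jitter (an assumption,
    # not part of the original client), so parallel scripts spread out.
    ceiling = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0.0, ceiling)
```

To try it in the client, _handle_rate_limit could return compute_retry_delay(response.headers.get("Retry-After"), attempt, self.base_delay). The now parameter exists only so the HTTP-date branch can be tested with a fixed clock.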

Step 2: Implement Paginated Conversation Retrieval

The /api/v1/conversations endpoint returns a paginated list of conversation metadata. Cognigy uses offset pagination with page and pageSize query parameters. The maximum pageSize is typically one hundred. The response includes a pagination object containing totalItems. You must iterate until the retrieved items equal totalItems or the returned array length falls below pageSize.

The following method fetches all completed conversations within a configurable date range. It filters by status=completed to exclude active sessions that may still be writing logs.

from typing import Iterator

class ConversationExtractor(CognigyAPIClient):
    def __init__(self, tenant_url: str, api_key: str, page_size: int = 100):
        super().__init__(tenant_url, api_key)
        self.page_size = min(page_size, 100)

    def list_conversations(self, from_date: str | None = None, to_date: str | None = None) -> Iterator[dict]:
        page = 1
        total_items = None

        while True:
            params = {
                "status": "completed",
                "page": page,
                "pageSize": self.page_size
            }
            if from_date:
                params["from"] = from_date
            if to_date:
                params["to"] = to_date

            response = self.get("/api/v1/conversations", params=params)
            data = response.get("data", [])
            pagination = response.get("pagination", {})

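            if page == 1:
                # Keep a missing totalItems as None; defaulting to 0 would end the loop after page 1
                total_items = pagination.get("totalItems")
                logger.info("Total completed conversations reported: %s", total_items)

            for conversation in data:
                yield conversation

            # Stop on a short or empty page, or once the reported total has been covered
            reached_total = total_items is not None and page * self.page_size >= total_items
            if len(data) < self.page_size or reached_total:
                break
            page += 1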

The generator pattern (yield) keeps memory use flat when processing tens of thousands of conversations, because only one page is held at a time. Cognigy returns conversation objects containing conversationId, channel, startTimestamp, and endTimestamp. The pagination loop terminates when the API returns fewer items than requested (including an empty page) or when page * pageSize reaches totalItems. The short-page check matters because totalItems can go stale during a long run, for example after concurrent deletions or tenant configuration changes.
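
The termination rules are easier to check against an in-memory stand-in than against a live tenant. The sketch below assumes the response shape described above (a data array plus an optional pagination.totalItems); paginate and make_fake_api are hypothetical names used only for illustration, and the fake API is a test double, not Cognigy behavior.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple


def paginate(fetch_page: Callable[[int, int], Dict], page_size: int = 100) -> Iterator[Dict]:
    """Yield items page by page until a short page or the reported total is reached."""
    page = 1
    total_items: Optional[int] = None
    while True:
        response = fetch_page(page, page_size)
        data = response.get("data", [])
        if page == 1:
            # A missing totalItems stays None instead of being treated as zero.
            total_items = response.get("pagination", {}).get("totalItems")
        yield from data
        reached_total = total_items is not None and page * page_size >= total_items
        if len(data) < page_size or reached_total:
            break
        page += 1


def make_fake_api(
    items: List[Dict], include_total: bool = True
) -> Tuple[Callable[[int, int], Dict], List[int]]:
    """Build an in-memory stand-in for the list endpoint and a log of requested pages."""
    calls: List[int] = []

    def fetch_page(page: int, page_size: int) -> Dict:
        calls.append(page)
        start = (page - 1) * page_size
        body: Dict = {"data": items[start:start + page_size]}
        if include_total:
            body["pagination"] = {"totalItems": len(items)}
        return body

    return fetch_page, calls


conversations = [{"conversationId": f"c{i}"} for i in range(250)]
fake, calls = make_fake_api(conversations)
fetched = list(paginate(fake, page_size=100))
print(len(fetched), calls)  # 250 [1, 2, 3]
```

With 250 items the third page is short, so the loop stops after three requests. When the total is reported and is an exact multiple of the page size, the totalItems check saves the extra request that would otherwise return an empty page.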

Step 3: Extract and Flatten Conversation Logs

Each conversation log resides at /api/v1/conversations/{conversationId}/logs. The endpoint returns an array of log entries. Each entry contains a type field (user, bot, or system), a timestamp, and a text payload. Sentiment analysis pipelines require user utterances isolated from bot responses and system events. The following method fetches logs, filters for user messages, and flattens them into a uniform schema.

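from datetime import datetime, timezone

class TranscriptProcessor(ConversationExtractor):
    def extract_user_transcripts(self, from_date: str | None = None, to_date: str | None = None) -> list[dict]:
        transcripts = []

        for conv in self.list_conversations(from_date, to_date):
            conv_id = conv.get("conversationId")
            if not conv_id:
                continue

            try:
                logs = self.get(f"/api/v1/conversations/{conv_id}/logs")
            except RequestException as exc:
                # Skip this conversation and keep going with the rest
                logger.warning("Failed to fetch logs for %s: %s", conv_id, exc)
                continue

            # Assumption: accept a bare array, or an object wrapping the array under "data"
            entries = logs.get("data", []) if isinstance(logs, dict) else logs

            for entry in entries:
                # Keep only user utterances; drop bot and system entries
                if entry.get("type") != "user":
                    continue

                # "text" may be missing or null, so guard before stripping
                text = (entry.get("text") or "").strip()
                if not text:
                    continue

                transcripts.append({
                    "conversationId": conv_id,
                    "channel": conv.get("channel", "unknown"),
                    "timestamp": entry.get("timestamp"),
                    "userText": text,
                    # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
                    "extractedAt": datetime.now(timezone.utc).isoformat()
                })

        return transcripts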

The script filters type == "user" to exclude bot responses and internal routing events. It strips whitespace and skips empty payloads. The flattened output includes conversationId for traceability, channel for platform-specific sentiment tuning, and extractedAt for pipeline auditing. This structure maps directly to CSV, JSONL, or database schemas used by NLP frameworks like spaCy, Hugging Face Transformers, or cloud sentiment APIs.
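
The filtering and flattening rules can also be exercised without any network access. The sketch below pulls them into a pure function, flatten_user_entries (a hypothetical name), and runs it on made-up sample entries that follow the field names described above; the sample values are illustrative and not real API output.

```python
from typing import Dict, Iterable, List


def flatten_user_entries(conversation: Dict, entries: Iterable[Dict]) -> List[Dict]:
    """Keep non-empty user utterances and attach conversation-level fields."""
    rows: List[Dict] = []
    for entry in entries:
        if entry.get("type") != "user":
            continue  # bot responses and system events are dropped
        text = (entry.get("text") or "").strip()
        if not text:
            continue  # whitespace-only or null payloads are dropped
        rows.append({
            "conversationId": conversation.get("conversationId"),
            "channel": conversation.get("channel", "unknown"),
            "timestamp": entry.get("timestamp"),
            "userText": text,
        })
    return rows


# Made-up sample data for illustration only.
sample_conversation = {"conversationId": "conv-001", "channel": "webchat"}
sample_entries = [
    {"type": "bot", "timestamp": "2024-01-01T10:00:00Z", "text": "How can I help?"},
    {"type": "user", "timestamp": "2024-01-01T10:00:05Z", "text": "  My order is late.  "},
    {"type": "system", "timestamp": "2024-01-01T10:00:06Z", "text": "handover"},
    {"type": "user", "timestamp": "2024-01-01T10:00:09Z", "text": "   "},
    {"type": "user", "timestamp": "2024-01-01T10:00:12Z", "text": None},
]
rows = flatten_user_entries(sample_conversation, sample_entries)
print(rows)
```

Of the five sample entries only the second survives: it is the single user entry with non-empty text, and its surrounding whitespace is stripped.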

Complete Working Example

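The following script ties the components together. It imports TranscriptProcessor from a module named cognigy_extractor, so the code from the previous sections must be saved as cognigy_extractor.py next to it. The script loads environment variables, initializes the client, extracts transcripts, and writes the output to a JSONL file. JSONL suits sentiment pipelines because consumers can read it line by line, and writing with ensure_ascii=False keeps non-ASCII user text readable.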

#!/usr/bin/env python3
import os
import json
import logging
from datetime import datetime, timedelta, timezone
from dotenv import load_dotenv

from cognigy_extractor import TranscriptProcessor

load_dotenv()

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

def main():
    tenant_url = os.getenv("COGNIGY_TENANT_URL")
    api_key = os.getenv("COGNIGY_API_KEY")
    output_file = os.getenv("OUTPUT_FILE", "cognigy_transcripts.jsonl")

    if not tenant_url or not api_key:
        raise ValueError("COGNIGY_TENANT_URL and COGNIGY_API_KEY are required")

    # Extraction window: the last 7 days, as timezone-aware UTC timestamps
    # (datetime.utcnow() is deprecated since Python 3.12)
    now = datetime.now(timezone.utc)
    to_date = now.isoformat()
    from_date = (now - timedelta(days=7)).isoformat()

    logger.info("Initializing Cognigy transcript extractor")
    processor = TranscriptProcessor(tenant_url=tenant_url, api_key=api_key, page_size=100)

    logger.info("Extracting transcripts from %s to %s", from_date, to_date)
    transcripts = processor.extract_user_transcripts(from_date=from_date, to_date=to_date)
    logger.info("Extraction complete. Total user messages: %d", len(transcripts))

    with open(output_file, "w", encoding="utf-8") as f:
        for record in transcripts:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    logger.info("Output written to %s", output_file)

if __name__ == "__main__":
    main()

To run the script, create a .env file with the following contents:

COGNIGY_TENANT_URL=https://your-tenant.cognigy.ai
COGNIGY_API_KEY=your_api_key_here
OUTPUT_FILE=cognigy_transcripts.jsonl

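Save the classes from the previous sections as cognigy_extractor.py and the script above as a separate file in the same directory, for example run_extraction.py (a placeholder name), then run python run_extraction.py. Running the script under the name cognigy_extractor.py would make it import itself. The script pages through the conversation list, backs off when rate limited, and writes one JSON object per line. To feed only the utterances to a sentiment tool, extract them with jq -r '.userText' cognigy_transcripts.jsonl.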
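
The example script collects every record in a list before writing. For large extractions, one alternative is to write records as they are produced and read them back line by line. The following is a minimal sketch of that idea, not the script's own method: write_jsonl and read_user_texts are hypothetical helper names, and the records are made up.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator


def write_jsonl(records: Iterable[Dict], path: str) -> int:
    """Write one JSON object per line as records arrive; return the count written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_user_texts(path: str) -> Iterator[str]:
    """Yield the userText field of each non-blank line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)["userText"]


# Made-up records standing in for extractor output; a generator is used
# so nothing is held in memory beyond the current record.
sample_texts = ["Where is my parcel?", "Danke, das war hilfreich für mich"]
records = (
    {"conversationId": f"conv-{i}", "userText": text}
    for i, text in enumerate(sample_texts)
)
path = os.path.join(tempfile.mkdtemp(), "transcripts.jsonl")
written = write_jsonl(records, path)
texts = list(read_user_texts(path))
print(written, texts)
```

Because write_jsonl accepts any iterable, it could consume a generator version of extract_user_transcripts directly, which would keep memory use independent of the number of conversations.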

Common Errors & Debugging

Error: HTTP 401 Unauthorized

  • Cause: The API key is invalid, expired, or missing the Bearer prefix in the header.
  • Fix: Regenerate the key in Cognigy Studio under Settings > API Keys. Verify the Authorization header matches Bearer <KEY> exactly. Do not include quotes around the key value.
  • Code verification: Inspect session.headers["Authorization"] before the first request to confirm formatting. Avoid writing the full key to shared logs; print only the prefix and the last few characters.
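
The header check above can be done without exposing the key. The sketch below is one possible approach: inspect_authorization is a hypothetical helper that returns a masked rendering of the header together with any formatting problems it finds. It treats the Bearer prefix as case-sensitive, matching the "exactly" in the fix above.

```python
from typing import List, Tuple


def inspect_authorization(header_value: str, visible: int = 4) -> Tuple[str, List[str]]:
    """Return a log-safe rendering of the header plus any formatting problems found."""
    problems: List[str] = []
    scheme, _, token = header_value.partition(" ")

    if scheme != "Bearer" or not token:
        problems.append("header must have the form 'Bearer <KEY>'")
    if token != token.strip() or token.startswith(("'", '"')) or token.endswith(("'", '"')):
        problems.append("key is wrapped in quotes or padded with whitespace")

    if not token:
        # Without a scheme/token split the whole value may be the key: show none of it.
        return "********", problems
    # Show only the last few characters, and nothing at all for very short tokens.
    tail = token[-visible:] if len(token) > visible else ""
    return f"{scheme} ********{tail}", problems


masked, problems = inspect_authorization("Bearer abcdef123456")
print(masked, problems)  # Bearer ********3456 []
```

In the tutorial's client this could be called as inspect_authorization(session.headers["Authorization"]) before the first request.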

Error: HTTP 403 Forbidden

  • Cause: The API key lacks Conversations: Read or Logs: Read permissions.
  • Fix: Open Cognigy Studio, navigate to Settings > API Keys, and enable the Conversation and Log read scopes. Save the key and retry.
  • Debugging tip: Test the key against a lightweight endpoint first: GET /api/v1/conversations?page=1&pageSize=1. A 403 here confirms scope misconfiguration.

Error: HTTP 429 Too Many Requests

  • Cause: The script exceeds the tenant rate limit. Cognigy typically allows one hundred to two hundred requests per minute per API key.
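  • Fix: The provided CognigyAPIClient already handles this via Retry-After parsing and exponential backoff. If failures persist, increase base_delay or max_retries in the client. Keep page_size at the maximum of one hundred, because smaller pages require more list requests and make rate limiting worse.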
  • Code verification: Monitor the warning logs. If you see repeated Rate limited messages, add a fixed time.sleep(1) between conversation log fetches to smooth request bursts.
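
A fixed time.sleep(1) always waits a full second, even when the previous request was itself slow. A small throttle that sleeps only for the remaining part of the interval is one alternative. The sketch below is illustrative: Throttle is a hypothetical helper that is not part of the client above, and its clock and sleep parameters exist so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # None until the first call

    def wait(self) -> float:
        """Sleep only as long as needed; return the seconds actually slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Record the time after any sleep so the next interval starts here
        self._last_call = self._clock()
        return slept


throttle = Throttle(min_interval=0.05)
first = throttle.wait()   # no previous call, so no sleep
second = throttle.wait()  # sleeps for whatever is left of the 50 ms interval
```

In extract_user_transcripts, a throttle created once (for example Throttle(1.0)) could be invoked with throttle.wait() immediately before each log request.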

Error: Pagination Loop Never Terminates

  • Cause: totalItems changes during execution because new conversations complete while the script runs, or the API returns inconsistent pagination metadata.
  • Fix: The generator checks len(data) < self.page_size as a hard termination condition. If the loop stalls, verify that pageSize does not exceed one hundred. Cognigy silently caps larger values, which breaks offset calculations.
  • Debugging tip: Log page and total_items on each iteration. If total_items grows indefinitely, switch to a time-windowed extraction strategy with fixed from and to parameters.
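
The time-windowed strategy mentioned above can be sketched as a generator of fixed, consecutive from/to pairs. This is a minimal illustration: iter_windows is a hypothetical helper, and whether the API treats the from and to bounds as inclusive is an assumption to verify, since adjacent windows share a boundary timestamp and may need deduplication on conversationId.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def iter_windows(start: datetime, end: datetime, step: timedelta) -> Iterator[Tuple[str, str]]:
    """Yield consecutive (from, to) ISO 8601 pairs that together cover start..end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # the last window is clipped to end
        yield cursor.isoformat(), upper.isoformat()
        cursor = upper


end = datetime(2024, 1, 8, tzinfo=timezone.utc)
start = end - timedelta(days=7)
windows = list(iter_windows(start, end, timedelta(days=1)))
print(len(windows), windows[0])
# 7 ('2024-01-01T00:00:00+00:00', '2024-01-02T00:00:00+00:00')
```

Each pair can be passed as from_date and to_date to list_conversations. Because every window has fixed bounds in the past, its totalItems should stay stable while it is being paged.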
