Automating NICE Cognigy.AI Intent Clustering with Python and K-Means Vectorization

What You Will Build

  • A Python script that extracts raw conversation utterances from Cognigy.AI, vectorizes them, clusters similar phrases using K-Means, and exports a labeled training dataset for NLU model retraining.
  • This workflow uses the Cognigy.AI REST API (/api/v1/logs) combined with sentence-transformers and scikit-learn.
  • The implementation is written in Python 3.9+ and relies on standard data science libraries for vectorization and clustering.

Prerequisites

  • Cognigy.AI account with API access enabled. Required API permissions: logs:read and nlu:read.
  • Cognigy.AI REST API v1 (logs endpoint).
  • Python 3.9 or higher.
  • External dependencies: requests, pandas, scikit-learn, sentence-transformers, numpy.
  • Install dependencies via pip: pip install requests pandas scikit-learn sentence-transformers numpy

Authentication Setup

Cognigy.AI uses credential-based authentication rather than standard OAuth 2.0. The API expects a Base64-encoded username:password string in the Authorization header (HTTP Basic authentication). You must generate an API user in the Cognigy.AI console with read permissions for logs and NLU data. The following session manager handles credential encoding, reuses the same headers for every request, and retries automatically on rate limits and server errors.

import base64
import time
import requests
from typing import Optional, Dict, Any

class CognigySession:
    def __init__(self, account: str, username: str, password: str):
        self.base_url = f"https://{account}.cognigy.ai/api/v1"
        credentials = f"{username}:{password}"
        self.auth_header = "Basic " + base64.b64encode(credentials.encode()).decode()
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": self.auth_header,
            "Content-Type": "application/json",
            "Accept": "application/json"
        })

    def _request_with_retry(self, method: str, endpoint: str, params: Optional[Dict] = None, max_retries: int = 3) -> requests.Response:
        url = f"{self.base_url}{endpoint}"
        for attempt in range(max_retries):
            response = self.session.request(method, url, params=params)
            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
                print(f"Rate limited. Retrying in {retry_after} seconds...")
                time.sleep(retry_after)
                continue
            if response.status_code >= 500:
                time.sleep(2 ** attempt)
                continue
            return response
        raise requests.HTTPError(f"Failed after {max_retries} retries: {response.status_code}")

    def get(self, endpoint: str, params: Optional[Dict] = None) -> requests.Response:
        return self._request_with_retry("GET", endpoint, params)

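The session object sets its headers once and reuses them for every request. On a 429 response it waits for the number of seconds given in the Retry-After header, falling back to exponential backoff when the header is absent; 5xx responses always use exponential backoff. This prevents cascade failures during bulk log extraction.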
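
The delay calculation can also be isolated into a small helper, which makes it easier to check on its own. The sketch below is a variant of the logic above, not part of the original class: retry_delay is a hypothetical name, and both the cap ceiling and the handling of a date-valued Retry-After header (which HTTP permits) are assumptions added for illustration.

```python
import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None, cap: float = 60.0) -> float:
    """Return seconds to wait before retry number `attempt` (0-based).

    Prefers the server's Retry-After value when it can be parsed,
    otherwise falls back to exponential backoff (1, 2, 4, ... seconds).
    """
    if retry_after:
        try:
            # Most servers send Retry-After as a number of seconds.
            return min(max(float(retry_after), 0.0), cap)
        except ValueError:
            pass
        try:
            # HTTP also permits an absolute date in this header.
            target = parsedate_to_datetime(retry_after)
            now = datetime.datetime.now(datetime.timezone.utc)
            return min(max((target - now).total_seconds(), 0.0), cap)
        except (TypeError, ValueError):
            pass
    # No usable header: exponential backoff, limited by `cap`.
    return min(float(2 ** attempt), cap)
```

Inside _request_with_retry, a call such as time.sleep(retry_delay(attempt, response.headers.get("Retry-After"))) could replace the two separate sleep calls.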

Implementation

Step 1: Fetch Raw Utterance Logs via Pagination

The Cognigy.AI /api/v1/logs endpoint returns conversation records in paginated batches. You must specify limit and offset to retrieve all records. The API caps limit at 1000. The following function loops until the returned array length falls below the requested limit, indicating the final page.

def fetch_utterance_logs(session: CognigySession, days_back: int = 30) -> list[Dict[str, Any]]:
    import datetime
    # "from" is a reserved word in Python, so the window bounds use other names.
    now = datetime.datetime.now(datetime.timezone.utc)
    date_from = (now - datetime.timedelta(days=days_back)).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    date_to = now.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    
    all_logs = []
    offset = 0
    limit = 1000
    
    while True:
        # The query parameters themselves are still named "from" and "to".
        params = {"from": date_from, "to": date_to, "limit": limit, "offset": offset}
        response = session.get("/logs", params=params)
        response.raise_for_status()
        
        data = response.json()
        if not data:
            break
            
        all_logs.extend(data)
        print(f"Fetched {len(data)} logs. Total: {len(all_logs)}")
        
        if len(data) < limit:
            break
        offset += limit
        
    return all_logs

Expected Response Structure:

[
  {
    "id": "log-8f3a2c1d-9e4b-4f1a-8c2d-7e5f6a9b0c1d",
    "timestamp": "2024-05-12T14:32:10.000Z",
    "utterance": "I want to cancel my subscription",
    "intent": "cancel_subscription",
    "confidence": 0.92,
    "entities": [],
    "channel": "webchat"
  }
]

The endpoint requires the logs:read permission. If you receive a 403 response, verify that the API user has the correct role assigned in the Cognigy.AI admin console. The pagination loop terminates when the API returns fewer records than the limit parameter, which is the standard behavior for Cognigy.AI endpoints.
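
The termination rule described above can be exercised without network access. The following sketch is illustrative only: paginate and fake_fetch are hypothetical names, and fetch_page stands in for a call such as session.get("/logs", params=...).json().

```python
from typing import Any, Callable, Dict, Iterator, List

Page = List[Dict[str, Any]]


def paginate(fetch_page: Callable[[int, int], Page], limit: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield records page by page until a short or empty page arrives."""
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        yield from page
        # A page shorter than `limit` (including an empty one) is the last page.
        if len(page) < limit:
            break
        offset += limit


# Stand-in data source: 25 fake records served 10 at a time.
records = [{"id": i} for i in range(25)]
calls: List[int] = []


def fake_fetch(offset: int, limit: int) -> Page:
    calls.append(offset)
    return records[offset:offset + limit]


fetched = list(paginate(fake_fetch, limit=10))
print(len(fetched), calls)  # prints: 25 [0, 10, 20]
```

When the total record count is an exact multiple of limit, the loop makes one extra request that returns an empty page, which is why the empty-response check in fetch_utterance_logs matters.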

Step 2: Generate Embeddings and Apply K-Means Clustering

Cognigy.AI logs do not expose raw NLU embedding vectors. You must generate embeddings locally before clustering. The sentence-transformers library provides pre-trained models that convert text to dense vectors. K-Means then groups similar utterances by minimizing intra-cluster variance.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

def cluster_utterances(logs: list[Dict[str, Any]], n_clusters: int = 10) -> pd.DataFrame:
    utterances = [log["utterance"] for log in logs if log.get("utterance")]
    if not utterances:
        raise ValueError("No valid utterances found in logs.")
        
    print("Generating embeddings...")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(utterances, show_progress_bar=True, normalize_embeddings=True)
    
    print(f"Running K-Means with {n_clusters} clusters...")
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto", max_iter=300)
    cluster_labels = kmeans.fit_predict(embeddings)
    
    df = pd.DataFrame({
        "utterance": utterances,
        "original_intent": [log.get("intent", "unknown") for log in logs if log.get("utterance")],
        "cluster_id": cluster_labels,
        "embedding_vector": embeddings.tolist()
    })
    
    return df

Parameter Explanation:

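  • normalize_embeddings=True scales every embedding to unit length. For unit vectors, squared Euclidean distance equals 2 - 2 * (cosine similarity), so the Euclidean objective that K-Means minimizes effectively groups utterances by cosine similarity.
  • n_init="auto" (available in scikit-learn 1.2 and later) lets scikit-learn choose the number of initializations, which is a single run with the default k-means++ initialization. For production runs, set n_init=10 explicitly so that K-Means keeps the best of ten runs and is less likely to settle in a poor local minimum.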
  • n_clusters should match the number of distinct intents you expect. You can determine this value using the elbow method on the inertia_ attribute before final clustering.
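
One way to read the elbow off an inertia curve is to look for the sharpest bend. The sketch below uses the largest second difference, which is only one of several common heuristics and assumes evenly spaced k values; pick_elbow is a hypothetical helper and the inertia numbers are made up for illustration.

```python
from typing import Sequence


def pick_elbow(ks: Sequence[int], inertias: Sequence[float]) -> int:
    """Return the k where the inertia curve bends most sharply.

    Uses the largest second difference: the drop gained by reaching k
    minus the drop gained by going one step further.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("Need at least three (k, inertia) pairs of equal length.")
    best_k, best_bend = ks[1], float("-inf")
    for i in range(1, len(ks) - 1):
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical inertia values; in practice each one would come from
# KMeans(n_clusters=k, n_init=10, random_state=42).fit(embeddings).inertia_
ks = [2, 3, 4, 5, 6]
inertias = [100.0, 60.0, 30.0, 27.0, 25.0]
print(pick_elbow(ks, inertias))  # prints 4
```

Treat the result as a starting point and confirm it by inspecting a few clusters by hand.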

Edge cases include empty utterances and highly skewed intent distributions. The code filters out missing utterances before vectorization. If a cluster contains fewer than 5 samples, it likely represents noise or rare edge cases. You should flag these clusters for manual review before retraining.
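
Flagging undersized clusters can be done directly on the label array before any intent mapping. This is a minimal sketch: small_clusters is a hypothetical helper, and the threshold of 5 simply mirrors the figure mentioned above.

```python
from collections import Counter
from typing import Iterable, List


def small_clusters(cluster_labels: Iterable[int], min_size: int = 5) -> List[int]:
    """Return the sorted IDs of clusters with fewer than `min_size` members."""
    counts = Counter(int(label) for label in cluster_labels)
    return sorted(cid for cid, n in counts.items() if n < min_size)


# Made-up labels: cluster 0 has 8 members, 1 has 2, 2 has 5, 3 has 1.
labels = [0] * 8 + [1] * 2 + [2] * 5 + [3]
print(small_clusters(labels))  # prints [1, 3]
```

The function accepts any iterable of integer-like labels, so the array returned by kmeans.fit_predict can be passed in as is.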

Step 3: Map Clusters to Intents and Export Training Set

After clustering, you must assign each cluster a representative intent. The majority voting approach selects the most frequent original_intent within each cluster. The final dataset contains the utterance, the assigned cluster label, and the suggested intent for Cognigy.AI retraining.

def generate_training_set(df: pd.DataFrame, min_cluster_size: int = 5) -> pd.DataFrame:
    cluster_intent_map = {}
    
    for cluster_id in df["cluster_id"].unique():
        cluster_df = df[df["cluster_id"] == cluster_id]
        if len(cluster_df) < min_cluster_size:
            cluster_intent_map[cluster_id] = "review_required"
            continue
            
        dominant_intent = cluster_df["original_intent"].mode()[0]
        cluster_intent_map[cluster_id] = dominant_intent
        
    df["suggested_intent"] = df["cluster_id"].map(cluster_intent_map)
    
    export_df = df[["utterance", "original_intent", "cluster_id", "suggested_intent"]].copy()
    export_df = export_df.sort_values("cluster_id")
    
    return export_df

This step produces a clean CSV-ready DataFrame. You can export it directly to Cognigy.AI’s NLU training format or import it via the console. The suggested_intent column replaces noisy or mislabeled original intents with cluster-validated labels. You should validate the mapping before bulk uploading to avoid overwriting correct training examples.
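
Before any bulk upload, it helps to see how many rows the clustering would actually relabel. The sketch below assumes rows shaped like export_df.to_dict("records"); relabel_report is a hypothetical helper and the sample values are invented.

```python
from typing import Dict, List

REVIEW_LABEL = "review_required"


def relabel_report(rows: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Split rows into unchanged, relabeled, and needs-review groups."""
    report: Dict[str, List[Dict[str, str]]] = {"unchanged": [], "relabeled": [], "review": []}
    for row in rows:
        if row["suggested_intent"] == REVIEW_LABEL:
            report["review"].append(row)
        elif row["suggested_intent"] != row["original_intent"]:
            report["relabeled"].append(row)
        else:
            report["unchanged"].append(row)
    return report


rows = [
    {"utterance": "cancel my plan", "original_intent": "cancel_subscription",
     "suggested_intent": "cancel_subscription"},
    {"utterance": "stop billing me", "original_intent": "billing_question",
     "suggested_intent": "cancel_subscription"},
    {"utterance": "asdf", "original_intent": "unknown",
     "suggested_intent": "review_required"},
]
report = relabel_report(rows)
print({name: len(group) for name, group in report.items()})
# prints {'unchanged': 1, 'relabeled': 1, 'review': 1}
```

Only the relabeled group changes existing training data, so that is the group worth reading line by line.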

Complete Working Example

import base64
import time
import datetime
import requests
import numpy as np
import pandas as pd
from typing import Optional, Dict, Any
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

class CognigySession:
    def __init__(self, account: str, username: str, password: str):
        self.base_url = f"https://{account}.cognigy.ai/api/v1"
        credentials = f"{username}:{password}"
        self.auth_header = "Basic " + base64.b64encode(credentials.encode()).decode()
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": self.auth_header,
            "Content-Type": "application/json",
            "Accept": "application/json"
        })

    def _request_with_retry(self, method: str, endpoint: str, params: Optional[Dict] = None, max_retries: int = 3) -> requests.Response:
        url = f"{self.base_url}{endpoint}"
        for attempt in range(max_retries):
            response = self.session.request(method, url, params=params)
            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
                print(f"Rate limited. Retrying in {retry_after} seconds...")
                time.sleep(retry_after)
                continue
            if response.status_code >= 500:
                time.sleep(2 ** attempt)
                continue
            return response
        raise requests.HTTPError(f"Failed after {max_retries} retries: {response.status_code}")

    def get(self, endpoint: str, params: Optional[Dict] = None) -> requests.Response:
        return self._request_with_retry("GET", endpoint, params)

def fetch_utterance_logs(session: CognigySession, days_back: int = 30) -> list[Dict[str, Any]]:
    # "from" is a reserved word in Python, so the window bounds use other names.
    now = datetime.datetime.now(datetime.timezone.utc)
    date_from = (now - datetime.timedelta(days=days_back)).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    date_to = now.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    
    all_logs = []
    offset = 0
    limit = 1000
    
    while True:
        # The query parameters themselves are still named "from" and "to".
        params = {"from": date_from, "to": date_to, "limit": limit, "offset": offset}
        response = session.get("/logs", params=params)
        response.raise_for_status()
        
        data = response.json()
        if not data:
            break
            
        all_logs.extend(data)
        print(f"Fetched {len(data)} logs. Total: {len(all_logs)}")
        
        if len(data) < limit:
            break
        offset += limit
        
    return all_logs

def cluster_utterances(logs: list[Dict[str, Any]], n_clusters: int = 10) -> pd.DataFrame:
    utterances = [log["utterance"] for log in logs if log.get("utterance")]
    if not utterances:
        raise ValueError("No valid utterances found in logs.")
        
    print("Generating embeddings...")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(utterances, show_progress_bar=True, normalize_embeddings=True)
    
    print(f"Running K-Means with {n_clusters} clusters...")
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto", max_iter=300)
    cluster_labels = kmeans.fit_predict(embeddings)
    
    df = pd.DataFrame({
        "utterance": utterances,
        "original_intent": [log.get("intent", "unknown") for log in logs if log.get("utterance")],
        "cluster_id": cluster_labels,
        "embedding_vector": embeddings.tolist()
    })
    
    return df

def generate_training_set(df: pd.DataFrame, min_cluster_size: int = 5) -> pd.DataFrame:
    cluster_intent_map = {}
    
    for cluster_id in df["cluster_id"].unique():
        cluster_df = df[df["cluster_id"] == cluster_id]
        if len(cluster_df) < min_cluster_size:
            cluster_intent_map[cluster_id] = "review_required"
            continue
            
        dominant_intent = cluster_df["original_intent"].mode()[0]
        cluster_intent_map[cluster_id] = dominant_intent
        
    df["suggested_intent"] = df["cluster_id"].map(cluster_intent_map)
    
    export_df = df[["utterance", "original_intent", "cluster_id", "suggested_intent"]].copy()
    export_df = export_df.sort_values("cluster_id")
    
    return export_df

if __name__ == "__main__":
    ACCOUNT = "your-account"
    USERNAME = "api_user"
    PASSWORD = "api_password"
    
    session = CognigySession(ACCOUNT, USERNAME, PASSWORD)
    
    print("Fetching logs...")
    logs = fetch_utterance_logs(session, days_back=30)
    
    print("Clustering utterances...")
    clustered_df = cluster_utterances(logs, n_clusters=12)
    
    print("Generating training set...")
    training_df = generate_training_set(clustered_df, min_cluster_size=5)
    
    output_file = "cognigy_training_set.csv"
    training_df.to_csv(output_file, index=False, quoting=1)  # quoting=1 is csv.QUOTE_ALL
    print(f"Training set exported to {output_file}")

Replace ACCOUNT, USERNAME, and PASSWORD with your Cognigy.AI credentials. The script runs end-to-end: authentication, log extraction, vectorization, clustering, and CSV export. It requires approximately 200 MB of RAM for 50,000 utterances using the MiniLM model.
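
Rather than editing the constants in the script, the credentials could be read from the environment so they never land in version control. This is a sketch under assumptions: the variable names COGNIGY_ACCOUNT, COGNIGY_USERNAME, and COGNIGY_PASSWORD and the load_credentials helper are hypothetical, not something Cognigy.AI defines.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names chosen for this example.
REQUIRED_VARS = ("COGNIGY_ACCOUNT", "COGNIGY_USERNAME", "COGNIGY_PASSWORD")


def load_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the three connection settings, failing early if any is unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "account": env["COGNIGY_ACCOUNT"],
        "username": env["COGNIGY_USERNAME"],
        "password": env["COGNIGY_PASSWORD"],
    }
```

The returned keys match the CognigySession constructor, so session = CognigySession(**load_credentials()) would replace the three constants.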

Common Errors & Debugging

Error: 401 Unauthorized

  • Cause: Invalid API credentials or missing Base64 encoding in the Authorization header.
  • Fix: Verify the username and password match an API user in Cognigy.AI. Ensure the user has logs:read permissions. Test the credentials with a direct curl command before running the script.
  • Code Fix: The CognigySession class automatically encodes credentials. If you receive 401, print the decoded header to verify formatting.
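
When checking the header format, it is safer to mask the password than to print it in full. The helper below is a hypothetical sketch, not part of CognigySession; it decodes a Basic header and reports the username with the password replaced by asterisks.

```python
import base64


def describe_basic_header(header: str) -> str:
    """Decode a Basic Authorization header and mask the password."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic" or not token:
        raise ValueError("Expected a header of the form 'Basic <base64>'.")
    # validate=True rejects characters outside the Base64 alphabet.
    decoded = base64.b64decode(token, validate=True).decode("utf-8")
    username, sep, password = decoded.partition(":")
    if not sep:
        raise ValueError("Decoded credentials are missing the ':' separator.")
    return f"{username}:{'*' * len(password)}"


header = "Basic " + base64.b64encode(b"api_user:api_password").decode()
print(describe_basic_header(header))  # username, then the password as asterisks
```

Calling describe_basic_header(session.auth_header) on a CognigySession instance would confirm that the username is what you expect and that the separator survived encoding.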

Error: 429 Too Many Requests

  • Cause: Exceeding Cognigy.AI rate limits during bulk log extraction. The API enforces per-user request throttling.
  • Fix: The _request_with_retry method implements exponential backoff. If failures persist, reduce the extraction window (days_back) or add a fixed delay between pagination loops.
  • Code Fix: Raise the max_retries default on the _request_with_retry method (it is a parameter of that method, not of the session initializer) or increase the base retry delay.

Error: MemoryError during embedding generation

  • Cause: Loading 100,000+ utterances into RAM for vectorization. The sentence-transformers model loads embeddings as a contiguous NumPy array.
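  • Fix: Process logs in batches. Split the utterance list into chunks of 10,000 items, encode each chunk, and concatenate the vectors before clustering. Storing the vectors in the embedding_vector column as Python lists also costs memory, so drop that column if you do not need it.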
  • Code Fix: Replace the single model.encode() call with a loop that appends to a list, then stack with np.vstack().
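
The batching described above needs little more than a slicing helper. In this sketch chunked is a hypothetical name, and the commented lines show how it might feed model.encode and np.vstack; they are not executed here so that the example runs without downloading a model.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int = 10_000) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Intended use with the objects from cluster_utterances (not executed here):
#   parts = [model.encode(chunk, normalize_embeddings=True) for chunk in chunked(utterances)]
#   embeddings = np.vstack(parts)
sizes = [len(chunk) for chunk in chunked([f"utterance {i}" for i in range(25)], size=10)]
print(sizes)  # prints [10, 10, 5]
```

The stacked array still has to fit in memory for K-Means, so chunking mainly lowers the peak during encoding rather than the final footprint.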

Error: ConvergenceWarning from K-Means

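  • Cause: The algorithm failed to converge within max_iter iterations. This occurs when clusters are poorly separated or n_clusters is too high for the data distribution. scikit-learn also raises this warning when it finds fewer distinct clusters than n_clusters, which typically happens when the logs contain many duplicate utterances.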
  • Fix: Increase max_iter to 500 or reduce n_clusters. Inspect the inertia_ attribute to locate the elbow point.
  • Code Fix: Add n_init=10 explicitly to force multiple initializations and improve stability.

Error: Empty or malformed CSV output

  • Cause: Filtering logic removed all utterances, or the logs endpoint returned metadata instead of conversation records.
  • Fix: Verify the from and to parameters use ISO 8601 format with Z suffix. Add a validation step that checks len(logs) > 0 before proceeding to clustering.
  • Code Fix: Insert assert len(logs) > 0, "No logs retrieved. Check date range and permissions." after the fetch step.
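
Because assert statements are skipped when Python runs with the -O flag, an explicit check may be preferable in scheduled jobs. The following sketch is one possible shape for it; validate_logs is a hypothetical helper, and the second check for records without an utterance field is an addition beyond the fix described above.

```python
from typing import Any, Dict, List


def validate_logs(logs: List[Dict[str, Any]]) -> int:
    """Return the number of usable utterances, or raise with a clear message."""
    if not logs:
        raise ValueError("No logs retrieved. Check date range and permissions.")
    # Count records that are dictionaries with a non-empty utterance.
    usable = sum(1 for log in logs if isinstance(log, dict) and log.get("utterance"))
    if usable == 0:
        raise ValueError("Logs were retrieved, but none contain an 'utterance' field.")
    return usable


sample = [{"utterance": "I want to cancel my subscription"}, {"utterance": ""}, {"id": "x"}]
print(validate_logs(sample))  # prints 1
```

Calling validate_logs(logs) right after fetch_utterance_logs stops the script before the embedding model is loaded.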

Official References