Automating NICE Cognigy.AI Intent Clustering with Python and K-Means Vectorization
What You Will Build
- A Python script that extracts raw conversation utterances from Cognigy.AI, vectorizes them, clusters similar phrases using K-Means, and exports a labeled training dataset for NLU model retraining.
- This workflow uses the Cognigy.AI REST API (/api/v1/logs) combined with sentence-transformers and scikit-learn.
- The implementation is written in Python 3.9+ and relies on standard data science libraries for vectorization and clustering.
Prerequisites
- Cognigy.AI account with API access enabled. Required API permissions: logs:read and nlu:read.
- Cognigy.AI REST API v1 (logs endpoint).
- Python 3.9 or higher.
- External dependencies: requests, pandas, scikit-learn, sentence-transformers, numpy.
- Install dependencies via pip:
pip install requests pandas scikit-learn sentence-transformers numpy
Authentication Setup
Cognigy.AI uses credential-based authentication rather than standard OAuth 2.0. The API expects a Base64-encoded username:password string in the Authorization header. You must generate an API user in the Cognigy.AI console with read permissions for logs and NLU data. The following session manager encodes the credentials once, reuses them on a persistent requests.Session, and retries rate-limited or failed requests with backoff.
import base64
import time
import requests
from typing import Optional, Dict, Any

class CognigySession:
    def __init__(self, account: str, username: str, password: str):
        self.base_url = f"https://{account}.cognigy.ai/api/v1"
        credentials = f"{username}:{password}"
        self.auth_header = "Basic " + base64.b64encode(credentials.encode()).decode()
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": self.auth_header,
            "Content-Type": "application/json",
            "Accept": "application/json"
        })

    def _request_with_retry(self, method: str, endpoint: str, params: Optional[Dict] = None, max_retries: int = 3) -> requests.Response:
        url = f"{self.base_url}{endpoint}"
        for attempt in range(max_retries):
            response = self.session.request(method, url, params=params)
            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
                print(f"Rate limited. Retrying in {retry_after} seconds...")
                time.sleep(retry_after)
                continue
            if response.status_code >= 500:
                time.sleep(2 ** attempt)
                continue
            return response
        raise requests.HTTPError(f"Failed after {max_retries} retries: {response.status_code}")

    def get(self, endpoint: str, params: Optional[Dict] = None) -> requests.Response:
        return self._request_with_retry("GET", endpoint, params)
The session object caches headers and applies exponential backoff for 429 and 5xx responses. This prevents cascade failures during bulk log extraction.
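One detail the retry logic leaves open is that HTTP allows Retry-After to carry either a number of seconds or an HTTP date, and calling int() on a date string raises ValueError. The sketch below shows one way to compute the wait time defensively. The helper name retry_delay and the 60-second cap are assumptions for illustration, not part of the Cognigy.AI API or of the session class above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay(headers: Mapping[str, str], attempt: int,
                now: Optional[datetime] = None, cap: float = 60.0) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based)."""
    fallback = float(min(cap, 2 ** attempt))  # exponential backoff: 1, 2, 4, ...
    raw = headers.get("Retry-After")
    if raw is None:
        return fallback
    raw = raw.strip()
    if raw.isdigit():
        # Delta-seconds form, e.g. "Retry-After: 5"
        return float(min(cap, int(raw)))
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return fallback  # unparseable header: fall back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return float(min(cap, max(0.0, (when - now).total_seconds())))


print(retry_delay({}, attempt=2))
print(retry_delay({"Retry-After": "5"}, attempt=0))
```

Inside _request_with_retry, the call would take the form retry_delay(response.headers, attempt). Capping the delay keeps a misconfigured header from stalling a bulk export indefinitely.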
Implementation
Step 1: Fetch Raw Utterance Logs via Pagination
The Cognigy.AI /api/v1/logs endpoint returns conversation records in paginated batches. You must specify limit and offset to retrieve all records. The API caps limit at 1000. The following function loops until the returned array length falls below the requested limit, indicating the final page.
def fetch_utterance_logs(session: CognigySession, days_back: int = 30) -> list[Dict[str, Any]]:
    import datetime
    # "from" is a reserved keyword in Python, so the date bounds use other names.
    now = datetime.datetime.now(datetime.timezone.utc)
    date_from = (now - datetime.timedelta(days=days_back)).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    date_to = now.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    all_logs = []
    offset = 0
    limit = 1000
    while True:
        params = {"from": date_from, "to": date_to, "limit": limit, "offset": offset}
        response = session.get("/logs", params=params)
        response.raise_for_status()
        data = response.json()
        if not data:
            break
        all_logs.extend(data)
        print(f"Fetched {len(data)} logs. Total: {len(all_logs)}")
        # A short page means there is nothing left to fetch.
        if len(data) < limit:
            break
        offset += limit
    return all_logs
Expected Response Structure:
[
{
"id": "log-8f3a2c1d-9e4b-4f1a-8c2d-7e5f6a9b0c1d",
"timestamp": "2024-05-12T14:32:10.000Z",
"utterance": "I want to cancel my subscription",
"intent": "cancel_subscription",
"confidence": 0.92,
"entities": [],
"channel": "webchat"
}
]
The endpoint requires the logs:read permission. If you receive a 403 response, verify that the API user has the correct role assigned in the Cognigy.AI admin console. The pagination loop terminates when the API returns fewer records than the limit parameter, which is the standard behavior for Cognigy.AI endpoints.
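The termination rule can be checked without touching the API. The following sketch separates the paging loop from the HTTP call and runs it against an in-memory stand-in. The names paginate, fake_fetch, and max_pages are hypothetical, and the stand-in only mimics the limit/offset behavior described above.

```python
from typing import Any, Callable, Dict, Iterator, List

Page = List[Dict[str, Any]]


def paginate(fetch_page: Callable[[int, int], Page], limit: int = 1000,
             max_pages: int = 10_000) -> Iterator[Dict[str, Any]]:
    """Yield records page by page until a short or empty page arrives."""
    offset = 0
    for _ in range(max_pages):
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            return  # final page reached
        offset += limit
    # Safety net in case an endpoint ignores the offset and never returns a short page.
    raise RuntimeError(f"Stopped after {max_pages} pages; check the paging behavior.")


# Stand-in for the API: 2,500 fake records served in limit/offset slices.
FAKE_RECORDS = [{"id": i, "utterance": f"utterance {i}"} for i in range(2500)]
requested_offsets: List[int] = []


def fake_fetch(limit: int, offset: int) -> Page:
    requested_offsets.append(offset)
    return FAKE_RECORDS[offset:offset + limit]


records = list(paginate(fake_fetch, limit=1000))
print(len(records), requested_offsets)
```

When the total is an exact multiple of limit, the loop issues one extra request that returns an empty page. That is the cost of not knowing the total count in advance.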
Step 2: Generate Embeddings and Apply K-Means Clustering
Cognigy.AI logs do not expose raw NLU embedding vectors. You must generate embeddings locally before clustering. The sentence-transformers library provides pre-trained models that convert text to dense vectors. K-Means then groups similar utterances by minimizing intra-cluster variance.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

def cluster_utterances(logs: list[Dict[str, Any]], n_clusters: int = 10) -> pd.DataFrame:
    # Filter once so utterances and intents stay aligned row by row.
    valid_logs = [log for log in logs if log.get("utterance")]
    if not valid_logs:
        raise ValueError("No valid utterances found in logs.")
    utterances = [log["utterance"] for log in valid_logs]
    print("Generating embeddings...")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(utterances, show_progress_bar=True, normalize_embeddings=True)
    print(f"Running K-Means with {n_clusters} clusters...")
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto", max_iter=300)
    cluster_labels = kmeans.fit_predict(embeddings)
    df = pd.DataFrame({
        "utterance": utterances,
        # Treat a missing or null intent as "unknown".
        "original_intent": [log.get("intent") or "unknown" for log in valid_logs],
        "cluster_id": cluster_labels,
        "embedding_vector": embeddings.tolist()
    })
    return df
Parameter Explanation:
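- normalize_embeddings=True scales every vector to unit length. For unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so K-Means, which minimizes Euclidean distance, effectively groups utterances by cosine similarity.
- n_init="auto" (available in scikit-learn 1.2 and later) lets scikit-learn choose the number of initializations, which is a single run with the default k-means++ initialization. For production runs with high-dimensional data, set n_init=10 explicitly so the best of several starts is kept, which reduces the risk of a poor local optimum.
- n_clusters should match the number of distinct intents you expect. You can estimate this value using the elbow method on the inertia_ attribute before final clustering.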
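The elbow method mentioned for n_clusters can be sketched as follows. This is a minimal illustration under stated assumptions: the vectors are synthetic blobs standing in for utterance embeddings, the helper names inertia_curve and pick_elbow are hypothetical, and the largest second difference is only one of several heuristics for locating the bend.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_curve(vectors: np.ndarray, k_values) -> dict:
    """Fit K-Means once per k and record the within-cluster sum of squares."""
    curve = {}
    for k in k_values:
        model = KMeans(n_clusters=k, random_state=42, n_init=10)
        model.fit(vectors)
        curve[k] = float(model.inertia_)
    return curve


def pick_elbow(curve: dict) -> int:
    """Return the k where the curve bends most sharply (largest second difference)."""
    ks = sorted(curve)
    values = np.array([curve[k] for k in ks])
    second_diff = values[:-2] - 2 * values[1:-1] + values[2:]
    return ks[int(np.argmax(second_diff)) + 1]


# Synthetic stand-in for embeddings: four equally spaced, well-separated groups.
centers = 10 * np.eye(4, 16)
vectors, _ = make_blobs(n_samples=400, centers=centers,
                        cluster_std=0.5, random_state=0)

curve = inertia_curve(vectors, range(2, 9))
for k, inertia in curve.items():
    print(f"k={k}: inertia={inertia:.1f}")
print("Suggested n_clusters:", pick_elbow(curve))
```

Real utterance embeddings rarely produce a bend this clean, so treat the suggested value as a starting point and confirm it by reading sample utterances from each cluster.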
Edge cases include empty utterances and highly skewed intent distributions. The code filters out missing utterances before vectorization. If a cluster contains fewer than 5 samples, it likely represents noise or rare edge cases. You should flag these clusters for manual review before retraining.
Step 3: Map Clusters to Intents and Export Training Set
After clustering, you must assign each cluster a representative intent. The majority voting approach selects the most frequent original_intent within each cluster. The final dataset contains the utterance, the assigned cluster label, and the suggested intent for Cognigy.AI retraining.
def generate_training_set(df: pd.DataFrame, min_cluster_size: int = 5) -> pd.DataFrame:
    cluster_intent_map = {}
    for cluster_id in df["cluster_id"].unique():
        cluster_df = df[df["cluster_id"] == cluster_id]
        if len(cluster_df) < min_cluster_size:
            cluster_intent_map[cluster_id] = "review_required"
            continue
        dominant_intent = cluster_df["original_intent"].mode()[0]
        cluster_intent_map[cluster_id] = dominant_intent
    df["suggested_intent"] = df["cluster_id"].map(cluster_intent_map)
    export_df = df[["utterance", "original_intent", "cluster_id", "suggested_intent"]].copy()
    export_df = export_df.sort_values("cluster_id")
    return export_df
This step produces a clean CSV-ready DataFrame. You can export it directly to Cognigy.AI’s NLU training format or import it via the console. The suggested_intent column replaces noisy or mislabeled original intents with cluster-validated labels. You should validate the mapping before bulk uploading to avoid overwriting correct training examples.
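One possible way to validate the mapping is to measure how strongly each cluster agrees on its dominant intent. The sketch below computes a per-cluster purity score on a small hand-made sample. The function name cluster_purity and the sample rows are illustrative assumptions, not part of the pipeline above.

```python
import pandas as pd


def cluster_purity(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each cluster: size, dominant intent, and the share of rows carrying it."""
    rows = []
    for cluster_id, group in df.groupby("cluster_id"):
        counts = group["original_intent"].value_counts()
        rows.append({
            "cluster_id": cluster_id,
            "size": len(group),
            "dominant_intent": counts.index[0],
            "purity": counts.iloc[0] / len(group),
        })
    # Least pure clusters first, since they deserve review before upload.
    return pd.DataFrame(rows).sort_values("purity").reset_index(drop=True)


sample = pd.DataFrame({
    "utterance": ["cancel my plan", "stop my subscription", "end my contract",
                  "please cancel it", "where is my order", "track my package",
                  "cancel the delivery"],
    "original_intent": ["cancel_subscription", "cancel_subscription", "unknown",
                        "cancel_subscription", "order_status", "order_status",
                        "cancel_subscription"],
    "cluster_id": [0, 0, 0, 0, 1, 1, 1],
})

report = cluster_purity(sample)
print(report)
```

A cluster with low purity mixes several original intents, so its majority label would overwrite a large share of rows. Reviewing those clusters by hand first limits the damage of a bad mapping.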
Complete Working Example
import base64
import time
import datetime
import requests
import numpy as np
import pandas as pd
from typing import Optional, Dict, Any
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
class CognigySession:
    def __init__(self, account: str, username: str, password: str):
        self.base_url = f"https://{account}.cognigy.ai/api/v1"
        credentials = f"{username}:{password}"
        self.auth_header = "Basic " + base64.b64encode(credentials.encode()).decode()
        self.session = requests.Session()
        self.session.headers.update({
            "Authorization": self.auth_header,
            "Content-Type": "application/json",
            "Accept": "application/json"
        })

    def _request_with_retry(self, method: str, endpoint: str, params: Optional[Dict] = None, max_retries: int = 3) -> requests.Response:
        url = f"{self.base_url}{endpoint}"
        for attempt in range(max_retries):
            response = self.session.request(method, url, params=params)
            if response.status_code == 429:
                retry_after = int(response.headers.get("Retry-After", 2 ** attempt))
                print(f"Rate limited. Retrying in {retry_after} seconds...")
                time.sleep(retry_after)
                continue
            if response.status_code >= 500:
                time.sleep(2 ** attempt)
                continue
            return response
        raise requests.HTTPError(f"Failed after {max_retries} retries: {response.status_code}")

    def get(self, endpoint: str, params: Optional[Dict] = None) -> requests.Response:
        return self._request_with_retry("GET", endpoint, params)

def fetch_utterance_logs(session: CognigySession, days_back: int = 30) -> list[Dict[str, Any]]:
    # "from" is a reserved keyword in Python, so the date bounds use other names.
    now = datetime.datetime.now(datetime.timezone.utc)
    date_from = (now - datetime.timedelta(days=days_back)).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    date_to = now.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    all_logs = []
    offset = 0
    limit = 1000
    while True:
        params = {"from": date_from, "to": date_to, "limit": limit, "offset": offset}
        response = session.get("/logs", params=params)
        response.raise_for_status()
        data = response.json()
        if not data:
            break
        all_logs.extend(data)
        print(f"Fetched {len(data)} logs. Total: {len(all_logs)}")
        # A short page means there is nothing left to fetch.
        if len(data) < limit:
            break
        offset += limit
    return all_logs

def cluster_utterances(logs: list[Dict[str, Any]], n_clusters: int = 10) -> pd.DataFrame:
    # Filter once so utterances and intents stay aligned row by row.
    valid_logs = [log for log in logs if log.get("utterance")]
    if not valid_logs:
        raise ValueError("No valid utterances found in logs.")
    utterances = [log["utterance"] for log in valid_logs]
    print("Generating embeddings...")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(utterances, show_progress_bar=True, normalize_embeddings=True)
    print(f"Running K-Means with {n_clusters} clusters...")
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init="auto", max_iter=300)
    cluster_labels = kmeans.fit_predict(embeddings)
    df = pd.DataFrame({
        "utterance": utterances,
        # Treat a missing or null intent as "unknown".
        "original_intent": [log.get("intent") or "unknown" for log in valid_logs],
        "cluster_id": cluster_labels,
        "embedding_vector": embeddings.tolist()
    })
    return df

def generate_training_set(df: pd.DataFrame, min_cluster_size: int = 5) -> pd.DataFrame:
    cluster_intent_map = {}
    for cluster_id in df["cluster_id"].unique():
        cluster_df = df[df["cluster_id"] == cluster_id]
        if len(cluster_df) < min_cluster_size:
            cluster_intent_map[cluster_id] = "review_required"
            continue
        dominant_intent = cluster_df["original_intent"].mode()[0]
        cluster_intent_map[cluster_id] = dominant_intent
    df["suggested_intent"] = df["cluster_id"].map(cluster_intent_map)
    export_df = df[["utterance", "original_intent", "cluster_id", "suggested_intent"]].copy()
    export_df = export_df.sort_values("cluster_id")
    return export_df

if __name__ == "__main__":
    ACCOUNT = "your-account"
    USERNAME = "api_user"
    PASSWORD = "api_password"
    session = CognigySession(ACCOUNT, USERNAME, PASSWORD)
    print("Fetching logs...")
    logs = fetch_utterance_logs(session, days_back=30)
    print("Clustering utterances...")
    clustered_df = cluster_utterances(logs, n_clusters=12)
    print("Generating training set...")
    training_df = generate_training_set(clustered_df, min_cluster_size=5)
    output_file = "cognigy_training_set.csv"
    # quoting=1 is csv.QUOTE_ALL: every field is quoted, so commas inside utterances are safe.
    training_df.to_csv(output_file, index=False, quoting=1)
    print(f"Training set exported to {output_file}")
Replace ACCOUNT, USERNAME, and PASSWORD with your Cognigy.AI credentials. The script runs end-to-end: authentication, log extraction, vectorization, clustering, and CSV export. It requires approximately 200 MB of RAM for 50,000 utterances using the MiniLM model.
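To put the memory figure in context, the size of the embedding matrix alone can be worked out directly. The sketch below assumes the 384-dimensional float32 output that all-MiniLM-L6-v2 produces by default. It leaves out the model weights and the Python-list copy stored in the embedding_vector column, both of which add to the total. The function name embedding_matrix_mb is hypothetical.

```python
import numpy as np


def embedding_matrix_mb(n_utterances: int, dim: int = 384,
                        dtype=np.float32) -> float:
    """Size in megabytes (10**6 bytes) of a dense n_utterances x dim matrix."""
    return n_utterances * dim * np.dtype(dtype).itemsize / 1e6


for n in (50_000, 100_000, 500_000):
    print(f"{n:>7} utterances: {embedding_matrix_mb(n):.1f} MB")
# Output:
#   50000 utterances: 76.8 MB
#  100000 utterances: 153.6 MB
#  500000 utterances: 768.0 MB
```

If memory is tight, dropping the embedding_vector column from the DataFrame is the simplest saving, because the exported CSV never includes it.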
Common Errors & Debugging
Error: 401 Unauthorized
- Cause: Invalid API credentials or missing Base64 encoding in the Authorization header.
- Fix: Verify the username and password match an API user in Cognigy.AI. Ensure the user has logs:read permissions. Test the credentials with a direct curl command before running the script.
- Code Fix: The CognigySession class automatically encodes credentials. If you receive 401, decode the header locally to verify its formatting, and keep the decoded value out of shared logs because it contains the password.
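A header check along these lines can confirm the formatting without echoing the password. It is a sketch using only the standard library. The helper names build_basic_header and describe_basic_header are hypothetical, and the credentials are the placeholders from the example script.

```python
import base64


def build_basic_header(username: str, password: str) -> str:
    """Build the Authorization value the same way CognigySession does."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def describe_basic_header(header: str) -> dict:
    """Decode a Basic header and report its shape without exposing the password."""
    scheme, _, token = header.partition(" ")
    decoded = base64.b64decode(token, validate=True).decode("utf-8")
    username, separator, password = decoded.partition(":")
    return {
        "scheme_ok": scheme == "Basic",
        "has_separator": separator == ":",
        "username": username,
        "password_length": len(password),  # length only, never the value
    }


info = describe_basic_header(build_basic_header("api_user", "api_password"))
print(info)
```

If scheme_ok and has_separator are both true and the username matches, the header is well formed and the 401 more likely comes from the credentials or the assigned role.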
Error: 429 Too Many Requests
- Cause: Exceeding Cognigy.AI rate limits during bulk log extraction. The API enforces per-user request throttling.
- Fix: The _request_with_retry method implements exponential backoff. If failures persist, reduce the extraction window (days_back) or add a fixed delay between pagination requests.
- Code Fix: Raise the max_retries argument of _request_with_retry (the get method relies on its default of 3) or increase the base retry delay.
Error: MemoryError during embedding generation
- Cause: Loading 100,000+ utterances into RAM for vectorization. The sentence-transformers model returns embeddings as a contiguous NumPy array.
- Fix: Process logs in batches. Split the utterance list into chunks of 10,000, encode each chunk, and concatenate the vectors before clustering.
- Code Fix: Replace the single model.encode() call with a loop that appends to a list, then stack with np.vstack().
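The chunked loop described in this fix might look like the sketch below. To keep it runnable without downloading a model, the encoder is passed in as a function and replaced here by a toy stand-in. The names encode_in_chunks and fake_encode are hypothetical.

```python
from typing import Callable, List, Sequence

import numpy as np


def encode_in_chunks(texts: Sequence[str],
                     encode: Callable[[List[str]], np.ndarray],
                     chunk_size: int = 10_000) -> np.ndarray:
    """Encode texts chunk by chunk and stack the results into one matrix."""
    if not texts:
        raise ValueError("No texts to encode.")
    parts = []
    for start in range(0, len(texts), chunk_size):
        chunk = list(texts[start:start + chunk_size])
        parts.append(np.asarray(encode(chunk), dtype=np.float32))
    return np.vstack(parts)


def fake_encode(chunk: List[str]) -> np.ndarray:
    """Toy stand-in for a real encoder: two numeric features per text."""
    return np.array([[len(text), text.count(" ")] for text in chunk], dtype=np.float32)


texts = [f"utterance number {i}" for i in range(25)]
vectors = encode_in_chunks(texts, fake_encode, chunk_size=10)
print(vectors.shape)  # (25, 2)
```

With the real model, the encoder argument would be something like lambda chunk: model.encode(chunk, normalize_embeddings=True). The stacked matrix still has to fit in RAM, so for much larger corpora scikit-learn's MiniBatchKMeans, which supports partial_fit, is worth evaluating as an alternative.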
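Error: ConvergenceWarning from K-Means
- Cause: The algorithm failed to converge within max_iter iterations. This occurs when clusters are poorly separated or n_clusters is too high for the data distribution. scikit-learn also emits this warning when it finds fewer distinct clusters than n_clusters, which usually points to many duplicate utterances.
- Fix: Increase max_iter to 500 or reduce n_clusters. Inspect the inertia_ attribute to locate the elbow point. If duplicates are the cause, deduplicate the utterances before clustering.
- Code Fix: Add n_init=10 explicitly to force multiple initializations and improve stability.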
Error: Empty or malformed CSV output
- Cause: Filtering logic removed all utterances, or the logs endpoint returned metadata instead of conversation records.
- Fix: Verify the from and to parameters use ISO 8601 format with Z suffix. Add a validation step that checks len(logs) > 0 before proceeding to clustering.
- Code Fix: Insert assert len(logs) > 0, "No logs retrieved. Check date range and permissions." after the fetch step.
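A slightly stricter variant of that check is sketched below. It raises exceptions instead of using assert, because Python skips assert statements when run with the -O flag, and it also covers the case where records arrive without utterances. The function name validate_logs is hypothetical, and the record shape follows the expected response structure shown earlier.

```python
from typing import Any, Dict, List


def validate_logs(logs: Any) -> List[Dict[str, Any]]:
    """Return only records with a non-empty string utterance, or raise with a reason."""
    if not isinstance(logs, list):
        raise ValueError(f"Expected a list of records, got {type(logs).__name__}.")
    if not logs:
        raise ValueError("No logs retrieved. Check date range and permissions.")
    usable = [
        record for record in logs
        if isinstance(record, dict)
        and isinstance(record.get("utterance"), str)
        and record["utterance"].strip()
    ]
    if not usable:
        raise ValueError("Records were returned, but none contains an utterance.")
    return usable


sample_logs = [
    {"id": "a", "utterance": "I want to cancel my subscription", "intent": "cancel_subscription"},
    {"id": "b", "utterance": "   ", "intent": None},
    {"id": "c", "intent": "order_status"},
]
print(len(validate_logs(sample_logs)))  # 1
```

Calling validate_logs right after the fetch step turns a silently empty CSV into an error message that names the likely cause.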