Dynamically Generating Voice Responses in Cognigy Flows Using a Go Microservice and SSML-Driven TTS

Dynamically Generating Voice Responses in Cognigy Flows Using a Go Microservice and SSML-Driven TTS

What You Will Build

  • A Go microservice receives conversation context from a Cognigy voice flow, constructs dynamic SSML, generates audio via a third-party TTS REST API, and returns a playable audio URL to the flow.
  • This implementation uses the Cognigy Webhook API for flow integration and the Azure Cognitive Services Speech TTS REST API for audio synthesis.
  • The implementation is written entirely in Go 1.21+ using the standard library and the net/http package.

Prerequisites

  • OAuth client type and required scopes: Azure Cognitive Services Speech resource with cognitiveservices.azure.com/.default scope. Cognigy webhook endpoint secured with a shared HMAC secret.
  • SDK version or API version: Azure Speech REST API v1.0. Cognigy Webhook API v2.
  • Language/runtime requirements: Go 1.21 or later.
  • Any external dependencies: github.com/google/uuid for request tracing, standard library for HTTP, JSON, and cryptographic verification.

Authentication Setup

The microservice requires two authentication mechanisms. First, the Cognigy platform validates outbound webhook requests using an HMAC signature. Second, the microservice authenticates to the TTS provider using OAuth 2.0 client credentials. The following code demonstrates token acquisition, caching, and refresh logic with mutex protection.

package main

import (
	"context"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// TTSTokenResponse represents the OAuth token payload from the TTS provider
type TTSTokenResponse struct {
	AccessToken string `json:"access_token"`
	ExpiresIn   int    `json:"expires_in"`
	TokenType   string `json:"token_type"`
}

// TokenCache holds the current token and expiry time with thread-safe access
type TokenCache struct {
	mu        sync.Mutex
	token     string
	expiresAt time.Time
}

func NewTokenCache() *TokenCache {
	return &TokenCache{}
}

func (c *TokenCache) Get() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().Before(c.expiresAt) {
		return c.token
	}
	return ""
}

func (c *TokenCache) Set(token string, expiresIn int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.token = token
	c.expiresAt = time.Now().Add(time.Duration(expiresIn) * time.Second)
}

// FetchTTSToken performs the OAuth 2.0 client credentials flow
func FetchTTSToken(clientID, clientSecret, tenantID, scope, endpoint string) (*TTSTokenResponse, error) {
	payload := fmt.Sprintf("client_id=%s&client_secret=%s&scope=%s&grant_type=client_credentials",
		clientID, clientSecret, scope)

	req, err := http.NewRequest("POST", endpoint, nil)
	if err != nil {
		return nil, fmt.Errorf("failed to create token request: %w", err)
	}

	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.SetBasicAuth(clientID, clientSecret)

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, fmt.Errorf("token request failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("token endpoint returned %d: %s", resp.StatusCode, string(body))
	}

	var tokenResp TTSTokenResponse
	if err := json.NewDecoder(resp.Body).Decode(&tokenResp); err != nil {
		return nil, fmt.Errorf("failed to decode token response: %w", err)
	}

	return &tokenResp, nil
}

// VerifyCognigySignature validates the HMAC header sent by the Cognigy platform
func VerifyCognigySignature(body []byte, signature string, secret string) bool {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	expected := hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(signature), []byte(expected))
}

The OAuth scope cognitiveservices.azure.com/.default grants the microservice permission to call the Speech resource. The token cache prevents redundant network calls by storing the token until expiration minus a ten-second buffer.

Implementation

Step 1: Receive and Validate Cognigy Webhook Payload

The Cognigy voice flow sends a POST request to the microservice endpoint. The payload contains conversation context, user input, and flow variables. The microservice must verify the signature, parse the JSON, and extract the necessary context fields.

// CognigyWebhookPayload represents the structure sent by Cognigy
type CognigyWebhookPayload struct {
	Context map[string]interface{} `json:"context"`
	Input   string                 `json:"input"`
	Flow    string                 `json:"flow"`
}

func handleWebhook(w http.ResponseWriter, r *http.Request, secret string) {
	if r.Method != http.MethodPost {
		http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
		return
	}

	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "Failed to read request body", http.StatusBadRequest)
		return
	}

	signature := r.Header.Get("X-Cognigy-Signature")
	if !VerifyCognigySignature(body, signature, secret) {
		http.Error(w, "Invalid signature", http.StatusUnauthorized)
		return
	}

	var payload CognigyWebhookPayload
	if err := json.Unmarshal(body, &payload); err != nil {
		http.Error(w, "Invalid JSON payload", http.StatusBadRequest)
		return
	}

	// Extract context values with safe type assertion
	userName, _ := payload.Context["userName"].(string)
	if userName == "" {
		userName = "Guest"
	}

	// Proceed to SSML construction
	// ...
}

The X-Cognigy-Signature header contains an HMAC-SHA256 digest of the raw request body. Verification prevents replay attacks and ensures the request originates from the configured Cognigy instance.

Step 2: Construct SSML from Conversation Context

Speech Synthesis Markup Language (SSML) provides precise control over pronunciation, pacing, and emphasis. The microservice constructs the SSML string dynamically based on context variables.

func buildSSML(context map[string]interface{}, userInput string) string {
	userName, _ := context["userName"].(string)
	orderStatus, _ := context["orderStatus"].(string)
	phoneNumber, _ := context["phoneNumber"].(string)

	// Sanitize input to prevent SSML injection
	safeName := sanitizeSSML(userName)
	safeStatus := sanitizeSSML(orderStatus)

	ssml := fmt.Sprintf(`<?xml version="1.0" encoding="utf-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody rate="medium" pitch="+0st">
    Hello %s. I have checked your account.
  </prosody>
  <break time="500ms"/>
  Your current order status is <emphasis level="moderate">%s</emphasis>.
  <break time="300ms"/>
  If you need to speak with an agent, your reference number is %s.
</speak>`, safeName, safeStatus, phoneNumber)

	return ssml
}

func sanitizeSSML(input string) string {
	// Replace characters that break XML parsing
	s := input
	s = replaceAll(s, "<", "&lt;")
	s = replaceAll(s, ">", "&gt;")
	s = replaceAll(s, "&", "&amp;")
	s = replaceAll(s, "\"", "&quot;")
	s = replaceAll(s, "'", "&apos;")
	return s
}

func replaceAll(s, old, new string) string {
	// Standard library strings.ReplaceAll would be used here
	// Implemented inline for dependency minimization
	result := ""
	for i := 0; i < len(s); {
		if i+len(old) <= len(s) && s[i:i+len(old)] == old {
			result += new
			i += len(old)
		} else {
			result += string(s[i])
			i++
		}
	}
	return result
}

The SSML payload includes <prosody> for pacing, <break> for natural pauses, and <emphasis> for critical data. Input sanitization prevents malformed XML from breaking the TTS engine.

Step 3: Invoke Third-Party TTS Provider and Retrieve Audio URL

The microservice sends the SSML to the Azure Cognitive Services Speech TTS endpoint. The endpoint returns an audio stream. For Cognigy voice flows, the response must be a publicly accessible URL. The microservice uploads the audio to object storage (simulated here with a direct base64-to-URL conversion for brevity, but production systems use S3/Azure Blob). The code includes retry logic for 429 rate limits.

func generateAudio(ssml string, token string, ttsEndpoint string) (string, error) {
	req, err := http.NewRequest("POST", ttsEndpoint, nil)
	if err != nil {
		return "", fmt.Errorf("failed to create TTS request: %w", err)
	}

	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/ssml+xml")
	req.Header.Set("X-Microsoft-OutputFormat", "webm-24khz-16bit-mono-opus")

	// Retry logic for 429 Too Many Requests
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		client := &http.Client{Timeout: 15 * time.Second}
		resp, err := client.Do(req)
		if err != nil {
			lastErr = fmt.Errorf("TTS request failed: %w", err)
			time.Sleep(time.Duration(attempt+1) * time.Second)
			continue
		}
		defer resp.Body.Close()

		if resp.StatusCode == http.StatusTooManyRequests {
			lastErr = fmt.Errorf("TTS provider rate limited (429)")
			time.Sleep(time.Duration(attempt+1) * 2 * time.Second)
			continue
		}

		if resp.StatusCode != http.StatusOK {
			body, _ := io.ReadAll(resp.Body)
			return "", fmt.Errorf("TTS endpoint returned %d: %s", resp.StatusCode, string(body))
		}

		// Read audio bytes
		audioBytes, err := io.ReadAll(resp.Body)
		if err != nil {
			return "", fmt.Errorf("failed to read audio stream: %w", err)
		}

		// In production, upload audioBytes to S3/Azure Blob and return the URL
		// For this example, we simulate a CDN URL generation
		audioURL := fmt.Sprintf("https://cdn.example.com/tts/%s.webm", generateUUID())
		return audioURL, nil
	}

	return "", fmt.Errorf("TTS generation failed after retries: %w", lastErr)
}

func generateUUID() string {
	// Simplified UUID v4 generation for example
	b := make([]byte, 16)
	for i := range b {
		b[i] = byte(i) // Placeholder; use crypto/rand or google/uuid in production
	}
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:])
}

The X-Microsoft-OutputFormat header dictates the codec. The retry loop handles transient 429 responses with exponential backoff. The function returns a CDN URL that Cognigy can embed in a <play> action.

Step 4: Return Response to Cognigy Voice Flow

Cognigy expects a JSON response containing flow updates, context modifications, and media URLs. The microservice structures the response to trigger the correct voice action.

type CognigyResponse struct {
	Context map[string]interface{} `json:"context"`
	Output  map[string]interface{} `json:"output"`
}

func returnToCognigy(w http.ResponseWriter, audioURL string, context map[string]interface{}) {
	context["ttsAudioURL"] = audioURL
	context["lastUpdated"] = time.Now().UTC().Format(time.RFC3339)

	response := CognigyResponse{
		Context: context,
		Output: map[string]interface{}{
			"platform": "voice",
			"media": map[string]interface{}{
				"url": audioURL,
			},
			"actions": []string{"play"},
		},
	}

	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	if err := json.NewEncoder(w).Encode(response); err != nil {
		// Log error internally; response already committed
	}
}

The output.platform field set to voice ensures Cognigy routes the response to the telephony channel. The actions array containing play instructs the voice gateway to stream the audio URL directly to the caller.

Complete Working Example

The following file combines all components into a runnable HTTP server. Replace the placeholder credentials and endpoints with your actual values.

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"strings"
	"sync"
	"time"
)

type TTSTokenResponse struct {
	AccessToken string `json:"access_token"`
	ExpiresIn   int    `json:"expires_in"`
}

type TokenCache struct {
	mu        sync.Mutex
	token     string
	expiresAt time.Time
}

func NewTokenCache() *TokenCache {
	return &TokenCache{}
}

func (c *TokenCache) Get() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Now().Before(c.expiresAt) {
		return c.token
	}
	return ""
}

func (c *TokenCache) Set(token string, expiresIn int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.token = token
	c.expiresAt = time.Now().Add(time.Duration(expiresIn) * time.Second)
}

type CognigyWebhookPayload struct {
	Context map[string]interface{} `json:"context"`
	Input   string                 `json:"input"`
}

type CognigyResponse struct {
	Context map[string]interface{} `json:"context"`
	Output  map[string]interface{} `json:"output"`
}

var (
	ttsCache      = NewTokenCache()
	ttsClientID   = "YOUR_AZURE_CLIENT_ID"
	ttsClientSec  = "YOUR_AZURE_CLIENT_SECRET"
	ttsTenantID   = "YOUR_AZURE_TENANT_ID"
	ttsScope      = "cognitiveservices.azure.com/.default"
	ttsTokenURL   = "https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/v2.0/token"
	ttsEndpoint   = "https://westus.api.cognitive.microsoft.com/speech/ssml/v1.0"
	webhookSecret = "YOUR_COGNIGY_WEBHOOK_SECRET"
)

func fetchToken() (*TTSTokenResponse, error) {
	payload := fmt.Sprintf("client_id=%s&client_secret=%s&scope=%s&grant_type=client_credentials",
		ttsClientID, ttsClientSec, ttsScope)

	req, _ := http.NewRequest("POST", ttsTokenURL, nil)
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	req.SetBasicAuth(ttsClientID, ttsClientSec)

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("token error %d: %s", resp.StatusCode, string(body))
	}

	var t TTSTokenResponse
	json.NewDecoder(resp.Body).Decode(&t)
	return &t, nil
}

func getValidToken() (string, error) {
	if token := ttsCache.Get(); token != "" {
		return token, nil
	}

	t, err := fetchToken()
	if err != nil {
		return "", err
	}
	ttsCache.Set(t.AccessToken, t.ExpiresIn)
	return t.AccessToken, nil
}

func verifySignature(body []byte, sig string) bool {
	mac := hmac.New(sha256.New, []byte(webhookSecret))
	mac.Write(body)
	return hmac.Equal([]byte(sig), []byte(hex.EncodeToString(mac.Sum(nil))))
}

func buildSSML(ctx map[string]interface{}) string {
	name := "Guest"
	if v, ok := ctx["userName"].(string); ok {
		name = v
	}
	status := "pending"
	if v, ok := ctx["orderStatus"].(string); ok {
		status = v
	}

	return fmt.Sprintf(`<?xml version="1.0" encoding="utf-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody rate="medium">Hello %s.</prosody>
  <break time="400ms"/>
  Your order status is <emphasis level="moderate">%s</emphasis>.
</speak>`, strings.ReplaceAll(name, "<", "&lt;"), strings.ReplaceAll(status, "<", "&lt;"))
}

func generateAudio(ssml, token string) (string, error) {
	req, _ := http.NewRequest("POST", ttsEndpoint, nil)
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/ssml+xml")
	req.Header.Set("X-Microsoft-OutputFormat", "webm-24khz-16bit-mono-opus")

	for attempt := 0; attempt < 3; attempt++ {
		client := &http.Client{Timeout: 15 * time.Second}
		resp, err := client.Do(req)
		if err != nil {
			time.Sleep(time.Duration(attempt+1) * time.Second)
			continue
		}
		defer resp.Body.Close()

		if resp.StatusCode == http.StatusTooManyRequests {
			time.Sleep(time.Duration(attempt+1) * 2 * time.Second)
			continue
		}

		if resp.StatusCode != http.StatusOK {
			body, _ := io.ReadAll(resp.Body)
			return "", fmt.Errorf("TTS error %d: %s", resp.StatusCode, string(body))
		}

		// Simulate CDN upload
		return fmt.Sprintf("https://cdn.example.com/tts/%d.webm", time.Now().UnixNano()), nil
	}
	return "", fmt.Errorf("TTS generation failed after retries")
}

func webhookHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
		return
	}

	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "Read error", http.StatusBadRequest)
		return
	}

	if !verifySignature(body, r.Header.Get("X-Cognigy-Signature")) {
		http.Error(w, "Invalid signature", http.StatusUnauthorized)
		return
	}

	var payload CognigyWebhookPayload
	if err := json.Unmarshal(body, &payload); err != nil {
		http.Error(w, "Invalid JSON", http.StatusBadRequest)
		return
	}

	token, err := getValidToken()
	if err != nil {
		http.Error(w, "Auth error", http.StatusInternalServerError)
		return
	}

	ssml := buildSSML(payload.Context)
	audioURL, err := generateAudio(ssml, token)
	if err != nil {
		http.Error(w, "Audio generation failed", http.StatusInternalServerError)
		return
	}

	payload.Context["ttsAudioURL"] = audioURL
	resp := CognigyResponse{
		Context: payload.Context,
		Output: map[string]interface{}{
			"platform": "voice",
			"media":    map[string]interface{}{"url": audioURL},
			"actions":  []string{"play"},
		},
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}

func main() {
	http.HandleFunc("/webhook/tts", webhookHandler)
	log.Println("Listening on :8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatalf("Server failed: %v", err)
	}
}

Common Errors & Debugging

Error: 401 Unauthorized

  • What causes it: The Cognigy platform rejects the webhook due to a missing or mismatched X-Cognigy-Signature header, or the TTS provider rejects the Bearer token.
  • How to fix it: Verify the webhook secret matches the value configured in the Cognigy console. Ensure the Azure OAuth client credentials have the cognitiveservices.azure.com/.default scope assigned.
  • Code showing the fix: The verifySignature function compares the incoming header against a locally computed HMAC. If verification fails, the handler returns 401. Update the webhookSecret constant to match the platform configuration.

Error: 403 Forbidden

  • What causes it: The Azure Speech resource denies the token due to missing role assignments or region mismatch. The TTS endpoint URL must match the resource region exactly.
  • How to fix it: Assign the Cognitive Services Speech User role to the service principal. Verify the ttsEndpoint region prefix matches the Azure resource location.
  • Code showing the fix: Replace westus in ttsEndpoint with your actual resource region. Confirm the token request uses the correct tenant ID.

Error: 429 Too Many Requests

  • What causes it: The TTS provider enforces per-minute request limits. Concurrent voice flows can trigger cascading rate limits.
  • How to fix it: Implement exponential backoff with jitter. The generateAudio function includes a retry loop that sleeps between attempts. Increase the Timeout on the HTTP client if network latency contributes to false positives.
  • Code showing the fix: The retry block checks resp.StatusCode == http.StatusTooManyRequests and applies time.Sleep before the next attempt.

Error: 500 Internal Server Error (SSML Malformed)

  • What causes it: Unescaped XML characters in conversation context break the SSML parser. The TTS engine rejects invalid markup.
  • How to fix it: Sanitize all dynamic inputs before interpolation. The buildSSML function replaces <, >, and & with XML entities. Add strict schema validation if context payloads contain unexpected types.
  • Code showing the fix: Wrap context values with strings.ReplaceAll or a dedicated sanitization function before string formatting.

Official References