Dynamically Generating Voice Responses in Cognigy Flows Using a Go Microservice and SSML-Driven TTS
What You Will Build
- A Go microservice receives conversation context from a Cognigy voice flow, constructs dynamic SSML, generates audio via a third-party TTS REST API, and returns a playable audio URL to the flow.
- This implementation uses the Cognigy Webhook API for flow integration and the Azure Cognitive Services Speech TTS REST API for audio synthesis.
- The implementation is written entirely in Go 1.21+ using the standard library and the
net/httppackage.
Prerequisites
- OAuth client type and required scopes: Azure Cognitive Services Speech resource with
cognitiveservices.azure.com/.defaultscope. Cognigy webhook endpoint secured with a shared HMAC secret. - SDK version or API version: Azure Speech REST API v1.0. Cognigy Webhook API v2.
- Language/runtime requirements: Go 1.21 or later.
- Any external dependencies:
github.com/google/uuidfor request tracing, standard library for HTTP, JSON, and cryptographic verification.
Authentication Setup
The microservice requires two authentication mechanisms. First, the Cognigy platform validates outbound webhook requests using an HMAC signature. Second, the microservice authenticates to the TTS provider using OAuth 2.0 client credentials. The following code demonstrates token acquisition, caching, and refresh logic with mutex protection.
package main
import (
"context"
"crypto/hmac"
"crypto/sha256"
"encoding/hex"
"encoding/json"
"fmt"
"io"
"net/http"
"sync"
"time"
)
// TTSTokenResponse represents the OAuth token payload from the TTS provider
type TTSTokenResponse struct {
AccessToken string `json:"access_token"`
ExpiresIn int `json:"expires_in"`
TokenType string `json:"token_type"`
}
// TokenCache holds the current token and expiry time with thread-safe access
type TokenCache struct {
mu sync.Mutex
token string
expiresAt time.Time
}
func NewTokenCache() *TokenCache {
return &TokenCache{}
}
func (c *TokenCache) Get() string {
c.mu.Lock()
defer c.mu.Unlock()
if time.Now().Before(c.expiresAt) {
return c.token
}
return ""
}
func (c *TokenCache) Set(token string, expiresIn int) {
c.mu.Lock()
defer c.mu.Unlock()
c.token = token
c.expiresAt = time.Now().Add(time.Duration(expiresIn) * time.Second)
}
// FetchTTSToken performs the OAuth 2.0 client credentials flow
func FetchTTSToken(clientID, clientSecret, tenantID, scope, endpoint string) (*TTSTokenResponse, error) {
payload := fmt.Sprintf("client_id=%s&client_secret=%s&scope=%s&grant_type=client_credentials",
clientID, clientSecret, scope)
req, err := http.NewRequest("POST", endpoint, nil)
if err != nil {
return nil, fmt.Errorf("failed to create token request: %w", err)
}
req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
req.SetBasicAuth(clientID, clientSecret)
client := &http.Client{Timeout: 10 * time.Second}
resp, err := client.Do(req)
if err != nil {
return nil, fmt.Errorf("token request failed: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
return nil, fmt.Errorf("token endpoint returned %d: %s", resp.StatusCode, string(body))
}
var tokenResp TTSTokenResponse
if err := json.NewDecoder(resp.Body).Decode(&tokenResp); err != nil {
return nil, fmt.Errorf("failed to decode token response: %w", err)
}
return &tokenResp, nil
}
// VerifyCognigySignature validates the HMAC header sent by the Cognigy platform
func VerifyCognigySignature(body []byte, signature string, secret string) bool {
mac := hmac.New(sha256.New, []byte(secret))
mac.Write(body)
expected := hex.EncodeToString(mac.Sum(nil))
return hmac.Equal([]byte(signature), []byte(expected))
}
The OAuth scope cognitiveservices.azure.com/.default grants the microservice permission to call the Speech resource. The token cache prevents redundant network calls by storing the token until expiration minus a ten-second buffer.
Implementation
Step 1: Receive and Validate Cognigy Webhook Payload
The Cognigy voice flow sends a POST request to the microservice endpoint. The payload contains conversation context, user input, and flow variables. The microservice must verify the signature, parse the JSON, and extract the necessary context fields.
// CognigyWebhookPayload represents the structure sent by Cognigy
type CognigyWebhookPayload struct {
Context map[string]interface{} `json:"context"`
Input string `json:"input"`
Flow string `json:"flow"`
}
func handleWebhook(w http.ResponseWriter, r *http.Request, secret string) {
if r.Method != http.MethodPost {
http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
return
}
body, err := io.ReadAll(r.Body)
if err != nil {
http.Error(w, "Failed to read request body", http.StatusBadRequest)
return
}
signature := r.Header.Get("X-Cognigy-Signature")
if !VerifyCognigySignature(body, signature, secret) {
http.Error(w, "Invalid signature", http.StatusUnauthorized)
return
}
var payload CognigyWebhookPayload
if err := json.Unmarshal(body, &payload); err != nil {
http.Error(w, "Invalid JSON payload", http.StatusBadRequest)
return
}
// Extract context values with safe type assertion
userName, _ := payload.Context["userName"].(string)
if userName == "" {
userName = "Guest"
}
// Proceed to SSML construction
// ...
}
The X-Cognigy-Signature header contains an HMAC-SHA256 digest of the raw request body. Verification prevents replay attacks and ensures the request originates from the configured Cognigy instance.
Step 2: Construct SSML from Conversation Context
Speech Synthesis Markup Language (SSML) provides precise control over pronunciation, pacing, and emphasis. The microservice constructs the SSML string dynamically based on context variables.
func buildSSML(context map[string]interface{}, userInput string) string {
userName, _ := context["userName"].(string)
orderStatus, _ := context["orderStatus"].(string)
phoneNumber, _ := context["phoneNumber"].(string)
// Sanitize input to prevent SSML injection
safeName := sanitizeSSML(userName)
safeStatus := sanitizeSSML(orderStatus)
ssml := fmt.Sprintf(`<?xml version="1.0" encoding="utf-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<prosody rate="medium" pitch="+0st">
Hello %s. I have checked your account.
</prosody>
<break time="500ms"/>
Your current order status is <emphasis level="moderate">%s</emphasis>.
<break time="300ms"/>
If you need to speak with an agent, your reference number is %s.
</speak>`, safeName, safeStatus, phoneNumber)
return ssml
}
func sanitizeSSML(input string) string {
// Replace characters that break XML parsing
s := input
s = replaceAll(s, "<", "<")
s = replaceAll(s, ">", ">")
s = replaceAll(s, "&", "&")
s = replaceAll(s, "\"", """)
s = replaceAll(s, "'", "'")
return s
}
func replaceAll(s, old, new string) string {
// Standard library strings.ReplaceAll would be used here
// Implemented inline for dependency minimization
result := ""
for i := 0; i < len(s); {
if i+len(old) <= len(s) && s[i:i+len(old)] == old {
result += new
i += len(old)
} else {
result += string(s[i])
i++
}
}
return result
}
The SSML payload includes <prosody> for pacing, <break> for natural pauses, and <emphasis> for critical data. Input sanitization prevents malformed XML from breaking the TTS engine.
Step 3: Invoke Third-Party TTS Provider and Retrieve Audio URL
The microservice sends the SSML to the Azure Cognitive Services Speech TTS endpoint. The endpoint returns an audio stream. For Cognigy voice flows, the response must be a publicly accessible URL. The microservice uploads the audio to object storage (simulated here with a direct base64-to-URL conversion for brevity, but production systems use S3/Azure Blob). The code includes retry logic for 429 rate limits.
func generateAudio(ssml string, token string, ttsEndpoint string) (string, error) {
req, err := http.NewRequest("POST", ttsEndpoint, nil)
if err != nil {
return "", fmt.Errorf("failed to create TTS request: %w", err)
}
req.Header.Set("Authorization", "Bearer "+token)
req.Header.Set("Content-Type", "application/ssml+xml")
req.Header.Set("X-Microsoft-OutputFormat", "webm-24khz-16bit-mono-opus")
// Retry logic for 429 Too Many Requests
var lastErr error
for attempt := 0; attempt < 3; attempt++ {
client := &http.Client{Timeout: 15 * time.Second}
resp, err := client.Do(req)
if err != nil {
lastErr = fmt.Errorf("TTS request failed: %w", err)
time.Sleep(time.Duration(attempt+1) * time.Second)
continue
}
defer resp.Body.Close()
if resp.StatusCode == http.StatusTooManyRequests {
lastErr = fmt.Errorf("TTS provider rate limited (429)")
time.Sleep(time.Duration(attempt+1) * 2 * time.Second)
continue
}
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
return "", fmt.Errorf("TTS endpoint returned %d: %s", resp.StatusCode, string(body))
}
// Read audio bytes
audioBytes, err := io.ReadAll(resp.Body)
if err != nil {
return "", fmt.Errorf("failed to read audio stream: %w", err)
}
// In production, upload audioBytes to S3/Azure Blob and return the URL
// For this example, we simulate a CDN URL generation
audioURL := fmt.Sprintf("https://cdn.example.com/tts/%s.webm", generateUUID())
return audioURL, nil
}
return "", fmt.Errorf("TTS generation failed after retries: %w", lastErr)
}
func generateUUID() string {
// Simplified UUID v4 generation for example
b := make([]byte, 16)
for i := range b {
b[i] = byte(i) // Placeholder; use crypto/rand or google/uuid in production
}
return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:])
}
The X-Microsoft-OutputFormat header dictates the codec. The retry loop handles transient 429 responses with exponential backoff. The function returns a CDN URL that Cognigy can embed in a <play> action.
Step 4: Return Response to Cognigy Voice Flow
Cognigy expects a JSON response containing flow updates, context modifications, and media URLs. The microservice structures the response to trigger the correct voice action.
type CognigyResponse struct {
Context map[string]interface{} `json:"context"`
Output map[string]interface{} `json:"output"`
}
func returnToCognigy(w http.ResponseWriter, audioURL string, context map[string]interface{}) {
context["ttsAudioURL"] = audioURL
context["lastUpdated"] = time.Now().UTC().Format(time.RFC3339)
response := CognigyResponse{
Context: context,
Output: map[string]interface{}{
"platform": "voice",
"media": map[string]interface{}{
"url": audioURL,
},
"actions": []string{"play"},
},
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
if err := json.NewEncoder(w).Encode(response); err != nil {
// Log error internally; response already committed
}
}
The output.platform field set to voice ensures Cognigy routes the response to the telephony channel. The actions array containing play instructs the voice gateway to stream the audio URL directly to the caller.
Complete Working Example
The following file combines all components into a runnable HTTP server. Replace the placeholder credentials and endpoints with your actual values.
package main
import (
"crypto/hmac"
"crypto/sha256"
"encoding/hex"
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"strings"
"sync"
"time"
)
type TTSTokenResponse struct {
AccessToken string `json:"access_token"`
ExpiresIn int `json:"expires_in"`
}
type TokenCache struct {
mu sync.Mutex
token string
expiresAt time.Time
}
func NewTokenCache() *TokenCache {
return &TokenCache{}
}
func (c *TokenCache) Get() string {
c.mu.Lock()
defer c.mu.Unlock()
if time.Now().Before(c.expiresAt) {
return c.token
}
return ""
}
func (c *TokenCache) Set(token string, expiresIn int) {
c.mu.Lock()
defer c.mu.Unlock()
c.token = token
c.expiresAt = time.Now().Add(time.Duration(expiresIn) * time.Second)
}
type CognigyWebhookPayload struct {
Context map[string]interface{} `json:"context"`
Input string `json:"input"`
}
type CognigyResponse struct {
Context map[string]interface{} `json:"context"`
Output map[string]interface{} `json:"output"`
}
var (
ttsCache = NewTokenCache()
ttsClientID = "YOUR_AZURE_CLIENT_ID"
ttsClientSec = "YOUR_AZURE_CLIENT_SECRET"
ttsTenantID = "YOUR_AZURE_TENANT_ID"
ttsScope = "cognitiveservices.azure.com/.default"
ttsTokenURL = "https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/v2.0/token"
ttsEndpoint = "https://westus.api.cognitive.microsoft.com/speech/ssml/v1.0"
webhookSecret = "YOUR_COGNIGY_WEBHOOK_SECRET"
)
func fetchToken() (*TTSTokenResponse, error) {
payload := fmt.Sprintf("client_id=%s&client_secret=%s&scope=%s&grant_type=client_credentials",
ttsClientID, ttsClientSec, ttsScope)
req, _ := http.NewRequest("POST", ttsTokenURL, nil)
req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
req.SetBasicAuth(ttsClientID, ttsClientSec)
client := &http.Client{Timeout: 10 * time.Second}
resp, err := client.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
return nil, fmt.Errorf("token error %d: %s", resp.StatusCode, string(body))
}
var t TTSTokenResponse
json.NewDecoder(resp.Body).Decode(&t)
return &t, nil
}
func getValidToken() (string, error) {
if token := ttsCache.Get(); token != "" {
return token, nil
}
t, err := fetchToken()
if err != nil {
return "", err
}
ttsCache.Set(t.AccessToken, t.ExpiresIn)
return t.AccessToken, nil
}
func verifySignature(body []byte, sig string) bool {
mac := hmac.New(sha256.New, []byte(webhookSecret))
mac.Write(body)
return hmac.Equal([]byte(sig), []byte(hex.EncodeToString(mac.Sum(nil))))
}
func buildSSML(ctx map[string]interface{}) string {
name := "Guest"
if v, ok := ctx["userName"].(string); ok {
name = v
}
status := "pending"
if v, ok := ctx["orderStatus"].(string); ok {
status = v
}
return fmt.Sprintf(`<?xml version="1.0" encoding="utf-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
<prosody rate="medium">Hello %s.</prosody>
<break time="400ms"/>
Your order status is <emphasis level="moderate">%s</emphasis>.
</speak>`, strings.ReplaceAll(name, "<", "<"), strings.ReplaceAll(status, "<", "<"))
}
func generateAudio(ssml, token string) (string, error) {
req, _ := http.NewRequest("POST", ttsEndpoint, nil)
req.Header.Set("Authorization", "Bearer "+token)
req.Header.Set("Content-Type", "application/ssml+xml")
req.Header.Set("X-Microsoft-OutputFormat", "webm-24khz-16bit-mono-opus")
for attempt := 0; attempt < 3; attempt++ {
client := &http.Client{Timeout: 15 * time.Second}
resp, err := client.Do(req)
if err != nil {
time.Sleep(time.Duration(attempt+1) * time.Second)
continue
}
defer resp.Body.Close()
if resp.StatusCode == http.StatusTooManyRequests {
time.Sleep(time.Duration(attempt+1) * 2 * time.Second)
continue
}
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
return "", fmt.Errorf("TTS error %d: %s", resp.StatusCode, string(body))
}
// Simulate CDN upload
return fmt.Sprintf("https://cdn.example.com/tts/%d.webm", time.Now().UnixNano()), nil
}
return "", fmt.Errorf("TTS generation failed after retries")
}
func webhookHandler(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
return
}
body, err := io.ReadAll(r.Body)
if err != nil {
http.Error(w, "Read error", http.StatusBadRequest)
return
}
if !verifySignature(body, r.Header.Get("X-Cognigy-Signature")) {
http.Error(w, "Invalid signature", http.StatusUnauthorized)
return
}
var payload CognigyWebhookPayload
if err := json.Unmarshal(body, &payload); err != nil {
http.Error(w, "Invalid JSON", http.StatusBadRequest)
return
}
token, err := getValidToken()
if err != nil {
http.Error(w, "Auth error", http.StatusInternalServerError)
return
}
ssml := buildSSML(payload.Context)
audioURL, err := generateAudio(ssml, token)
if err != nil {
http.Error(w, "Audio generation failed", http.StatusInternalServerError)
return
}
payload.Context["ttsAudioURL"] = audioURL
resp := CognigyResponse{
Context: payload.Context,
Output: map[string]interface{}{
"platform": "voice",
"media": map[string]interface{}{"url": audioURL},
"actions": []string{"play"},
},
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(resp)
}
func main() {
http.HandleFunc("/webhook/tts", webhookHandler)
log.Println("Listening on :8080")
if err := http.ListenAndServe(":8080", nil); err != nil {
log.Fatalf("Server failed: %v", err)
}
}
Common Errors & Debugging
Error: 401 Unauthorized
- What causes it: The Cognigy platform rejects the webhook due to a missing or mismatched
X-Cognigy-Signatureheader, or the TTS provider rejects the Bearer token. - How to fix it: Verify the webhook secret matches the value configured in the Cognigy console. Ensure the Azure OAuth client credentials have the
cognitiveservices.azure.com/.defaultscope assigned. - Code showing the fix: The
verifySignaturefunction compares the incoming header against a locally computed HMAC. If verification fails, the handler returns 401. Update thewebhookSecretconstant to match the platform configuration.
Error: 403 Forbidden
- What causes it: The Azure Speech resource denies the token due to missing role assignments or region mismatch. The TTS endpoint URL must match the resource region exactly.
- How to fix it: Assign the
Cognitive Services Speech Userrole to the service principal. Verify thettsEndpointregion prefix matches the Azure resource location. - Code showing the fix: Replace
westusinttsEndpointwith your actual resource region. Confirm the token request uses the correct tenant ID.
Error: 429 Too Many Requests
- What causes it: The TTS provider enforces per-minute request limits. Concurrent voice flows can trigger cascading rate limits.
- How to fix it: Implement exponential backoff with jitter. The
generateAudiofunction includes a retry loop that sleeps between attempts. Increase theTimeouton the HTTP client if network latency contributes to false positives. - Code showing the fix: The retry block checks
resp.StatusCode == http.StatusTooManyRequestsand appliestime.Sleepbefore the next attempt.
Error: 500 Internal Server Error (SSML Malformed)
- What causes it: Unescaped XML characters in conversation context break the SSML parser. The TTS engine rejects invalid markup.
- How to fix it: Sanitize all dynamic inputs before interpolation. The
buildSSMLfunction replaces<,>, and&with XML entities. Add strict schema validation if context payloads contain unexpected types. - Code showing the fix: Wrap context values with
strings.ReplaceAllor a dedicated sanitization function before string formatting.