We’re building a post-call summary feature for our custom agent desktop using the Embeddable Client App SDK. The requirement is to fetch the full voice-to-text transcript immediately after a call ends so we can display key moments to the agent.
The integration works fine for text-based channels like Web Messaging. I get the JSON blob with all the message history. But for voice calls, the transcript array comes back empty, even though the call is marked as completed and the Speech Analytics dashboard clearly shows the transcription was successful.
Here is the flow I’m following:
- Listen for the `conversation:ended` event via the WebSocket client.
- Extract the `conversationId` from the event payload.
- Wait 3 seconds to allow backend processing (we’ve tried waiting up to 30 seconds with no change).
- Call `GET /api/v2/analytics/speech/evaluations` to find the evaluation ID.
- Call `GET /api/v2/analytics/speech/transcripts/{transcriptId}`.
The initial query to find the evaluation looks like this:
var request = new AnalyticsSpeechEvaluationQueryRequest
{
    DateRange = new DateRange { From = DateTime.UtcNow.AddMinutes(-5), To = DateTime.UtcNow },
    ConversationIds = new List<string> { conversationId },
    Types = new List<string> { "voice" }
};
var response = await client.AnalyticsApi.QueryAnalyticsSpeechEvaluationsAsync(request);
This returns a list of evaluations. I grab the first one and its `transcriptId`. Then I make the second call:
var transcriptResponse = await client.AnalyticsApi.GetAnalyticsSpeechTranscriptAsync(transcriptId);
The response status is 200 OK. The JSON structure looks valid, but the transcripts array is empty: { "transcripts": [] }.
I’ve checked the following:
- The conversation definitely had audio. I can hear it in the playback.
- The user token has the `analytics:report:read` and `speech:analytics:read` scopes.
- The speech analytics feature is enabled for the specific queue this conversation belonged to.
- I’ve verified the `transcriptId` is not null or empty.
Is there a specific delay or a different endpoint I need to hit to get the actual text data? Or is the transcript generated asynchronously in a way that requires polling a different status field?
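In case polling is the expected approach, here is a minimal sketch of what I would swap in for the fixed wait. It is a generic retry helper with exponential backoff and has nothing Genesys-specific in it. `TranscriptPolling` and `PollUntilAsync` are names I made up for illustration, and the defaults (6 attempts, 2 second initial delay) are guesses, not documented values.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of any SDK.
public static class TranscriptPolling
{
    // Calls fetch repeatedly until isReady returns true or maxAttempts is reached.
    // Returns the last result either way, so the caller can inspect what came back.
    public static async Task<T> PollUntilAsync<T>(
        Func<Task<T>> fetch,
        Func<T, bool> isReady,
        int maxAttempts = 6,
        TimeSpan? initialDelay = null,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        // Assumed starting delay; doubled after every unsuccessful attempt.
        var delay = initialDelay ?? TimeSpan.FromSeconds(2);

        for (var attempt = 1; ; attempt++)
        {
            var result = await fetch();
            if (isReady(result) || attempt >= maxAttempts)
                return result;

            await Task.Delay(delay, cancellationToken);
            delay = TimeSpan.FromTicks(delay.Ticks * 2);
        }
    }
}
```

I would pass `() => client.AnalyticsApi.GetAnalyticsSpeechTranscriptAsync(transcriptId)` as `fetch` and a check that the transcripts collection is non-empty as `isReady`. The exact property name on the response object is an assumption on my part. This only helps if the same endpoint eventually returns data, which is exactly what I am unsure about.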
We’re running this against the US East region. The SDK version is the latest stable release.