Hello everyone! We are developing a new AppFoundry integration that utilizes the AudioHook API for real-time sentiment analysis. While the raw audio stream is performing as expected, we are observing significant latency in the metadata events that provide the speaker identification and barge-in signals. This delay is causing our sentiment engine to associate the audio with the wrong participant during rapid exchanges. Is there a way to synchronize the metadata stream more tightly with the RTP packets, or should we be calculating the offsets manually within our websocket server?
Hey. Welcome to the AudioHook struggle. I have helped a few partners with this.
The metadata events are sent over the same websocket connection as the audio, but they are not frame-aligned with the audio payloads. You have to take the timestamp field in the metadata and correlate it with the sequence of audio frames on your side.
It is the only way to get sub-100ms accuracy.
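To make the correlation idea concrete, here is a minimal sketch of how it could look on the server side. Everything in it is an assumption for illustration, not the AudioHook API: the `AudioTimeline` class is hypothetical, the audio is assumed to be 8 kHz, 1 byte per sample, 2 channels, and the metadata timestamp is assumed to be already converted to seconds from the start of the stream. Adjust the constants to whatever your session actually negotiates.

```python
import bisect
from dataclasses import dataclass, field

# Assumed audio format; check what your session actually negotiated.
SAMPLE_RATE_HZ = 8000
BYTES_PER_SAMPLE = 1
CHANNELS = 2


@dataclass
class AudioTimeline:
    """Hypothetical helper: tracks where each received audio frame starts."""
    frame_starts: list = field(default_factory=list)  # start offsets in seconds
    samples_seen: int = 0  # samples per channel received so far

    def add_frame(self, payload: bytes) -> int:
        """Record one audio frame and return its index."""
        self.frame_starts.append(self.samples_seen / SAMPLE_RATE_HZ)
        self.samples_seen += len(payload) // (BYTES_PER_SAMPLE * CHANNELS)
        return len(self.frame_starts) - 1

    def frame_at(self, offset_s: float):
        """Index of the frame covering offset_s, or None if that audio
        has not been received yet (or the offset is negative)."""
        if offset_s < 0 or offset_s >= self.samples_seen / SAMPLE_RATE_HZ:
            return None
        return bisect.bisect_right(self.frame_starts, offset_s) - 1


timeline = AudioTimeline()
for _ in range(3):
    timeline.add_frame(bytes(1600))  # 100 ms per frame under the assumed format

# A metadata event stamped 150 ms into the stream belongs to the second frame.
event_frame = timeline.frame_at(0.15)
```

The point is that the event is attached to a frame by its own timestamp, not by the order in which it arrived on the socket. A `None` result means the audio for that offset is not in yet, so the event should be held and retried.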
I am just getting started with Agent Assist, but this sounds like a serious problem for any real-time AI. If the speaker identification is wrong, the suggestions will be useless. To the previous poster: are you seeing this more on BYOC Cloud, or is it also happening on Premise Edges? I wonder if network jitter is affecting the websocket more than standard RTP.
I have been testing this with the Python SDK and the results are consistent. The platform prioritizes the audio delivery to ensure there is no clipping, which sometimes means the metadata frames are slightly buffered. If you are doing speaker separation, you must implement your own jitter buffer for the metadata on your server.
Do not rely on the platform to deliver them in perfect sync with the audio.
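For anyone who wants a starting point, this is roughly what a metadata jitter buffer can look like. It is only a sketch under my own assumptions: `MetadataJitterBuffer` and the 100 ms hold window are hypothetical, and both the event offsets and the audio position are assumed to be seconds from the start of the stream. Events are held until the audio has moved a little past them, then released in timestamp order.

```python
import heapq
import itertools


class MetadataJitterBuffer:
    """Hypothetical helper: holds metadata events briefly so that late or
    out-of-order events can still be released in timestamp order."""

    def __init__(self, hold_s: float = 0.1):
        self.hold_s = hold_s          # assumed hold window; tune for your traffic
        self._heap = []               # min-heap ordered by event offset
        self._tie = itertools.count() # tie-breaker so events are never compared

    def push(self, offset_s: float, event) -> None:
        """Queue an event stamped at offset_s seconds into the stream."""
        heapq.heappush(self._heap, (offset_s, next(self._tie), event))

    def pop_ready(self, audio_position_s: float) -> list:
        """Return (offset, event) pairs that are at least hold_s behind the
        current audio position, oldest first."""
        ready = []
        while self._heap and self._heap[0][0] + self.hold_s <= audio_position_s:
            offset_s, _, event = heapq.heappop(self._heap)
            ready.append((offset_s, event))
        return ready


buf = MetadataJitterBuffer(hold_s=0.1)
buf.push(0.30, {"type": "barge-in"})        # arrives first, but is later in the stream
buf.push(0.10, {"type": "speaker-change"})  # arrives second, but is earlier
first_batch = buf.pop_ready(audio_position_s=0.25)  # only the 0.10 event is old enough
```

The trade-off is the usual one: a longer hold window tolerates more delay and reordering in the metadata but adds that much latency before your sentiment engine sees an event.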