Robust handling of mid-sentence self-corrections in continuous multi-step tool execution

Develop real-time voice agent architectures and inference strategies that can reliably handle mid-sentence self-corrections in continuous speech by updating or rolling back parameters for multi-step API tool execution without sacrificing conversational latency or flow.

Background

FDB-v3 includes 21 scenarios that require recognizing and applying mid-utterance corrections to tool parameters. Across six evaluated systems (GPT-Realtime, Gemini Live 2.5 and 3.1, Grok, Ultravox, and a cascaded Whisper→GPT-4o→TTS pipeline), self-corrections were the most challenging category: even the best system failed in over 40% of such cases.

The difficulty arises because models often commit intermediate parameters before a correction arrives. Reliable rollback or dynamic state updates are therefore essential, yet current systems handle them poorly. The authors explicitly state that handling these corrections remains unresolved for all current models.
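
One way to picture the rollback requirement is a staging buffer that holds tool parameters until the utterance is final and keeps an undo log of every update. The sketch below is illustrative only: it is not taken from FDB-v3 or from any of the evaluated systems, and `StagedToolCall`, `set_param`, `rollback`, `commit`, and `book_flight` are hypothetical names introduced here. It also assumes the agent can defer execution until end of turn, which may not hold under tight latency budgets.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class StagedToolCall:
    """Buffers tool parameters until the utterance is final (hypothetical helper)."""
    tool: Callable[..., Any]
    params: dict[str, Any] = field(default_factory=dict)
    # Undo log of (name, had_previous_value, previous_value) entries.
    journal: list[tuple[str, bool, Any]] = field(default_factory=list)
    committed: bool = False

    def set_param(self, name: str, value: Any) -> None:
        """Stage a value, recording what it replaced so it can be undone."""
        if self.committed:
            raise RuntimeError("call already committed; a compensating call is needed")
        self.journal.append((name, name in self.params, self.params.get(name)))
        self.params[name] = value

    def rollback(self, steps: int = 1) -> None:
        """Undo the most recent staged updates, newest first."""
        for _ in range(min(steps, len(self.journal))):
            name, existed, previous = self.journal.pop()
            if existed:
                self.params[name] = previous
            else:
                self.params.pop(name, None)

    def commit(self) -> Any:
        """Execute the tool once, with whatever parameters survived."""
        self.committed = True
        return self.tool(**self.params)


def book_flight(destination: str, date: str) -> str:
    # Stand-in for a real API call (assumption for illustration).
    return f"booked {destination} on {date}"


# "Book a flight to Boston on Friday ... no wait, to Austin."
call = StagedToolCall(book_flight)
call.set_param("destination", "Boston")
call.set_param("date", "Friday")
call.set_param("destination", "Austin")  # correction overwrites the staged value
result = call.commit()
```

The design choice here is to make the commit point explicit: corrections that arrive before `commit` cost only a dictionary update, while corrections that arrive afterwards raise an error, signalling that a compensating API call would be needed instead.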

References

FDB-v3 shows that handling mid-sentence corrections remains an open challenge for all current models.