
SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation (2505.17060v1)

Published 17 May 2025 in cs.CL and cs.AI

Abstract: In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.

Summary

  • The paper introduces SALMONN-omni, a standalone speech LLM that achieves full-duplex conversation without relying on codec injection, simplifying the architecture.
  • SALMONN-omni employs a dynamic thinking mechanism within its LLM to manage speaking and listening states and handle multiple speech streams in a unified autoregressive manner.
  • Experimental results show at least a 30% relative performance improvement over existing open-source full-duplex models on spoken QA and open-domain dialogue benchmarks, along with effective handling of turn-taking and barge-ins.

An Examination of SALMONN-omni: Advances in Full-duplex Speech LLMs

Fluid and natural human-machine speech interaction has historically been hampered by context-dependent barge-ins and echo cancellation, obstacles compounded by the modular architectures of existing conversational systems. The paper "SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation" presents an approach that removes codec injection from the pipeline entirely while delivering notable performance improvements in full-duplex communication scenarios.

Methodological Innovation

SALMONN-omni introduces a dynamic thinking mechanism within its LLM backbone, which equips the model to autonomously transition between speaking and listening states. This marks a substantial departure from systems that inject audio codecs into the LLM token space, such as Moshi, SyncLLM, and OmniFlatten. By replacing the codec-dependent approach with a codec-free framework, SALMONN-omni handles multiple speech streams in a unified autoregressive fashion, incorporating temporal awareness and achieving synchronous dialogue state transitions.
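To make the idea concrete, here is a minimal, purely illustrative sketch of such a decode loop: the model itself emits state tokens that switch it between listening and speaking, rather than an external voice activity detector making that call. Every name here (`<listen>`, `<speak>`, `toy_policy`, the frame format) is a hypothetical stand-in; the paper does not publish this interface.

```python
# Illustrative sketch of a "dynamic thinking" full-duplex loop, assuming a
# hypothetical interface: the backbone decides per frame whether to keep
# listening or to speak, by emitting special state tokens.

LISTEN, SPEAK = "<listen>", "<speak>"

def toy_policy(context):
    """Stand-in for the speech LLM's state decision: start speaking once a
    silence frame ("SIL") appears in the stream, else keep listening."""
    return SPEAK if "SIL" in context else LISTEN

def full_duplex_loop(user_frames, policy=toy_policy):
    """Consume incoming frames and the model's own outputs in one
    autoregressive stream, switching state on each emitted state token."""
    context, trace, reply = [], [], []
    state = LISTEN
    for frame in user_frames:
        context.append(frame)
        decision = policy(context)
        if decision != state:       # synchronous dialogue state transition
            state = decision
        trace.append(state)
        if state == SPEAK:
            reply.append(f"resp({frame})")  # placeholder speech tokens
    return trace, reply

trace, reply = full_duplex_loop(["hi", "there", "SIL", "..."])
```

Because the state decision lives inside the same autoregressive stream as the speech tokens, context-dependent behaviors (e.g., ignoring a backchannel but yielding to a real barge-in) can in principle be learned rather than hard-coded, which is the point of dropping the external modules.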

Experimental Validation

The experimental results demonstrate SALMONN-omni's superiority over competing open-source full-duplex models, with at least a 30% relative improvement on benchmark datasets for spoken question answering (QA) and open-domain dialogue. Notably, the model also handles complex conversational dynamics, managing turn-taking, backchanneling, and context-aware barge-ins reliably. Furthermore, reinforcement learning (RL) is leveraged to refine the model's handling of dialogue dynamics, further improving full-duplex response quality.

Comparative Advantages

A distinguishing feature of SALMONN-omni is its standalone operation without reliance on auxiliary modules, contrasting with models like VITA and Freeze-Omni, which necessitate additional processes to accommodate simultaneous listening and speaking tasks. This streamlined architecture mitigates the issues of added complexity and error propagation, which frequently undermine modular systems.

Implications and Future Directions

SALMONN-omni’s advancements carry substantial practical implications for deploying conversational AI systems. Its codec-free design challenges the conventional dependence on extensive datasets and complex pre-training pipelines, offering promising scalability and adaptability for diverse speech interaction scenarios. The use of RL to refine dialogue dynamics suggests a compelling avenue for ongoing research, inviting exploration of adaptive learning and personalization in conversational AI frameworks.

The paper opens pathways for future inquiry into optimizing speech LLMs, particularly in enhancing emotional expressiveness and refining prosodic controls, which are essential for genuine human-like interactions. These efforts may further refine SALMONN-omni’s application across varied domains, from virtual assistants to automated customer service systems, underscoring its potential in advancing intelligent and empathetic machine communication.

In conclusion, the SALMONN-omni model represents a significant step forward in the pursuit of authentic, interactive speech communication. As researchers strive to refine and replicate its achievements, the innovations presented in this paper herald promising developments for the AI community at large.
