- The paper presents Hibiki, a decoder-only nested Transformer model that synchronizes text and audio token predictions for real-time speech translation.
- It introduces contextual alignment, which uses the perplexity of a pretrained text translation model to decide, word by word, when enough source context has accumulated to translate accurately at low latency.
- Hibiki achieves state-of-the-art ASR-BLEU scores and improved speaker similarity and naturalness, while its simple inference supports large-batch serving and real-time on-device deployment.
High-Fidelity Simultaneous Speech-to-Speech Translation with Hibiki
The paper "High-Fidelity Simultaneous Speech-To-Speech Translation" introduces Hibiki, a significant advancement in the domain of simultaneous speech translation (S2ST). Hibiki presents a decoder-only model capable of both speech-to-text (S2TT) and speech-to-speech (S2ST) translation using a multistream LLM architecture. This approach addresses the inherent complexities of simultaneous interpretation, particularly the need for models to adaptively accumulate sufficient context to ensure accurate real-time translation.
Methodology and Architectural Innovations
Hibiki leverages a nested Transformer architecture that synchronizes the processing of source and target speech streams. At each step it predicts a hierarchy of text and audio tokens, so translated text and translated speech are produced concurrently. Because the model is decoder-only, inference amounts to ordinary autoregressive sampling at a fixed frame rate, which keeps the simultaneous translation loop simple and efficient; a minimal sketch of one decoding step follows.
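To make the token layout concrete, here is a minimal sketch of one decoding step in a multistream decoder of this kind. This is not the authors' code: the module names, the use of a GRU as a stand-in for the small depth Transformer, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultistreamStep(nn.Module):
    def __init__(self, d_model=512, n_codebooks=8, text_vocab=32000, audio_vocab=2048):
        super().__init__()
        # Large temporal model over frames; a real system would use a deep
        # causal Transformer rather than a single layer.
        self.temporal = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Small stand-in for the depth model that walks the audio codebooks.
        self.depth = nn.GRU(d_model, d_model, batch_first=True)
        self.text_head = nn.Linear(d_model, text_vocab)
        self.audio_heads = nn.ModuleList(
            nn.Linear(d_model, audio_vocab) for _ in range(n_codebooks)
        )

    def forward(self, frame_embeddings):
        # frame_embeddings: (batch, frames, d_model), one summed embedding per
        # frame combining source audio, target text, and target audio streams.
        h = self.temporal(frame_embeddings)[:, -1:, :]   # context at the newest frame
        text_logits = self.text_head(h[:, 0])            # 1) next text token
        audio_logits = []
        z = h
        for head in self.audio_heads:                    # 2) then the audio hierarchy
            z, _ = self.depth(z)
            audio_logits.append(head(z[:, 0]))
        return text_logits, audio_logits

step = MultistreamStep()
text_logits, audio_logits = step(torch.randn(2, 10, 512))
print(text_logits.shape, len(audio_logits))  # torch.Size([2, 32000]) 8
```

The key point is the ordering within a frame: the temporal context first yields a text token, and the audio codebook tokens are then predicted hierarchically from that same per-frame context.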
A critical innovation in Hibiki is its "contextual alignment" technique. This method uses the perplexity of an off-the-shelf text translation model to derive word-level alignments between source and target, ensuring that each target word is emitted only once enough source context is available, and therefore with minimal latency. The significance of this alignment is that it lets Hibiki modulate its translation flow through the training data alone, without the complex inference-time policies common in other simultaneous translation systems.
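The rule can be stated compactly: a target word is safe to emit once the translation model is nearly as confident in it given a source prefix as it is given the whole source. Below is a toy sketch of that idea; `word_nll`, the `slack` threshold, and the demo scorer are illustrative assumptions rather than the paper's exact procedure.

```python
from typing import Callable

def contextual_alignment(
    source: list[str],
    target: list[str],
    word_nll: Callable[[list[str], list[str], str], float],
    slack: float = 0.1,
) -> list[int]:
    """For each target word, return the smallest source-prefix length after
    which the MT model is almost as confident in that word as it is with
    the full source available (lower NLL = more confident)."""
    alignments = []
    for j, word in enumerate(target):
        full_context = word_nll(source, target[:j], word)
        k = len(source)
        for prefix_len in range(1, len(source) + 1):
            if word_nll(source[:prefix_len], target[:j], word) <= full_context + slack:
                k = prefix_len
                break
        alignments.append(k)
    return alignments

# Toy scorer: confidence jumps once the matching source word is visible.
def toy_nll(src_prefix, tgt_prefix, word):
    lookup = {"chat": "cat", "noir": "black"}
    return 0.5 if any(lookup.get(s) == word for s in src_prefix) else 5.0

print(contextual_alignment(["le", "chat", "noir"], ["the", "black", "cat"], toy_nll))
# [1, 3, 2]: "black" must wait for "noir", "cat" only needs "chat"
```

In the paper, alignments of this kind govern how much the synthetic target speech is delayed relative to the source when the training data is constructed.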
Another noteworthy aspect of Hibiki is its handling of training data: it synthesizes parallel audio pairs by machine-translating transcripts of monolingual corpora and re-synthesizing the translations as speech, a weakly supervised recipe in which the contextual alignments determine how far the target must lag the source at the word level. Hibiki additionally conditions training examples on a speaker-similarity label and applies classifier-free guidance over that condition at inference, strengthening the preservation of the source speaker's voice.
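Classifier-free guidance here follows the standard formulation: run the model with and without the speaker-similarity condition and extrapolate the logits toward the conditioned direction. The sketch below is a generic illustration; the `speaker_condition` keyword, the stub model, and the guidance strength `gamma` are assumptions, not the paper's interface.

```python
import torch

def guided_logits(model, tokens, gamma: float = 3.0) -> torch.Tensor:
    """Standard CFG: amplify the effect of the 'similar voice' condition."""
    cond = model(tokens, speaker_condition="very_similar")  # conditioned pass
    uncond = model(tokens, speaker_condition=None)          # unconditioned pass
    return uncond + gamma * (cond - uncond)

# Trivial stand-in model so the sketch runs end to end.
def model(tokens, speaker_condition=None):
    base = torch.zeros(8)
    return base + 1.0 if speaker_condition else base

print(guided_logits(model, tokens=None))  # tensor of 3.0s: the condition is amplified
```

With `gamma > 1`, sampling is pushed further toward outputs consistent with the "similar voice" condition than conditioning alone would achieve.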
Hibiki is evaluated across a range of metrics: translation quality, speaker similarity, naturalness, and latency. It is benchmarked against state-of-the-art systems such as Seamless and StreamSpeech on both short-form (CVSS-C) and long-form (Audio-NTREX) speech.
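The headline translation metric, ASR-BLEU, is computed by transcribing the generated target speech with an off-the-shelf ASR system and scoring the transcripts against reference translations. A minimal sketch follows, assuming a placeholder `asr_transcribe` function; `sacrebleu` and its `corpus_bleu` call are a real library API.

```python
import sacrebleu

def asr_bleu(generated_wavs, reference_texts, asr_transcribe):
    """Transcribe generated speech, then score transcripts against references."""
    hypotheses = [asr_transcribe(wav) for wav in generated_wavs]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score

# With a perfect transcription of a perfect translation, BLEU is 100.
score = asr_bleu(
    ["utt0.wav"],
    ["the black cat sat on the mat"],
    asr_transcribe=lambda wav: "the black cat sat on the mat",
)
print(score)  # 100.0
```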
In terms of translation quality, Hibiki's ASR-BLEU surpasses prior models in both simultaneous and offline settings, and on long-form data the margin over competing systems widens substantially. Human raters likewise favored Hibiki for speaker similarity and audio naturalness, assigning it higher mean opinion scores than Seamless, a system known for its expressive voice transfer.
Hibiki's architecture also confers scalability advantages: because inference is plain autoregressive sampling, hundreds of sequences can be batched on a single standard GPU, making the model suitable for real-time serving. A distilled version, Hibiki-M, runs in real time on a mobile device, demonstrating that the approach extends to on-device deployment.
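Batching is straightforward because decoding is frame-synchronous: every stream consumes one source frame and emits one target frame per step, so a whole batch shares each forward pass. Below is a sketch under assumed names (`init_state`, `step`, the stub model); the 12.5 Hz frame rate is the Mimi codec rate the model family operates at.

```python
import torch

FRAME_RATE_HZ = 12.5  # Mimi codec frames: one step every 80 ms

def stream_batch(model, source_frames: torch.Tensor) -> torch.Tensor:
    """source_frames: (batch, n_frames, d). Each loop iteration advances every
    sequence in the batch by one frame, sharing a single forward pass."""
    state = model.init_state(source_frames.shape[0])
    outputs = []
    for t in range(source_frames.shape[1]):
        out, state = model.step(source_frames[:, t], state)
        outputs.append(out)
    return torch.stack(outputs, dim=1)

class EchoModel:
    """Trivial stand-in that echoes input frames, just to show the loop shape."""
    def init_state(self, batch_size): return None
    def step(self, frame, state): return frame, state

y = stream_batch(EchoModel(), torch.randn(4, 25, 512))
print(y.shape)  # torch.Size([4, 25, 512]): 25 frames = 2 seconds at 12.5 Hz
```

Because there is no per-sequence search or policy, adding streams to the batch changes only the batch dimension, which is what makes serving hundreds of concurrent translations on one GPU practical.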
Theoretical and Practical Implications
Hibiki's design carries several implications for real-time translation research. Practically, its architecture and training recipe offer a blueprint for building simultaneous translation systems that preserve speaker identity. Theoretically, its combination of multistream modeling and contextual alignment shows that translation latency and alignment can be handled by the data and architecture themselves, without sacrificing quality to a separate decision policy.
The research also establishes a precedent for weakly supervised alignment of translated speech, which could encourage work on more linguistically diverse and technically demanding translation tasks. Hibiki's results further underscore the value of modern neural audio codecs in S2ST, opening the door to cross-lingual dialogue systems and broader multimodal translation settings.
Conclusion
Hibiki represents a substantial step in the evolution of simultaneous speech translation, combining efficiency and quality across output modalities while approaching the voice fidelity and naturalness of human interpreters. The model not only advances machine translation capabilities but also sets a precedent for multilingual communication and interpretation applications, both server-side and on device.