Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion

Published 25 Feb 2026 in cs.CL | (2602.21646v1)

Abstract: Multimodal LLMs (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results. Furthermore, on general machine translation datasets, particularly FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirm that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at https://github.com/yxduir/LLM-SRT.


Authors (10)

Summary

The paper introduces a scalable speech-guided machine translation framework that fuses synthetic speech with text to achieve superior translation quality across 28 languages.
It leverages a multimodal large language model and a self-evolution loop that uses COMET metrics for iterative refinement while reducing human annotation.
Experimental results show robust BLEU and spBLEU improvements, especially for low-resource languages, outperforming significantly larger models.

Motivation and Framework Overview

The paper addresses two limitations of image-guided multimodal machine translation (MMT): the scarcity of multilingual image-text data and the restricted language coverage that follows from it. The authors instead use speech as the auxiliary modality, exploiting its natural alignment with text and the abundance of multilingual speech datasets. The proposed Speech-guided Machine Translation (SMT) framework pairs a multimodal LLM (MLLM) with a text-to-speech (TTS) system through a self-evolution mechanism, enabling scalable translation across 28 languages.

The SMT pipeline synthesizes speech from input text using TTS, then jointly processes the paired text and audio inputs in the MLLM. The framework iteratively improves through a self-evolution loop, wherein synthetic speech samples are classified by utility for translation (using COMET metrics) and used for continual model refinement.
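
The inference path described above (synthesize speech from the source text, then feed text and audio together to the MLLM) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `FusedInput`, `translate_with_speech`, and the `tts` and `mllm` callables are hypothetical names, and the toy stand-ins at the bottom only mimic the data flow.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Waveform = Sequence[float]  # assumption: audio represented as a sequence of samples


@dataclass(frozen=True)
class FusedInput:
    """Source text paired with speech synthesized from that same text."""
    text: str
    audio: Waveform
    src_lang: str


# Hypothetical interfaces; the real TTS and MLLM APIs are not specified in this summary.
TTS = Callable[[str, str], Waveform]      # (text, language) -> audio
MLLM = Callable[[FusedInput, str], str]   # (fused input, target language) -> translation


def translate_with_speech(text: str, src_lang: str, tgt_lang: str,
                          tts: TTS, mllm: MLLM) -> str:
    audio = tts(text, src_lang)                           # step 1: synthesize speech
    fused = FusedInput(text=text, audio=audio, src_lang=src_lang)
    return mllm(fused, tgt_lang)                          # step 2: joint text + audio input


# Toy stand-ins so the sketch runs end to end (no real models involved).
def fake_tts(text: str, lang: str) -> Waveform:
    return [float(len(word)) for word in text.split()]


def fake_mllm(fused: FusedInput, tgt_lang: str) -> str:
    return f"[{fused.src_lang}->{tgt_lang}] {fused.text} ({len(fused.audio)} audio frames)"


result = translate_with_speech("a dog runs", "eng", "deu", fake_tts, fake_mllm)
print(result)
```

Passing the TTS and MLLM in as callables keeps the sketch independent of any particular model library; in a real system they would wrap the actual synthesizer and the speech-text model.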

Methodology

Modality-Agnostic Hypothesis

The authors formalize the modality-agnostic hypothesis for MMT: any auxiliary modality (speech, image, etc.) can enhance translation if it provides semantically relevant information and if its representation can be aligned and jointly optimized with text features in a shared latent space.

Multi-Stage MLLM Pre-training

The MLLM architecture leverages Whisper-large-v3 as the speech encoder, a Q-Former+MLP speech adapter, and GemmaX2-28-9B as the LLM backbone. The curriculum learning pipeline proceeds through:

ASR pre-training for speech-text alignment,
Speech-to-text translation for cross-lingual cross-modality mapping,
Joint speech-text machine translation.

Only the speech adapter is trainable, at approximately 80.5M parameters; the speech encoder and LLM backbone stay frozen, and the total model size is ~10B parameters.
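
To make the training setup concrete, the three curriculum stages and the trainable share of the model can be written down as plain data. This is an illustrative sketch: the `Stage` class and the stage identifiers are hypothetical, and the only numbers used are the 80.5M adapter size and the ~10B total quoted above.

```python
from dataclasses import dataclass

# Figures quoted in the text above.
ADAPTER_PARAMS = 80.5e6   # Q-Former + MLP speech adapter
TOTAL_PARAMS = 10e9       # approximate size of the full model

# Component -> is it updated during training? Only the adapter is.
TRAINABLE = {
    "speech_encoder (Whisper-large-v3)": False,
    "speech_adapter (Q-Former + MLP)": True,
    "llm_backbone (GemmaX2-28-9B)": False,
}


@dataclass(frozen=True)
class Stage:
    name: str       # hypothetical identifier for the curriculum stage
    inputs: tuple   # modalities fed to the model
    output: str     # what the model is trained to produce


CURRICULUM = (
    Stage("asr_pretraining", ("speech",), "transcript"),
    Stage("speech_to_text_translation", ("speech",), "translation"),
    Stage("joint_speech_text_mt", ("speech", "text"), "translation"),
)

trainable_fraction = ADAPTER_PARAMS / TOTAL_PARAMS

for step, stage in enumerate(CURRICULUM, start=1):
    print(f"stage {step}: {stage.name}: {' + '.join(stage.inputs)} -> {stage.output}")
print(f"trainable share of parameters: {trainable_fraction:.2%}")
```

Under these figures the adapter accounts for a little under 1% of all parameters, which is why the curriculum can be run without updating the encoder or the LLM.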

Self-Evolution Mechanism

The self-evolution module consists of four phases:

Experience Acquisition: TTS synthesizes diverse speech samples from multilingual text.
Experience Refinement: Speech-text pairs are labeled positive if the joint-modality input improves translation, i.e. the COMET score with the joint input (S2) exceeds the score without it (S1); otherwise they are labeled negative.
Model Updating: Continuous fine-tuning on positive samples to reinforce beneficial cross-modal interactions.
Evaluation: Performance measured on validation set; loop continues until convergence.

This framework autonomously generates and filters synthetic data, significantly reducing dependence on human annotation and improving generalization, especially in low-resource languages.
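
The four phases can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the scoring, fine-tuning, and validation functions are hypothetical callables supplied by the caller, the S2 > S1 labeling rule is taken from the description above, and the stopping rule (validation gain below a tolerance) is an assumed stand-in for "until convergence".

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # assumption: (source text, reference translation)


def label_positive(s1: float, s2: float) -> bool:
    """Positive when the joint speech+text score S2 beats the score S1 without speech."""
    return s2 > s1


def self_evolve(samples: Sequence[Sample],
                score_text_only: Callable[[Sample], float],
                score_fused: Callable[[Sample], float],
                fine_tune: Callable[[List[Sample]], None],
                validate: Callable[[], float],
                max_rounds: int = 5,
                tol: float = 0.1) -> List[float]:
    """Return the validation score before training and after each round."""
    best = validate()
    history = [best]
    for _ in range(max_rounds):
        # Experience refinement: keep only samples where speech helps.
        positives = [s for s in samples
                     if label_positive(score_text_only(s), score_fused(s))]
        if not positives:
            break
        fine_tune(positives)          # model updating on positive samples only
        current = validate()          # evaluation on the validation set
        history.append(current)
        if current - best < tol:      # assumed convergence test
            break
        best = current
    return history


# Toy stand-ins: a fake "model" whose validation score rises by scripted amounts.
state = {"comet": 80.0}
gains = iter([2.0, 1.0, 0.0])
positives_per_round: List[int] = []


def toy_fine_tune(positives: List[Sample]) -> None:
    positives_per_round.append(len(positives))
    state["comet"] += next(gains)


toy_samples = [("hello", "hallo"), ("good morning", "guten morgen"), ("cat", "katze")]
history = self_evolve(
    toy_samples,
    score_text_only=lambda s: 0.5,
    score_fused=lambda s: 0.6 if len(s[0]) > 3 else 0.4,
    fine_tune=toy_fine_tune,
    validate=lambda: state["comet"],
)
print(history)
```

In the toy run, two of the three samples are labeled positive in every round, and the loop stops in the third round once the validation score no longer improves.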

Experimental Results

Benchmarks and Metrics

Evaluations are conducted on Multi30K (image-text), FLORES-200 (general MT), CoVoST-2 (speech-text), and WMT24++. Metrics are BLEU, spBLEU (tokenized with flores200), and COMET, ensuring comparability with state-of-the-art baselines.

Performance Highlights

Multi30K (MMT): SMT-9B attains a BLEU of 47.0 for eng→deu and 67.0 for eng→fra, consistently surpassing text-only and image-guided MMT models. Average BLEU improvement is +2.1 over the previous SOTA.
FLORES-200 (MT): SMT-9B achieves SOTA performance in 108 directions, including low-resource pairs (e.g., khm, lao, mya). The average spBLEU score (eng→xx) is 40.4, with COMET gains particularly pronounced in low-resource directions.
Notably, SMT-9B outperforms DeepSeek-V3-671B, a model about 67 times larger, which points to the efficacy of modality fusion over sheer model size.
Ablation Studies: Translation quality remains stable when using synthetic instead of authentic speech, due to reduced background noise and strong semantic consistency. The self-evolution mechanism provides further gains (+1.9 COMET in khm, +2.0 in lao, +1.7 in mya at round 3).
Human evaluation shows under-translation errors falling from 5.2% to 3.5%, suggesting that prosodic cues enable more effective attention alignment.
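
Two of the figures above are easier to interpret with the arithmetic spelled out. The short sketch below uses only numbers quoted in this summary (671B versus the ~10B total model size, and the 5.2% and 3.5% under-translation rates); the helper names are illustrative.

```python
def scale_ratio(large_params_b: float, small_params_b: float) -> float:
    """How many times larger one model is than another (sizes in billions)."""
    return large_params_b / small_params_b


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# DeepSeek-V3-671B versus the ~10B total size of SMT-9B.
ratio_vs_total = scale_ratio(671, 10)

# Under-translation error rate from human evaluation, in percent.
drop = relative_reduction(5.2, 3.5)

print(f"size ratio vs ~10B total: {ratio_vs_total:.1f}x")
print(f"relative reduction in under-translation: {drop:.1%}")
```

The "67 times larger" figure matches the ~10B total model size rather than the 9B backbone alone, and the 1.7-point drop in under-translation corresponds to roughly a one-third relative reduction.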

Implications and Future Directions

Practical Impact

The SMT framework substantially increases scalability and language coverage for MMT, transcending the limitations of image-based methods. The self-evolution mechanism reduces human annotation requirements and enables rapid adaptation to new languages and domains. The robustness to synthetic speech further enables deployment in environments where authentic speech recordings are unavailable.

Theoretical Insights

The work suggests that modality fusion (here, text combined with speech and its prosody) can improve semantic alignment and translation faithfulness in ways that increasing model scale alone does not. Translation quality is therefore not simply a function of parameter count, which motivates further exploration of rich multimodal signals beyond vision.

Future Developments

Extension to additional modalities (e.g., music, environmental audio) could further improve translation in domain-specific tasks (e.g., film subtitles).
Advances in open-source multilingual TTS will increase the reachable language set.
Automated discovery and alignment of prosodic features could yield further performance gains.
Integrating self-evolution with reinforcement learning or active learning could enable real-time improvement and adaptation.

Conclusion

The paper presents a robust, scalable, and generalizable Speech-guided Machine Translation framework leveraging self-evolution and synthetic speech. Empirical results across established benchmarks validate substantial improvements in translation quality, language coverage, and robustness, particularly in low-resource settings. The framework establishes a new paradigm for multilingual multimodal translation, with strong implications for efficient deployment and future modality extensions.

Paper: "Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion" (2602.21646)
