
Hearing to Translate: Multimodal Speech Translation

Updated 20 December 2025
  • Hearing to Translate is a computational framework that integrates ASR, MT, LLMs, and TTS to convert spoken input into translated output while retaining speaker attributes and prosody.
  • It compares cascaded systems and direct end-to-end SpeechLLMs, using metrics such as WER, BLEU, MOS, and spatial fidelity to evaluate real-time latency and translation quality.
  • The paradigm supports diverse applications, from multilingual communication and hearing-impaired assistive tech to spatial telepresence with binaural audio rendering.

Hearing to Translate encompasses the computational, linguistic, and perceptual methods by which spoken language input is processed and mapped to a translation in another language, often in real time, and increasingly with preservation of speaker attributes, spatial cues, or prosodic information. This paradigm integrates advances in automatic speech recognition (ASR), machine translation (MT), large language models (LLMs), and expressive text-to-speech (TTS) synthesis. Modern systems operate in both cascaded and end-to-end configurations and are being extended to serve multilingual, hearing-impaired, and spatial telepresence contexts. This article reviews current methodologies, system architectures, mathematical and algorithmic principles, evaluation regimes, and emergent engineering and research challenges.

1. System Architectures and Integration Paradigms

Hearing to translate systems are chiefly realized as: (a) cascaded architectures—serial ASR, sentence segmentation/cleanup via LLMs, MT (LLM), and TTS, often with voice cloning—and (b) direct end-to-end (E2E) models or SpeechLLMs, where speech is directly projected into an LLM or decoder space for translation.

Cascaded Systems

A canonical example is the open-source pipeline comprising:

  • Voice Activity Detection (VAD) (e.g., Silero VAD, roughly 30× real time on CPU),
  • ASR (e.g., Whisper-large-v3-turbo, 1.55B parameters, median WER 4.5% on Europarl EN→ES),
  • Sentence segmentation/cleaning LLM (e.g., LLaMA-3.3-70B-Instruct),
  • MT LLM (same backbone, fine-tuned for multilingual direction),
  • Non-autoregressive TTS (MeloTTS), with voice cloning via full-model retraining (30 min target audio, 56 h A100 training, MOS 4.2).

Average end-to-end latency is 2.5 s (peaking at 5 s for heterogeneous speech), with component times of VAD <0.1 s, ASR 0.8 s, segmentation 0.6 s, translation 0.7 s, and TTS 0.4 s. ASR and TTS can be deployed on a mid-range GPU, while the LLM components may require cloud resources (Cámara et al., 3 Jul 2025).
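
The control flow of such a cascade and its per-stage latency accounting can be sketched as follows. This is a minimal sketch: the callables vad, asr, segmenter, translator, and tts are hypothetical stand-ins for Silero VAD, Whisper, the LLaMA-based segmentation/translation models, and MeloTTS, not their actual APIs.

```python
import time

def translate_utterance(audio, vad, asr, segmenter, translator, tts):
    """Run one audio chunk through the cascade: VAD -> ASR -> segmentation/cleanup -> MT -> TTS.

    The component callables are hypothetical wrappers; this sketch only shows
    control flow and per-stage latency accounting.
    """
    timings = {}

    def timed(name, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings[name] = time.perf_counter() - start
        return result

    speech = timed("vad", vad, audio)                     # keep only speech regions
    transcript = timed("asr", asr, speech)                # source-language text
    sentences = timed("seg", segmenter, transcript)       # cleaned, sentence-segmented text
    translation = timed("trans", translator, sentences)   # target-language text
    target_audio = timed("tts", tts, translation)         # synthesized (optionally cloned) voice

    # L_total = L_VAD + L_ASR + L_seg + L_trans + L_TTS
    timings["total"] = sum(timings.values())
    return target_audio, timings
```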

SpeechLLMs and Direct E2E Models

SpeechLLMs extend LLMs with a speech encoder and a modality adapter (MA), e.g., Phi-4-Multimodal, Qwen2-Audio, Voxtral, Spire (Papi et al., 18 Dec 2025). Architectures typically comprise:

  • Speech encoder (Conformer, Whisper, HuBERT),
  • Adapter: linear/MLP or Q-former, projecting speech encoder output to LLM embedding space,
  • (Optionally) hybrid inclusion of ASR transcript alongside embeddings (DeSTA2).

Training often involves initial ASR pre-alignment (adapter + LLM frozen), then supervised ST, sometimes followed by direct preference optimization (DPO). Direct models deliver robust performance on noisy, code-switched, or disfluent speech but, as of current benchmarks, generally do not outperform strong cascades on aggregate ST quality (Papi et al., 18 Dec 2025).
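
To make the composition concrete, here is a minimal PyTorch sketch of a SpeechLLM with a simple MLP adapter. The class names, the frozen-encoder/frozen-LLM setup, and the prompt-embedding handling are illustrative assumptions, not the internals of Phi-4-Multimodal, Qwen2-Audio, or Voxtral.

```python
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """MLP adapter: projects speech-encoder frames (d_s) into the LLM embedding space (d_lm)."""
    def __init__(self, d_s: int, d_lm: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_s, d_lm), nn.GELU(), nn.Linear(d_lm, d_lm))

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # (batch, frames, d_s) -> (batch, frames, d_lm)
        return self.proj(speech_feats)

class SpeechLLMSketch(nn.Module):
    """Speech encoder + adapter + decoder-only LLM; encoder and LLM frozen, adapter trainable."""
    def __init__(self, speech_encoder: nn.Module, llm: nn.Module, d_s: int, d_lm: int):
        super().__init__()
        self.speech_encoder = speech_encoder.requires_grad_(False)
        self.llm = llm.requires_grad_(False)
        self.adapter = ModalityAdapter(d_s, d_lm)

    def forward(self, waveform: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
        speech_feats = self.speech_encoder(waveform)   # (B, T, d_s) frame-level features
        speech_embeds = self.adapter(speech_feats)     # projected into the LLM embedding space
        # Prepend the projected speech frames to the already-embedded text prompt
        # and let the LLM decode the translation autoregressively.
        inputs_embeds = torch.cat([speech_embeds, prompt_embeds], dim=1)
        return self.llm(inputs_embeds)                 # logits over target-language tokens
```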

Specialized and Assistive Configurations

Multimodal and assistive systems for impaired speech use audio-visual “Omni-Model” architectures, fusing high-frame-rate lip video (patchified, encoded by a vision transformer, and 3D-resampled) with speech features in a shared LLM (e.g., HI-TransPA) (Ma et al., 13 Nov 2025).

Spatial telepresence setups employ binaural headsets with joint source separation, localization, and real-time speech-to-text/text-to-speech (S2T/T2S) pipelines, rendering translated speech at the correct azimuths (preserving ITD/ILD cues) and supporting multi-speaker environments (Chen et al., 25 Apr 2025, Geleta et al., 12 Nov 2025).

2. Mathematical and Algorithmic Foundations

Cascaded Pipeline Metrics and Equations:

  • End-to-End Latency: $L_{\rm total} = L_{\rm VAD} + L_{\rm ASR} + L_{\rm seg} + L_{\rm trans} + L_{\rm TTS}$ (average 2.5 s).
  • Word Error Rate (WER): $\mathrm{WER} = \frac{S + D + I}{N_\text{ref}}$ (median 4.5%); a minimal computation sketch follows this list.
  • BLEU Score: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$ (median 0.5).
  • COMET Score: median 0.75.
  • Voice Cloning MOS (1–5 scale): average 4.20 (Cámara et al., 3 Jul 2025).
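
To make the WER definition above concrete, the following sketch counts substitutions, deletions, and insertions via word-level edit distance; it is illustrative, not the scoring code used in the cited evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N_ref, computed from word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])   # substitution (or match)
            dele = dp[i - 1][j] + 1                               # deletion
            ins = dp[i][j - 1] + 1                                # insertion
            dp[i][j] = min(sub, dele, ins)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over four reference words -> WER = 0.25
print(word_error_rate("the cat sat down", "the cat sat up"))
```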

Speech-to-Text Model Internals:

Early E2E ST models use hierarchical bidirectional LSTM encoders (three layers for speech, with frame subsampling to keep sequence lengths tractable) and additive or convolutional attention; the decoder is a unidirectional LSTM. Training minimizes token-level cross-entropy. Attention learns implicit alignment in the absence of source transcripts, enabling transcript-free ST and achieving 90% of pipeline BLEU on synthetic data with small held-out test sets (Berard et al., 2016).
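
As a concrete illustration of frame subsampling, here is a minimal PyTorch sketch of a subsampling BiLSTM speech encoder. The layer sizes and the drop-every-other-frame scheme are illustrative choices, not the exact configuration of Berard et al. (2016).

```python
import torch
import torch.nn as nn

class SubsamplingBiLSTMEncoder(nn.Module):
    """Stack of BiLSTM layers that halves the time axis between layers,
    so a T-frame input comes out roughly T/4 frames long."""
    def __init__(self, feat_dim: int = 80, hidden: int = 256, layers: int = 3):
        super().__init__()
        self.lstms = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(layers):
            self.lstms.append(nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True))
            in_dim = 2 * hidden  # bidirectional output feeds the next layer

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim), e.g. log-mel filterbank frames
        x = feats
        for i, lstm in enumerate(self.lstms):
            x, _ = lstm(x)
            if i < len(self.lstms) - 1:
                x = x[:, ::2, :]  # drop every other frame between layers
        return x  # (batch, ~T/4, 2 * hidden), attended over by an LSTM decoder

enc = SubsamplingBiLSTMEncoder()
out = enc(torch.randn(2, 400, 80))
print(out.shape)  # torch.Size([2, 100, 512])
```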

Modality Adapter and Representation Dynamics:

In SpeechLLMs, the adapter transforms the speech-encoder output $h_\text{speech} \in \mathbb{R}^{d_s}$ via $z = W h_\text{speech} + b$, with $W \in \mathbb{R}^{d_\text{LM} \times d_s}$ and $b \in \mathbb{R}^{d_\text{LM}}$. Whether this produces a semantic interlingua (with Whisper-trained encoders, yielding high SpokenSTS) or phonetic representations (with ASR-only encoders, which map source phones to English orthographic tokens) is determined by encoder supervision (Ògúnrèmí et al., 2 Oct 2025).

Spatial and Binaural Rendering:

Rendering is executed by convolving translated speech $s(t)$ with HRTFs: $y_L(t) = \sum_{\tau} h_L(\tau; \theta)\, s(t - \tau)$, and analogously for the right ear. Additional reverberation may be simulated, and the native channel is attenuated (e.g., $x'_\text{native}(t) = x_\text{native}(t) \cdot 10^{-18/20}$, an 18 dB reduction) to foreground the translation. In spatial ST for hearables, accurate preservation of ITD/ILD through BRIR-based separation and HRTF-convolved rendering enables listeners to localize speakers and maintain scene integrity, even in noisy or reverberant environments (Chen et al., 25 Apr 2025, Geleta et al., 12 Nov 2025).
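
A minimal numpy sketch of this rendering step follows, assuming precomputed left/right head-related impulse responses for the estimated azimuth; HRTF interpolation and reverberation simulation are omitted.

```python
import numpy as np

def render_binaural(translated, native, hrir_left, hrir_right, native_atten_db=18.0):
    """Spatialize translated speech with per-ear HRIRs and duck the original channel.

    translated, native: mono float waveforms at the same sample rate;
    hrir_left, hrir_right: impulse responses for the target direction.
    Returns an (N, 2) stereo array.
    """
    # y_L(t) = sum_tau h_L(tau; theta) s(t - tau): discrete convolution per ear.
    left = np.convolve(translated, hrir_left)
    right = np.convolve(translated, hrir_right)

    n = min(len(left), len(right), len(native))
    left, right = left[:n], right[:n]

    # Attenuate the native (untranslated) channel, e.g. by 18 dB: x_native * 10^(-18/20).
    ducked = native[:n] * 10.0 ** (-native_atten_db / 20.0)

    stereo = np.stack([left + ducked, right + ducked], axis=-1)
    peak = np.max(np.abs(stereo))
    return stereo / peak if peak > 1.0 else stereo  # simple clipping guard
```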

3. Evaluation Protocols and Empirical Benchmarks

Benchmarks and Task Diversity:

The comprehensive “Hearing to Translate” suite (Papi et al., 18 Dec 2025) assesses:

  • 5 SpeechLLMs, 16 cascades (speech foundation model (SFM) + LLM), and 4 pure SFMs,
  • 16 benchmarks spanning FLEURS, CoVoST2, EuroParl-ST, WMT (generic), WinoST (gender bias), CommonAccent, ManDi (accent), CS-Dialogue, CS-FLEURS (code-switch), LibriStutter (disfluency), NEuRoparlST (named entities), NoisyFLEURS (noise), EmotionTalk/mExpresso (emotion/style), ACL60/MCIF (long-form).

Metrics:

  • xCOMET-QE_S, MetricX-QE_S: reference-free QE metrics (higher is better, remapped for comparability).
  • Gaps: Gender (Δ), accent, and length (computed as percentage drops).
  • Named-entity accuracy: $A_{\mathrm{NE}} = M_{\mathrm{NE}} / |\mathrm{NE}|$ (see the sketch after this list).
  • BLEU: reported for legacy comparability.
  • MOS: subjective listening tests for TTS quality, spatial clarity, and immersion (e.g., MOS 4.20 on cloned voices; spatial rendering improved comprehension to 75.8% vs. 59.5% diotic) (Geleta et al., 12 Nov 2025).
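
For concreteness, the gap and named-entity metrics above reduce to simple ratios. The sketch below uses illustrative numbers only, not benchmark results, and the function names are assumptions.

```python
def percentage_gap(score_easy: float, score_hard: float) -> float:
    """Percentage drop from an 'easy' condition (e.g., native accent, short-form)
    to a 'hard' one (e.g., strong accent, long-form): 100 * (easy - hard) / easy."""
    return 100.0 * (score_easy - score_hard) / score_easy

def named_entity_accuracy(matched_entities: int, total_entities: int) -> float:
    """A_NE = M_NE / |NE|: share of reference named entities preserved in the translation."""
    return matched_entities / total_entities if total_entities else 0.0

# Illustrative numbers only (not from the benchmark):
print(percentage_gap(93.0, 88.0))       # ~5.4% accent gap
print(named_entity_accuracy(42, 50))    # 0.84
```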

Key Findings:

  • The best cascades achieved xCOMET ≈ 93 (Seamless+Aya, Whisper+Gemma3, Canary+Aya, Voxtral); the top SpeechLLM (Voxtral) reached ≈ 92.7, and pure SFMs scored lower.
  • SpeechLLMs exhibit stronger robustness under noise, code-switching, and disfluency, while cascades prevail in overall quality, emotion, and long-form content (Papi et al., 18 Dec 2025).

4. Representation, Adaptation, and Error Propagation

Representation Dynamics:

Whisper-based encoders, with translation supervision, yield semantic English-based interlingua representations in MAs—enabling generalization to unseen input languages. Recognition-only encoders produce phonetic representations in English tokens, aiding transliteration or code-switched handling but limiting semantic transfer (Ògúnrèmí et al., 2 Oct 2025).

Model          Phone Acc.   Word Acc.   SpokenSTS (ρ)
Whisper enc.   84.3%        52.9%       0.47
SALMONN MA     69.7%        38.8%       0.63
Qwen2 enc.     84.3%        78.3%       0.09
Qwen2 MA       74.2%        69.7%       0.13

In these results, the MA outputs of Whisper-based models lose phone- and word-level classifiability but gain semantic similarity, consistent with their use as an interlingua.

Error Propagation and Correction:

Cascaded architectures are vulnerable to ASR errors propagating into MT, while E2E and direct SpeechLLM models potentially reduce error cascades, as disfluencies and non-linguistic utterances may be resolved in the latent space—especially apparent under noisy conditions. However, the lack of modularity in direct systems may limit targeted error correction or domain adaptation.

5. Real-Time, Multimodal, and Spatial Extensions

Deployment and Human-Factors Engineering:

Modern open-source systems enable both local and cloud API deployment, Bluetooth/FM audio dissemination, and integration with teleconferencing (Zoom via virtual audio devices). Spatial audio setups leverage binaural/virtual-source rendering, timbre differentiation for speaker identification, and dynamic gain control on original tracks (Cámara et al., 3 Jul 2025, Geleta et al., 12 Nov 2025).

Assistive and Multimodal Processing:

HI-TransPA (Ma et al., 13 Nov 2025) fuses audio and stabilized high-frame-rate lip videos, leveraging curriculum learning and a SigLIP+3D-Resampler vision stack, with metrics balancing character error rate and embedding similarity. It demonstrates that curriculum learning and lip-specialized modeling markedly improve comprehensive ST scores over audio-only or naive multimodal approaches.

Spatial Speech Translation for Hearables:

Systems such as those presented in “Spatial Speech Translation: Translating Across Space With Binaural Hearables” (Chen et al., 25 Apr 2025) demonstrate that joint source separation, localization, and expressive ST can process multi-speaker, noisy binaural input in real time (RTF < 1), rendering translated speech at correct azimuths. User studies show retained localization accuracy, spatial scene fidelity, and BLEU scores up to 22.07 in real-world interference.
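
The real-time constraint (RTF < 1) simply compares processing time to audio duration. A minimal sketch, with a placeholder workload standing in for separation, localization, and translation:

```python
import time

def real_time_factor(process_fn, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF < 1 is required for streaming use."""
    start = time.perf_counter()
    process_fn()                     # e.g., separate + localize + translate one audio chunk
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Placeholder workload (sleep) standing in for the actual pipeline:
rtf = real_time_factor(lambda: time.sleep(0.8), audio_seconds=1.0)
print(f"RTF = {rtf:.2f} -> {'real-time capable' if rtf < 1 else 'too slow'}")
```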

6. Limitations, Open Problems, and Future Directions

Cascaded SFMs + LLMs constitute the most reliable translation approach overall (Papi et al., 18 Dec 2025). SpeechLLMs are competitive under adverse conditions but require improved handling of emotion, long-form structure, and rare linguistic phenomena. Pure SFMs lag considerably, emphasizing the criticality of LLM integration. Key limitations:

  • Tradeoff between on-device speed and translation accuracy, necessitating further advances in model compression, distillation, and architectural design (Chen et al., 25 Apr 2025).
  • Residual error propagation in both modular and E2E models under high noise/reverberation.
  • Insufficient representation and modeling for prosody, emotion, and speaker individuality (especially cross-lingually translated TTS).
  • Limited corpus size and diversity for low-resource languages and multimodal (e.g., lip+speech) datasets.

Research directions include intensified pretraining on mixed-modality data, robust adapter and alignment mechanisms, expansion of dataset coverage for under-represented phenomena, real-time adaptation to auditory environment and multi-party context (dynamic speaker tracking/spatialization), and further development of assistive models integrating corrective feedback (e.g., for speech therapy, sign language, and progress monitoring) (Papi et al., 18 Dec 2025, Ma et al., 13 Nov 2025). Gradual refinement of spatial audio rendering and cross-modal co-adaptation is anticipated to underpin future advances in naturally immersive, accessible, and accurate hearing-to-translate systems.
