Voxtral Small: Multimodal Audio Chat Model
- Voxtral Small is a multimodal audio chat model that integrates speech and text processing using a three-part Transformer architecture for extended context handling.
- It employs a Whisper-based audio encoder with an efficient adapter layer and a Mistral decoder to achieve high accuracy in transcription, translation, and speech understanding tasks.
- Its training strategy, combining pretraining, supervised fine-tuning, and direct preference optimization, enables long-format, multi-turn dialogues with efficient local execution.
Voxtral Small is a multimodal audio chat model designed to process both speech and text inputs, while preserving high performance in language understanding and generation. Developed as one of two models in the Voxtral series, it employs a Transformer-based architecture to achieve state-of-the-art results across diverse audio and text benchmarks, with a focus on efficient local execution and support for extended context windows up to 32,000 tokens. This enables the model to process audio files up to 40 minutes in duration and engage in long-format, multi-turn conversations.
1. Architectural Overview
Voxtral Small follows a three-part Transformer-based architecture, optimized for multimodal understanding:
- Audio Encoder:
The audio encoder is based on the Whisper large‑v3 architecture. It converts raw waveform inputs into log-Mel spectrograms with 128 Mel bins and a hop length of 160. A convolutional stem downsamples the spectrogram by a factor of 2, followed by bidirectional self-attention layers, producing audio embeddings at a temporal resolution of 50 Hz. To work around Whisper’s inherent 30-second receptive field, longer audio inputs are segmented into independently processed 30-second chunks, with positional encodings reset per chunk before the chunk outputs are reassembled (see the sketches after the parameter table below).
- Adapter Layer:
An MLP-based adapter reduces the sequence length of the audio embeddings by a factor of 4, yielding a 12.5 Hz frame rate. This downsampling aligns the information content per audio token with that of typical text tokens in the language decoder, controlling memory and compute costs while preserving performance (a minimal adapter sketch also follows the parameter table below).
- Language Decoder:
The decoder leverages the Mistral Small 3.1 24B backbone, generating responses autoregressively. It is conditioned on the fused audio-text encodings, enabling flexible reasoning over multimodal context. The overall parameterization is approximately 24.3 billion parameters, segmented as follows:
| Module | Parameter Count |
|------------------|-----------------|
| Audio Encoder    | 640M            |
| Adapter          | 52M             |
| Text Embeddings  | 670M            |
| Language Decoder | 22.9B           |
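As a quick consistency check, the per-module parameter counts in the table sum to the quoted ~24.3B total:

```python
# Per-module parameter counts from the table above, in billions.
modules = {"audio_encoder": 0.640, "adapter": 0.052, "text_embeddings": 0.670, "language_decoder": 22.9}
print(f"{sum(modules.values()):.2f}B parameters")  # 24.26B, i.e. roughly 24.3B
```

The encoder's 50 Hz output rate and 30-second chunking follow directly from the figures above. The sketch below is a minimal Python illustration; the 16 kHz sample rate is an assumption (standard for Whisper-style encoders, but not stated in this section):

```python
SAMPLE_RATE = 16_000   # assumed Whisper-style 16 kHz input (not stated explicitly above)
HOP_LENGTH = 160       # spectrogram hop of 160 samples -> 100 mel frames per second
CONV_STRIDE = 2        # convolutional stem halves the temporal resolution
CHUNK_SECONDS = 30     # Whisper's native receptive field

encoder_hz = SAMPLE_RATE / HOP_LENGTH / CONV_STRIDE
print(encoder_hz)      # 50.0 embeddings per second

def chunk(waveform: list[float], sr: int = SAMPLE_RATE) -> list[list[float]]:
    """Split a long waveform into independently encoded 30-second chunks,
    each processed with its own (reset) positional encodings."""
    step = CHUNK_SECONDS * sr
    return [waveform[i:i + step] for i in range(0, len(waveform), step)]

print(len(chunk([0.0] * (5 * 60 * SAMPLE_RATE))))  # a 5-minute input -> 10 chunks
```

The adapter can likewise be illustrated with a frame-stacking MLP. This is a sketch under stated assumptions, not the released implementation: the two-layer shape, the GELU activation, and the 5120 decoder width are guesses (1280 is the Whisper large-v3 encoder width):

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Downsample 50 Hz audio embeddings to 12.5 Hz by concatenating every
    4 consecutive frames and projecting them to the decoder width (sketch)."""

    def __init__(self, encoder_dim: int = 1280, decoder_dim: int = 5120, stride: int = 4):
        super().__init__()
        self.stride = stride
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * stride, decoder_dim),  # assumed layer sizes, not the released config
            nn.GELU(),
            nn.Linear(decoder_dim, decoder_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape                                   # (batch, frames @ 50 Hz, encoder_dim)
        t = t - t % self.stride                             # drop frames that do not fill a group of 4
        x = x[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.mlp(x)                                  # (batch, frames / 4 @ 12.5 Hz, decoder_dim)

# A 30-second chunk: 1500 frames at 50 Hz -> 375 audio tokens after the adapter.
print(AudioAdapter()(torch.randn(1, 1500, 1280)).shape)     # torch.Size([1, 375, 5120])
```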
2. Training Regime
The training process for Voxtral Small consists of three sequential phases:
- Pretraining:
- Audio-to-text repetition: each audio segment is paired directly with its corresponding transcription, training the model for conventional speech recognition.
- Cross-modal continuation: each audio segment is followed by the text segment that comes after it, with the special tokens <repeat> and <next> used to disambiguate the two patterns. This pattern strengthens the model's capacity for question answering and dialogue over audio.
- Supervised Finetuning:
The model is further finetuned on both real and synthetic datasets, covering tasks that use audio as context for answering text queries (e.g., summarizing long-form audio into QA pairs) and tasks requiring direct responses to audio. A dedicated “transcribe mode,” triggered by a special token, specializes the model for pure transcription.
- Preference Alignment:
Direct Preference Optimization (DPO), including online variants, is applied using pairwise response preferences. Candidate responses are sampled from the model at a fixed sampling temperature, and a dedicated reward model, operating on transcribed audio, helps steer the model toward grounded, non-hallucinatory outputs.
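For reference, the sketch below implements only the standard offline DPO objective on pairwise preferences; the online variant, the transcript-based reward model, and the actual hyperparameters used for Voxtral are not reflected here (beta = 0.1 is purely illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: push the policy's log-probability margin between the preferred
    and rejected response above the frozen reference model's margin."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: per-sequence log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -13.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -12.0]))
print(loss.item())
```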
3. Performance and Efficiency
Voxtral Small demonstrates competitive or state-of-the-art results across a range of benchmarks:
- Speech Recognition:
Achieves low word error rates (WER) on English short-form (LibriSpeech) and multilingual (Mozilla Common Voice, FLEURS) benchmarks. Surpasses both open- and closed-source models such as GPT‑4o mini Audio and Gemini 2.5 Flash on multiple datasets.
- Speech Translation:
On the FLEURS Speech Translation benchmark, the model consistently obtains the highest BLEU scores across all source–target language pairs.
- Speech Understanding:
Matches or outperforms leading closed systems in spoken question answering tasks, including Llama QA, Openbook QA, and internally developed Speech Understanding (SU) benchmarks, surpassing GPT‑4o mini Audio on several evaluations.
- Text-Only Tasks:
Maintains text performance commensurate with the Mistral Small 3.1 baseline, supporting seamless operation on both audio and text tasks.
Optimizations such as adapter downsampling keep the 24.3B-parameter model practical for local deployment, avoiding excessive compute for extended audio sequences.
4. Long-Context Processing and Dialogue Management
A defining characteristic of Voxtral Small is its 32,000-token context window:
- Long Audio Handling:
The model manages audio inputs of up to 40 minutes (roughly 30,000 audio tokens at the adapter's 12.5 Hz rate, within the 32K window) by splitting the audio into 30-second chunks for parallel processing and reassembly. Absolute positional encodings are reset per chunk to maintain coherence over extended durations; the token arithmetic is sketched after this list.
- Multi-Turn Conversations:
The same context window enables uninterrupted, multi-turn audio–text dialogue. Persistent context retention is particularly valuable for complex reasoning tasks and extended user interactions, as the full dialogue history can be maintained.
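A back-of-the-envelope check, using only the figures already given (12.5 Hz audio tokens after the adapter, a 32,000-token window), shows how a 40-minute recording fits in context:

```python
AUDIO_TOKEN_RATE_HZ = 12.5            # adapter output rate
CONTEXT_TOKENS = 32_000               # decoder context window

audio_tokens = int(40 * 60 * AUDIO_TOKEN_RATE_HZ)
print(audio_tokens)                   # 30000 audio tokens for 40 minutes of audio
print(CONTEXT_TOKENS - audio_tokens)  # ~2000 tokens remain for the prompt and the response
```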
5. Technical Specifications and Modal Bridging
Technical details underpinning Voxtral Small’s efficiency and flexibility include:
- Downsampling via Adapter Layer:
Reduces memory and compute by mapping 50 Hz audio embeddings to 12.5 Hz, balancing sequence length against representational sufficiency.
- Pretraining Patterns:
The two pretraining paradigms, repetition and continuation signaled by the <repeat> and <next> tokens respectively, are essential in equipping the model for both transcription and audio understanding.
- Parameter Allocation:
The majority of parameters reside in the language decoder (22.9B), with significant but smaller contributions from audio processing and embedding modules.
- Training Pair Notation:
Pairs can be denoted as (audio, transcription) for the repetition pattern or (audio, continuation text) for the continuation pattern, with the respective pattern signaled by the relevant special token; both layouts are sketched below.
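A purely illustrative layout of the two training-pair patterns: the placeholder audio-token names and whitespace tokenization are assumptions, and only the roles of the <repeat> and <next> tokens come from the text above.

```python
REPEAT, NEXT = "<repeat>", "<next>"

def repetition_pair(audio_tokens: list[str], transcript: str) -> list[str]:
    # (audio, transcription): the audio is followed by its own transcript -> speech recognition
    return audio_tokens + [REPEAT] + transcript.split()

def continuation_pair(audio_tokens: list[str], following_text: str) -> list[str]:
    # (audio, continuation text): the audio is followed by the text that comes next -> cross-modal continuation
    return audio_tokens + [NEXT] + following_text.split()

audio = ["[AUDIO_0]", "[AUDIO_1]", "[AUDIO_2]"]   # stand-ins for adapter output tokens
print(repetition_pair(audio, "hello world"))
print(continuation_pair(audio, "and then she answered"))
```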
6. Benchmarking and Evaluation
Voxtral Small is assessed on a comprehensive suite of benchmarks:
- Standard Benchmarks:
Conventional datasets such as LibriSpeech, Mozilla Common Voice, and FLEURS are used for transcription and translation.
- Speech-Synthesized Benchmarks:
Popular text-based evaluations (GSM8K, TriviaQA, MMLU) are rendered as spoken audio via text-to-speech systems to test speech comprehension.
- Internal Speech Understanding Benchmarks:
- LLM_judge_score: a binary helpfulness metric.
- grade_LLM_judge_score: a graded scale from 0 to 5.
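For concreteness, a minimal sketch of aggregating the two internal judge metrics over a benchmark; the metric definitions come from the bullets above, while the simple averaging is an assumption:

```python
def llm_judge_score(verdicts: list[bool]) -> float:
    """LLM_judge_score: fraction of responses the LLM judge marks as helpful (binary)."""
    return sum(verdicts) / len(verdicts)

def grade_llm_judge_score(grades: list[int]) -> float:
    """grade_LLM_judge_score: mean per-response grade on a 0-5 scale."""
    assert all(0 <= g <= 5 for g in grades)
    return sum(grades) / len(grades)

print(llm_judge_score([True, True, False]))  # ~0.67
print(grade_llm_judge_score([5, 4, 3]))      # 4.0
```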
Results indicate that Voxtral Small often outperforms GPT‑4o mini Audio on speech understanding, with higher judge and grade metrics, while retaining strong results in transcription and translation tasks.
7. Summary and Significance
Voxtral Small integrates a Whisper-based audio encoder, efficient adapter-based downsampling, and a Mistral-based language decoder, supporting extended audio reasoning and dialogue within a 32K-token context. The combination of dual pretraining patterns, specialized finetuning, and Direct Preference Optimization aligns the model for high-fidelity performance across transcription, translation, and speech understanding. Its architecture and training regime position it as an efficient, locally deployable model that is competitive with both open- and closed-source alternatives in multimodal language and audio processing (Liu et al., 17 Jul 2025).