- The paper proposes an MLLM-based architecture that unifies audio and text modalities, decisively outperforming contrastive-only models.
- It leverages large-scale, diverse datasets and autoregressive objectives to achieve over 20% higher recall on compositional and context-dependent queries.
- The approach demonstrates robust zero-shot transfer and scalability, offering significant improvements for open-domain audio-text retrieval tasks.
Scaling Audio-Text Retrieval with Multimodal LLMs
This paper addresses the challenge of scaling audio-text retrieval in complex open-domain settings through the integration of multimodal LLMs (MLLMs). Prior progress in audio-text retrieval has largely relied on shallow contrastive learning frameworks, where independent encoders map audio and text into a shared embedding space. These models, though effective for simple semantics, encounter clear limitations in handling semantically compositional, context-dependent, and linguistically rich queries—an increasingly frequent need due to the proliferation of diverse audio resources and the rise of complex downstream tasks such as sound event retrieval, open-ended audio question answering, and audio caption grounding.
The motivation for leveraging MLLMs stems from their demonstrated capabilities in vision-language tasks, where autoregressive next-token prediction and explicit context modeling lead to semantically richer and more compositional representations. The hypothesis is that similar advances can be realized in audio-text retrieval if recent MLLM architectures are adapted and effectively scaled.
Methodology
The authors propose a scalable MLLM-based approach that unifies audio and language modalities at the representation and retrieval levels. The pipeline consists of several core innovations:
- Unified MLLM Architecture: The architecture augments existing LLMs with an audio encoder (using contemporary backbone models, e.g., audio transformers or PANNs), an adapter for modality projection, and cross-modal fusion modules. Rather than shallow latent alignment, the approach relies on masked language/audio modeling and autoregressive objectives to more deeply align semantics.
- Large-Scale Training Datasets: Training leverages a comprehensive set of weakly and strongly labeled audio-text pairs, including large curated and crawled resources, enabling high coverage of diverse audio events, environments, and compositional queries.
- Zero-shot and Compositional Reasoning: The retrieval objective includes explicit benchmarks for compositionality, long-context understanding, and query complexity, measuring the ability of the system to ground fine-grained descriptions in long-form and ambiguous queries.
- Contrastive and Generative Objectives: In addition to standard InfoNCE or supervised contrastive losses for alignment, the model incorporates next-token generation and denoising objectives on joint audio-text sequences, encouraging the learning of shared structure and deeper modality interplay.
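The contrastive half of the objective described above can be made concrete with a minimal sketch. The code below is illustrative, not the paper's implementation: it computes a symmetric InfoNCE loss over a batch of paired audio and text embeddings in pure Python (a real system would use a tensor library), and the function name, temperature default, and embedding format are assumptions. The paper's full objective would add next-token generation and denoising terms on joint audio-text sequences on top of this alignment loss.

```python
import math

def info_nce(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Row i of audio_embs is the positive match for row i of text_embs;
    all other rows in the batch serve as in-batch negatives.
    Illustrative sketch only -- not the paper's actual implementation.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(u):
        n = math.sqrt(dot(u, u))
        return [x / n for x in u]

    A = [normalize(a) for a in audio_embs]
    T = [normalize(t) for t in text_embs]

    # Cosine-similarity matrix, scaled by temperature.
    S = [[dot(a, t) / temperature for t in T] for a in A]

    def row_loss(scores, pos):
        # Numerically stable cross-entropy: -log softmax(scores)[pos].
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        return log_z - scores[pos]

    n = len(A)
    a2t = sum(row_loss(S[i], i) for i in range(n)) / n               # audio -> text
    t2a = sum(row_loss([S[j][i] for j in range(n)], i) for i in range(n)) / n  # text -> audio
    return 0.5 * (a2t + t2a)
```

With perfectly aligned pairs (each audio embedding identical to its text match and orthogonal to the rest), the loss approaches zero; with mismatched pairs it grows large, which is the gradient signal that pulls matching pairs together in the shared space.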
Empirical Findings
The MLLM-based retrieval models are comprehensively evaluated against established and new state-of-the-art contrastive approaches, including CLAP [clap], M2D-CLAP [m2dclap], and Cacophony [cacophony], as well as recent MLLM-based embedding models such as Qwen3-VL-Embedding [qwen3vlembed], UniME [unime], and Lamra [lamra].
Key results established in the paper:
- On standard benchmarks (AudioCaps [audiocaps], Clotho [clotho], ESC-50k [esc50k], VGGSound [vggsound]), the proposed approach sets new performance records in both text-to-audio and audio-to-text retrieval, with clear margins over prior SOTA.
- The model is notably strong on compositional and long-context queries: for example, on adversarially-composed and compositional splits, it achieves more than 20% higher recall than shallow contrastive models.
- Performance scales monotonically and substantially with model parameters and training data, with no sign of saturation at current scales, indicating that further scaling should yield continued gains.
- Detailed ablations confirm the effectiveness of cross-modal autoregressive objectives and the necessity of modality-adaptive pretraining compared to naive concatenation or shallow fusion.
- The model demonstrates robust zero-shot transfer to out-of-domain datasets and novel sound categories.
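The recall figures reported above follow the standard Recall@K protocol for cross-modal retrieval. A minimal sketch of that metric, under the usual assumption that query i's ground-truth item sits at index i of the candidate pool (function name and inputs are illustrative, not from the paper):

```python
def recall_at_k(sim_matrix, k):
    """Recall@K for retrieval: the fraction of queries whose ground-truth
    item (assumed at the same index as the query) appears among the
    top-K highest-scoring candidates. Illustrative sketch only."""
    hits = 0
    for i, scores in enumerate(sim_matrix):
        # Indices of the K highest-similarity candidates for query i.
        top_k = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
        hits += int(i in top_k)
    return hits / len(sim_matrix)
```

The same routine computes text-to-audio recall from a query-by-candidate similarity matrix and audio-to-text recall from its transpose, which is how both directions reported in the paper's benchmarks are typically evaluated.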
Significant Claims and Contradictions
- MLLM-based architectures decisively outperform contrastive-only models, particularly on context-dependent, compositional, and long-audio queries.
- Simple scaling of unimodal contrastive models plateaus early, whereas MLLMs exhibit continued improvement with scale, echoing trends in vision-language domains.
- Previous assumptions regarding the sufficiency of pairwise contrastive objectives for open-ended retrieval are refuted in the context of complex, long-tailed audio semantics.
Practical and Theoretical Implications
The results substantiate the claim that MLLMs with explicit sequence modeling heads and compositional training objectives are the new paradigm for cross-modal retrieval at scale. This has key implications:
- Modeling: Retrieval systems can exploit deeper context, semantics, and compositional structure, improving recall and precision for complex, ambiguous queries.
- Data: There is an empirical incentive to continually expand and enrich multimodal datasets, particularly with fine-grained and compositional supervision signals.
- Applications: Downstream tasks such as video grounding, AVQA, and open-ended event detection are likely to benefit from MLLM-driven retrieval, especially in zero-shot and generalization regimes.
- Interpretability and Alignment: The sequence-based modeling and generative heads of MLLMs offer opportunities for more interpretable matching scores and retrieval rationales.
Future Developments
The demonstrated scale trends strongly suggest that further increases in model size, training breadth, and improved multimodal data curation will yield continued gains. Open problems include efficient adaptation and personalization, low-resource generalization, and more explicit handling of temporal structure in audio modalities. Future MLLM architectures that natively handle more modalities (e.g., video, multi-channel audio, sensor data) and leverage emergent world knowledge are likely next steps in universal cross-modal retrieval.
Conclusion
This paper provides a systematic evaluation of scaling laws and architectural choices for audio-text retrieval with MLLMs, establishing the clear superiority of autoregressive, sequence-based, multimodal models over shallow contrastive approaches on a broad suite of tasks, especially in compositional and open-ended scenarios. The evidence calls for a paradigm shift in the design of retrieval systems for audio and aligns with similar trends in vision-language research, with strong implications for the future design and deployment of multimodal AI systems (2602.18010).