SEAL-RAG: Unified Speech-Text Retrieval Framework

Updated 17 December 2025
  • SEAL-RAG is a unified speech-text retrieval framework that eliminates the traditional ASR step by aligning speech and text into a shared embedding space.
  • It utilizes modality-specific encoders with a convolution+MLP adapter and a shared linear scaling layer to ensure consistent representation across modalities.
  • Empirical evaluations show a 7-point gain in Top-1 accuracy and a 50% reduction in latency compared to conventional ASR-retrieval pipelines.

SEAL-RAG refers to a unified embedding framework for speech-enabled retrieval-augmented generation, which directly aligns speech and text modalities in a shared semantic space to enable efficient multi-modal retrieval without reliance on intermediate automatic speech recognition (ASR) steps. This approach is designed to reduce pipeline latency and mitigate the error propagation typically seen in sequential ASR + retrieval architectures, primarily for speech LLMs (SLLMs) with retrieval-augmented generation capabilities (Sun et al., 26 Jan 2025).

1. Model Architecture

SEAL-RAG builds upon two modality-specific encoders: a speech encoder instantiated as Whisper-large-v3 [Radford et al., 2023] and a text encoder based on Piccolo-large-zh-v2 (a BERT-style 24-layer Transformer). The speech encoder processes raw audio inputs into log-Mel spectrograms, which are then passed through convolutional and Transformer layers to produce $h_s \in \mathbb{R}^{T \times d_s}$. The text encoder processes tokenized input into hidden representations $h_t \in \mathbb{R}^{L \times d_t}$, where $d_s \approx d_t = 1024$.

A modality adaptation module consisting of a 1D temporal convolution (reducing time steps $T \rightarrow T'$) and a two-layer MLP with GELU activation projects speech embeddings into the text encoder’s hidden size. Both modalities subsequently use a shared linear scaling layer $S(u) = W_s u + b_s$ (with $D = 1024$), followed by $\ell_2$-normalization to ensure representations lie on the unit hypersphere.

The resulting end-to-end embedding functions are:

$$f_{\mathrm{speech}}(x_s) = S\left(\mathrm{MLP}\left(\mathrm{Conv1D}(\mathrm{Whisper}(x_s))\right)\right) \in \mathbb{R}^D$$

$$f_{\mathrm{text}}(x_t) = S(\mathrm{BERT}(x_t)) \in \mathbb{R}^D$$

Both outputs are directly comparable in the unified space.
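
A minimal PyTorch sketch of the adapter, shared scaling layer, and the two embedding functions described above. Kernel size, stride, MLP width, and the mean-pooling of adapted frames into an utterance vector are illustrative assumptions, not details taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class SpeechAdapter(nn.Module):
    """Conv1D time reduction (T -> T') followed by a two-layer GELU MLP that
    projects speech hidden states into the text encoder's hidden size."""
    def __init__(self, d_speech=1024, d_text=1024, stride=2):
        super().__init__()
        # Kernel size, stride, and MLP width are assumptions for illustration.
        self.conv = nn.Conv1d(d_speech, d_text, kernel_size=3, stride=stride, padding=1)
        self.mlp = nn.Sequential(
            nn.Linear(d_text, 4 * d_text),
            nn.GELU(),
            nn.Linear(4 * d_text, d_text),
        )

    def forward(self, h_s):                    # h_s: (B, T, d_speech)
        z = self.conv(h_s.transpose(1, 2))     # (B, d_text, T')
        return self.mlp(z.transpose(1, 2))     # (B, T', d_text)


class SharedScaling(nn.Module):
    """Shared linear scaling layer S(u) = W_s u + b_s applied to both modalities,
    followed by L2-normalization onto the unit hypersphere."""
    def __init__(self, d=1024):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, u):
        return F.normalize(self.proj(u), dim=-1)


def f_speech(whisper_states, adapter, scaling):
    """Speech embedding: adapted Whisper states, pooled, then scaled and normalized.
    Mean pooling over adapted frames is an assumption (the pooling rule is not stated above)."""
    return scaling(adapter(whisper_states).mean(dim=1))   # (B, D)


def f_text(text_sentence_embedding, scaling):
    """Text embedding: shared scaling over the text encoder's sentence representation."""
    return scaling(text_sentence_embedding)                # (B, D), comparable to speech
```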

2. Training Objectives and Optimization

SEAL-RAG employs a two-stage training strategy:

  • Stage 1: Speech–Text Alignment

Given paired utterances $(x_s, x_t)$, the token-level alignment is enforced by minimizing mean squared error across token pairs:

$$L_{\mathrm{pre}} = \frac{1}{T' L} \sum_{i=1}^{T'} \sum_{j=1}^{L} \left\| \tilde{z}_s^{(i)} - z_t^{(j)} \right\|_2^2$$

where $\tilde{z}_s^{(i)}$ denotes adapted speech hidden states, and $z_t^{(j)}$ text hidden states.
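
Read literally, $L_{\mathrm{pre}}$ averages the squared distance over all $(i, j)$ frame-token pairs; a minimal sketch for a single paired utterance (batching and padding masks omitted):

```python
import torch

def alignment_loss(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """Stage-1 alignment loss L_pre for one utterance pair.

    z_s: adapted speech hidden states, shape (T', D)
    z_t: text hidden states,           shape (L,  D)
    Returns (1 / (T' * L)) * sum_ij ||z_s[i] - z_t[j]||_2^2.
    """
    pairwise_sq_dist = torch.cdist(z_s, z_t, p=2) ** 2   # (T', L)
    return pairwise_sq_dist.mean()
```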

  • Stage 2: Contrastive Retrieval Fine-Tuning

For a speech query $x_s$, positive document $d_+$, and negatives $\{d_i\}$, the InfoNCE retrieval loss is minimized:

$$L_{\mathrm{ret}} = -\log \left[ \frac{\exp(s(q, k_+)/\tau)}{\exp(s(q, k_+)/\tau) + \sum_{i=1}^{N} \exp(s(q, k_i)/\tau)} \right]$$

with cosine similarities $s(u, v)$ and temperature $\tau = 0.07$.
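
A sketch of $L_{\mathrm{ret}}$ for a single query with explicit negatives; how negatives are mined (e.g., in-batch documents) is an assumption not specified above.

```python
import torch
import torch.nn.functional as F

def infonce_loss(q, k_pos, k_neg, tau=0.07):
    """InfoNCE retrieval loss L_ret for one speech query.

    q:     (D,)   speech query embedding
    k_pos: (D,)   positive document embedding
    k_neg: (N, D) negative document embeddings
    All embeddings are assumed L2-normalized, so dot products equal cosine similarities.
    """
    s_pos = (q * k_pos).sum() / tau            # s(q, k+) / tau
    s_neg = (k_neg @ q) / tau                  # s(q, k_i) / tau, shape (N,)
    logits = torch.cat([s_pos.unsqueeze(0), s_neg])
    # The negative log-softmax of the positive entry reproduces the formula above.
    return -F.log_softmax(logits, dim=0)[0]
```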

The two-stage process ensures both local (token-level) and global (utterance/document-level) alignment in the shared space, with additional objectives such as the CoSENT loss applied for STS and classification tasks where relevant.

3. Theoretical Motivation and Design Principles

SEAL-RAG directly addresses fundamental limitations in ASR-mediated retrieval for SLLMs:

  • Acoustic variability (background noise, speaker diversity, sampling rates)
  • Speaker differences (accent, pitch, speaking rate)
  • Sequence length discrepancies (variable-length speech vs. fixed-length text)
  • Error propagation: ASR token errors can irretrievably degrade retrieval and downstream generation

Architectural elements mitigating these include: (1) pre-training on modality-specific data for robust low-level features, (2) the convolution+MLP adapter for acoustic pattern preservation and dimensionality matching, (3) the shared linear scaling layer for embedding space alignment, and (4) large-scale contrastive pre-training and fine-tuning to sharpen discrimination and close the cross-modal gap (Sun et al., 26 Jan 2025). The unified speech-text representation removes the ASR bottleneck, giving speech queries direct access to retrieval at text-level precision and speed.

4. Retrieval-Augmented Generation Inference

At inference, all textual knowledge base documents are pre-embedded as $k_i = f_{\mathrm{text}}(d_i)$. Given a raw speech query, SEAL-RAG derives $q = f_{\mathrm{speech}}(x_s)$ in approximately 0.31 seconds. Approximate nearest-neighbor search (e.g., FAISS) retrieves the top-$K$ candidates maximizing $\cos(q, k_i)$. Retrieved document texts are formatted and concatenated into the LLM prompt alongside the speech query representation, and the LLM then performs the generation in a single end-to-end pass. The elimination of the ASR step reduces pipeline latency by approximately 50% over conventional ASR+retrieval cascades.
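
A minimal retrieval sketch with FAISS; the index type, corpus size, and random stand-in embeddings are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
import faiss

D = 1024                                       # unified embedding dimension

# Pre-embed the knowledge base once: k_i = f_text(d_i), L2-normalized so that
# inner product equals cosine similarity. Random vectors stand in for real documents.
doc_embeddings = np.random.randn(100_000, D).astype("float32")
faiss.normalize_L2(doc_embeddings)

index = faiss.IndexFlatIP(D)                   # exact inner-product search; IVF/HNSW give ANN variants
index.add(doc_embeddings)

def retrieve(q, top_k=3):
    """Return ids and cosine scores of the top-K documents for q = f_speech(x_s)."""
    q = np.asarray(q, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, top_k)
    return ids[0], scores[0]

# The texts of the retrieved documents are then concatenated into the LLM prompt
# alongside the speech query representation for a single end-to-end generation pass.
```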

5. Empirical Evaluation and Performance

SEAL-RAG was trained on 170k hours of public and proprietary speech (with strict signal quality filtering) and fine-tuned on 80k hours of synthetic, speaker-diverse speech. Evaluation spans CMTEB (35 Chinese embedding tasks) and a large multi-domain knowledge base retrieval test.

Key results include:

Method               Time/query (s)   Top-1 Acc (%)   Top-3 Acc (%)
Text-Only            0.03             90.41           95.89
ASR→Text Pipeline    0.67             79.45           85.62
Project→Text         0.43             24.49           54.08
Align w/ CTC         0.31             78.08           84.25
SEAL-RAG             0.31             86.36           92.47

SEAL-RAG increases Top-1 accuracy by approximately 7 points and halves latency compared to the best ASR+retrieval pipeline. On CMTEB, systems using SEAL-RAG embeddings yield a 5.2 point gain relative to strong ASR+text baselines. Ablation experiments confirm that both stages—alignment pre-training and contrastive fine-tuning—are indispensable for peak retrieval accuracy.

6. Implementation Overview and Limitations

SEAL-RAG leverages a high-resource infrastructure (256 × NVIDIA V100 GPUs, mixed precision), with batch sizes of 8 per GPU, AdamW optimization, and linear warmup scheduling. The embedding projection dimension is fixed to $D = 1024$ to ensure compatibility with typical vector search frameworks. Pre-training and fine-tuning are performed for 3 epochs each at learning rates $1 \times 10^{-5}$ and $8 \times 10^{-6}$, respectively.
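
A sketch of this optimization setup in PyTorch; only AdamW, linear warmup, and the two learning rates come from the description above, while the warmup length, total step count, and the post-warmup linear decay are assumptions.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, lr=1e-5, warmup_steps=1_000, total_steps=100_000):
    """AdamW with linear warmup (decay behavior after warmup is an assumption)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # linear warmup to the base lr
        # Linear decay to zero over the remaining steps (assumed, not stated above).
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)

# Stage 1 (alignment pre-training): lr = 1e-5, 3 epochs.
# Stage 2 (contrastive fine-tuning): lr = 8e-6, 3 epochs.
```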

The primary limitation is that SEAL-RAG, by learning from acoustic-textual alignment on large-scale data, is highly dependent on the representativeness and quality of its training corpus for generalization to novel acoustic environments or underrepresented dialects. Additionally, direct mapping into a unified space imposes an upper bound on the expressivity and granularity that may be achievable compared to supervised, domain-specific adaptation. In settings where exhaustive enumeration of possible entities from long-form queries is required (e.g., "list all 20 works"), the fixed-$K$ constraint inherited from the RAG paradigm imposes an inherent ceiling on recall.

7. Significance and Context within RAG Research

SEAL-RAG constitutes a paradigm shift for multi-modal RAG by decoupling retrieval from ASR and aligning speech with text at the representation level. The theoretical justification lies in removing an error-prone step (ASR) and directly leveraging distributional similarity for retrieval, resulting in reductions in latency and error cascades, and measurable accuracy gains (Sun et al., 26 Jan 2025). Analysis in the referenced work demonstrates that both local (token-level) and global (embedding-level, contrastive) supervision are required to bridge the acoustic-semantic gap and realize the full benefit of end-to-end speech retrieval.

This approach represents a substantial advance for the deployment of speech-enabled knowledge-intensive systems, reducing both the cost and failure rate of knowledge retrieval in SLLMs and multi-modal LLMs. Compared with prevailing multi-hop controllers in text RAG, such as the SEAL-RAG approach to multi-hop context dilution (Lahmy et al., 11 Dec 2025) and SEER for evidence extraction (Zhao et al., 2024), it complements those directions by targeting information density, faithfulness, and cross-modal robustness for complex open-domain QA and retrieval-augmented inference.
