
MM-RQ-VAE: Unified Multimodal Quantized VAE

Updated 3 September 2025
  • The paper introduces a unified framework that hierarchically quantizes continuous latent embeddings into discrete semantic tokens for various modalities.
  • It employs MMD-based reconstruction and cross-modal contrastive losses to ensure robust semantic alignment and precise distance preservation.
  • The approach scales for high-dimensional recommendation, retrieval, and generative tasks, integrating seamlessly with LLMs for adaptive modal fusion.

A Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) constitutes a unified framework for learning discrete, compositional representations across diverse modalities such as images, text, audio, and collaborative embeddings. It integrates the hierarchical residual quantization mechanisms of RQ-VAE with principled multimodal fusion strategies and contrastive objectives, enabling robust semantic alignment, distance preservation, and scalable latent modeling suitable for high-dimensional recommendation, retrieval, and generative tasks.

1. Conceptual Foundation and Architecture

MM-RQ-VAE extends standard VAE-based multimodal architectures by hierarchically quantizing continuous latent representations. For each modality $j$ (such as collaborative, visual, or textual features), a modality-specific encoder $E_j$ maps the raw input $s_j$ to a semantic latent embedding $z_j$. A multi-level residual quantization is then employed, such that:

  • At quantization level $l$, given input residual $r_{l-1}$ (with $r_0 = z_j$), the nearest codeword $CE^l_{SID_j^{(l)}}$ in codebook $C^l_j$ is selected by minimizing Euclidean distance:

$$SID_j^{(l)} = \arg\min_k \| r_{l-1} - CE^l_k \|^2$$

$$r_l = r_{l-1} - CE^l_{SID_j^{(l)}}$$

  • After $L$ quantization stages, the final quantized latent for modality $j$ is $z_{j,\text{quant}} = \sum_{l=1}^L CE^l_{SID_j^{(l)}}$.
  • A decoder $D_j$ reconstructs the original modality embedding from $z_{j,\text{quant}}$.

The architecture supports parallel quantization for multiple modalities and can employ contrastive modules for cross-modal semantic alignment.
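
As a concrete point of reference, the following PyTorch sketch mirrors the multi-level residual quantization loop above. The class name `ResidualQuantizer`, its constructor arguments, and the omission of the straight-through estimator and per-modality decoders are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of multi-level residual quantization for one modality (PyTorch).
import torch
import torch.nn as nn

class ResidualQuantizer(nn.Module):
    def __init__(self, num_levels: int, codebook_size: int, dim: int):
        super().__init__()
        # One codebook C^l per quantization level l.
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_levels)]
        )

    def forward(self, z: torch.Tensor):
        """z: (batch, dim) latent embedding from a modality encoder E_j."""
        residual = z
        z_quant = torch.zeros_like(z)
        semantic_ids = []
        for codebook in self.codebooks:
            # SID^(l) = argmin_k || r_{l-1} - CE^l_k ||^2  (nearest codeword)
            dists = torch.cdist(residual, codebook.weight)   # (batch, codebook_size)
            sid = dists.argmin(dim=-1)                       # (batch,)
            selected = codebook(sid)                         # (batch, dim)
            semantic_ids.append(sid)
            z_quant = z_quant + selected                     # accumulate codewords over levels
            residual = residual - selected                   # r_l = r_{l-1} - CE^l_{SID^(l)}
        return z_quant, torch.stack(semantic_ids, dim=-1)    # quantized latent + semantic IDs
```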

2. Training Objectives and Losses

The MM-RQ-VAE framework balances several loss components:

| Loss Term | Description | Role |
|---|---|---|
| $\mathcal{L}_{\text{Recon}}$ | MMD-based reconstruction loss: $\sum_{b, j} \text{MMD}_k^2(\text{SG}(s_j^b), \hat{s}_j^b)$ | Preserves intra-modal distances; robust to embedding collapse |
| $\mathcal{L}_{\text{RQ-VAE}}$ | Residual quantization penalty per level: $\sum_{l=1}^L \big( \lVert \text{SG}(r_{l-1}) - CE^l_{SID_j^{(l)}} \rVert^2 + \alpha \lVert r_{l-1} - \text{SG}(CE^l_{SID_j^{(l)}}) \rVert^2 \big)$ | Ensures quantization fidelity and codebook commitment |
| $\mathcal{L}_{\text{Align}}$ | Cross-modal contrastive loss (e.g., InfoNCE): $-\frac{1}{N}\sum_i \log \frac{\exp(\langle \hat{z}_c^i, \hat{z}_t^i \rangle / \epsilon)}{\sum_{i'} \exp(\langle \hat{z}_c^i, \hat{z}_t^{i'} \rangle / \epsilon)}$ | Enforces inter-modal semantic correlation |

The combined objective is:

$$\mathcal{L}_{\text{MM-RQ-VAE}} = \mathcal{L}_{\text{Recon}} + \beta \mathcal{L}_{\text{Align}} + \gamma \sum_j \mathcal{L}_{\text{RQ-VAE}}$$

with hyperparameters $\beta$, $\gamma$ controlling modality fusion and quantization rigor (Wang et al., 2 Sep 2025).
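
To make the objective concrete, a minimal PyTorch sketch of the three loss terms is given below. The Gaussian RBF kernel for MMD, the temperature default, and the helper names (`mmd_loss`, `info_nce`, `combined_loss`) are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of the combined MM-RQ-VAE training objective (PyTorch).
import torch
import torch.nn.functional as F

def mmd_loss(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD between batches x and y under an assumed Gaussian RBF kernel."""
    def rbf(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()

def info_nce(z_c: torch.Tensor, z_t: torch.Tensor, eps: float = 0.07) -> torch.Tensor:
    """Cross-modal InfoNCE: matched (collaborative, text) pairs are positives."""
    z_c, z_t = F.normalize(z_c, dim=-1), F.normalize(z_t, dim=-1)
    logits = z_c @ z_t.t() / eps                            # (N, N) similarity matrix
    targets = torch.arange(z_c.size(0), device=z_c.device)  # diagonal = positive pairs
    return F.cross_entropy(logits, targets)

def combined_loss(s, s_hat, z_c_quant, z_t_quant, rq_losses, beta=0.1, gamma=0.25):
    """L = L_Recon + beta * L_Align + gamma * sum_j L_RQ-VAE.
    s / s_hat: dicts of original / reconstructed embeddings per modality j.
    rq_losses: per-modality residual-quantization penalties (stop-gradient terms
    assumed to be applied inside the quantizer)."""
    recon = sum(mmd_loss(s[j].detach(), s_hat[j]) for j in s)   # SG(s_j) via detach
    align = info_nce(z_c_quant, z_t_quant)
    return recon + beta * align + gamma * sum(rq_losses)
```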

3. Modalities, Fusion, and Semantic Tokenization

  • The model accommodates collaborative (ID-based), text, and image features via separate encoders and codebooks.
  • Quantized embeddings (semantic IDs) encode hierarchical semantic relations, promoting flexible fusion and scalable tokenization.
  • Semantic ID embeddings are initialized from pretrained MM-RQ-VAE code embeddings, which significantly mitigates catastrophic forgetting and preserves intra-modal relational metrics such as Kendall’s tau (Wang et al., 2 Sep 2025); a minimal initialization sketch follows this list.
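
A minimal sketch of this warm-start, assuming the codebooks are available as per-level tensors and that a linear projection bridges any dimensionality gap (both assumptions for illustration):

```python
# Illustrative warm-start of semantic-ID token embeddings from pretrained
# MM-RQ-VAE code embeddings; the flat (level, codeword) layout is assumed.
import torch

def init_semantic_id_embeddings(codebooks, embed_dim: int) -> torch.nn.Embedding:
    """codebooks: list of (codebook_size, code_dim) tensors, one per level.
    Returns an embedding table whose rows are the pretrained code vectors,
    optionally projected into the target embedding dimension."""
    codes = torch.cat(codebooks, dim=0)                  # (L * K, code_dim)
    if codes.size(1) != embed_dim:
        proj = torch.nn.Linear(codes.size(1), embed_dim, bias=False)
        codes = proj(codes)                              # map into target space
    table = torch.nn.Embedding(codes.size(0), embed_dim)
    table.weight.data.copy_(codes.detach())              # warm-start instead of random init
    return table
```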

4. Integration with LLMs

MM-RQ-VAE outputs are interfaced with LLMs by remapping quantized multimodal features and semantic IDs into the high-dimensional LLM token space. This integration addresses embedding collapse by retaining the rank and diversity of input embeddings. Fine-tuning (e.g., via LoRA) proceeds with frequency-aware modal fusion, supporting efficient inference and adaptive recombination of modality channels.
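
A hedged sketch of this interface using Hugging Face transformers and peft is shown below. The base checkpoint, the `<sid_l_k>` token format, the codebook layout, the random placeholder code embeddings, and the LoRA target modules are all illustrative assumptions; frequency-aware modal fusion is omitted.

```python
# Sketch: remap semantic IDs into an LLM token space, warm-start the new
# embedding rows, and attach LoRA adapters for parameter-efficient fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/opt-125m"            # small public placeholder checkpoint
num_levels, codebook_size = 3, 256    # assumed MM-RQ-VAE codebook layout

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One new token per (level, codeword) pair, e.g. "<sid_0_17>".
new_tokens = [f"<sid_{l}_{k}>" for l in range(num_levels) for k in range(codebook_size)]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Warm-start the new rows from (projected) MM-RQ-VAE code embeddings to retain
# rank and diversity; random values stand in for the real code vectors here.
projected_codes = torch.randn(len(new_tokens), model.config.hidden_size)
with torch.no_grad():
    model.get_input_embeddings().weight[-len(new_tokens):] = projected_codes

# LoRA adapters on the attention projections (module names depend on the base model).
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
```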

5. Theoretical and Empirical Properties

  • Maximum Mean Discrepancy (MMD) as the reconstruction loss provides robustness in aligning sample distributions and maintaining meaningful feature distances.
  • Hierarchical quantization avoids codebook collapse by distributing residual information across levels, comparable with HQ-VAE's Bayesian self-annealing mechanism (Takida et al., 2023).
  • Cross-modal contrastive losses (InfoNCE) align quantized modalities, facilitating semantic generalization and retrieval accuracy.
  • Benchmarks demonstrate superior preservation of distance metrics, expanded embedding rank, and improved sequential recommendation measures (e.g., Hit Ratio, nDCG) compared to prior approaches using raw embeddings or non-quantized semantic IDs (Wang et al., 2 Sep 2025); a minimal sketch of these ranking metrics follows this list.
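
For reference, a minimal sketch of the two ranking metrics in the single-target sequential setting follows; the ranked-list input format is an assumption.

```python
# Hit Ratio@K and nDCG@K for one held-out target item per user interaction.
import math

def hit_ratio_at_k(ranked_items, target, k: int) -> float:
    """1.0 if the held-out target appears in the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k: int) -> float:
    """Discounted gain of the single relevant (target) item within the top k;
    the ideal DCG is 1 when there is exactly one relevant item."""
    if target in ranked_items[:k]:
        rank = ranked_items.index(target)          # 0-based position
        return 1.0 / math.log2(rank + 2)
    return 0.0

# Usage: average over the evaluation set of (ranked_list, target) pairs, e.g.
# hr10 = sum(hit_ratio_at_k(r, t, 10) for r, t in eval_set) / len(eval_set)
```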

6. Generalization to Broader Multimodal Tasks

The MM-RQ-VAE design philosophy translates to a variety of multimodal generative settings beyond recommendation, such as cross-modal retrieval and conditional generation, where discrete semantic tokens serve as a shared interface between modality-specific encoders and downstream sequence models.

7. Future Directions and Challenges

  • Expanding MM-RQ-VAE with additional modalities (e.g., audio, structured metadata) may further improve semantic discrimination and robustness.
  • Adaptive codebook strategies, self-supervised contrastive alignment, and fine-grained semantic residual extraction (disentangling general and specific components) could encourage richer representation learning in next-generation multimodal VAEs.
  • A plausible implication is that MM-RQ-VAE models can be deployed in recommendation, retrieval, and generative systems where cross-modal distance preservation and semantic alignment are critical, potentially extending to LLM-enhanced conversational search and cross-modal generation.

Summary Table: MM-RQ-VAE Key Features

| Feature | Mechanism | Impact |
|---|---|---|
| Hierarchical Residual Quant. | Multi-level codebooks, residual updates | Discrete semantic tokenization |
| MMD Reconstruction Loss | Kernel mean alignment of original/decoded embeddings | Distance preservation, anti-collapse |
| Cross-modal Contrastive Loss | InfoNCE between quantized modalities | Alignment, inter-modal correlation |
| Semantic ID Initialization | Pretrained code embedding transfer | Mitigates catastrophic forgetting |
| Multimodal Fusion | Adaptively fused channels, LLM integration | Scalable cross-domain recommendation |

MM-RQ-VAE thus provides a principled, scalable, and semantically robust approach for unified multimodal representation and cross-modal interaction, synthesizing hierarchical quantization, kernel-based reconstruction, and contrastive fusion within contemporary deep generative frameworks.
