
MM-RQ-VAE: Unified Multimodal Quantized VAE

Updated 3 September 2025
  • The paper introduces a unified framework that hierarchically quantizes continuous latent embeddings into discrete semantic tokens for various modalities.
  • It employs MMD-based reconstruction and cross-modal contrastive losses to ensure robust semantic alignment and precise distance preservation.
  • The approach scales for high-dimensional recommendation, retrieval, and generative tasks, integrating seamlessly with LLMs for adaptive modal fusion.

A Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) constitutes a unified framework for learning discrete, compositional representations across diverse modalities such as images, text, audio, and collaborative embeddings. It integrates the hierarchical residual quantization mechanisms of RQ-VAE with principled multimodal fusion strategies and contrastive objectives, enabling robust semantic alignment, distance preservation, and scalable latent modeling suitable for high-dimensional recommendation, retrieval, and generative tasks.

1. Conceptual Foundation and Architecture

MM-RQ-VAE extends standard VAE-based multimodal architectures by hierarchically quantizing continuous latent representations. For each modality $j$ (such as collaborative, visual, or textual features), a modality-specific encoder $E_j$ maps the raw input $s_j$ to a semantic latent embedding $z_j$. A multi-level residual quantization is then applied, such that:

  • At quantization level $l$, given the input residual $r_{l-1}$ (with $r_0 = z_j$), the nearest codeword $CE^l_{SID_j^{(l)}}$ in codebook $C^l_j$ is selected by minimizing the Euclidean distance:

$$SID_j^{(l)} = \arg\min_k \| r_{l-1} - CE^l_k \|^2$$

$$r_l = r_{l-1} - CE^l_{SID_j^{(l)}}$$

  • After $L$ quantization stages, the final quantized latent for modality $j$ is $z_{j,\text{quant}} = \sum_{l=1}^L CE^l_{SID_j^{(l)}}$.
  • A decoder $D_j$ reconstructs the original modality embedding from $z_{j,\text{quant}}$.

The architecture supports parallel quantization for multiple modalities and can employ contrastive modules for cross-modal semantic alignment.
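To make the quantization step above concrete, here is a minimal sketch of $L$-level residual quantization for one modality, assuming an encoder output of shape (batch, d) and per-level codebooks of shape (K, d). The function and variable names are illustrative, not taken from the paper's code, and gradient handling (e.g., a straight-through estimator) is omitted.

```python
import torch

def residual_quantize(z_j: torch.Tensor, codebooks: list[torch.Tensor]):
    """Quantize z_j with L codebooks; return summed codewords and per-level semantic IDs."""
    residual = z_j                                        # r_0 = z_j
    quantized = torch.zeros_like(z_j)
    semantic_ids = []
    for codebook in codebooks:                            # levels l = 1..L
        # Squared Euclidean distance between each residual and every codeword
        dists = torch.cdist(residual, codebook) ** 2      # (batch, K)
        ids = dists.argmin(dim=-1)                        # SID_j^(l)
        selected = codebook[ids]                          # CE^l_{SID_j^(l)}
        quantized = quantized + selected                  # accumulate sum over levels
        residual = residual - selected                    # r_l = r_{l-1} - CE^l_{SID_j^(l)}
        semantic_ids.append(ids)
    return quantized, torch.stack(semantic_ids, dim=-1)   # (batch, d), (batch, L)

# Example: 3 levels, 256 codewords each, 128-dimensional latents
codebooks = [torch.randn(256, 128) for _ in range(3)]
z = torch.randn(32, 128)
z_quant, sids = residual_quantize(z, codebooks)
```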

2. Training Objectives and Losses

The MM-RQ-VAE framework balances several loss components:

| Loss Term | Description | Role |
|---|---|---|
| $\mathcal{L}_{\text{Recon}}$ | MMD-based reconstruction loss: $\sum_{b, j} \text{MMD}_k^2(\text{SG}(s_j^b), \hat{s}_j^b)$ | Preserves intra-modal distances; robust to embedding collapse |
| $\mathcal{L}_{\text{RQ-VAE}}$ | Residual quantization penalty per level: $\sum_{l=1}^L \big( \|\text{SG}(r_{l-1}) - CE^l_{SID_j^{(l)}}\|^2 + \alpha \|r_{l-1} - \text{SG}(CE^l_{SID_j^{(l)}})\|^2 \big)$ | Ensures quantization fidelity and codebook commitment |
| $\mathcal{L}_{\text{Align}}$ | Cross-modal contrastive loss (e.g., InfoNCE): $-\frac{1}{N}\sum_i \log \frac{\exp(\langle \hat{z}_c^i, \hat{z}_t^i \rangle / \epsilon)}{\sum_{i'} \exp(\langle \hat{z}_c^i, \hat{z}_t^{i'} \rangle / \epsilon)}$ | Enforces inter-modal semantic correlation |

The combined objective is:

$$\mathcal{L}_{\text{MM-RQ-VAE}} = \mathcal{L}_{\text{Recon}} + \beta \mathcal{L}_{\text{Align}} + \gamma \sum_j \mathcal{L}_{\text{RQ-VAE}}$$

with hyperparameters $\beta$ and $\gamma$ weighting the cross-modal alignment and residual quantization terms, respectively (Wang et al., 2 Sep 2025).
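The following sketch assembles the three terms under simple, stated assumptions: an RBF kernel for the MMD estimator, `detach()` standing in for the stop-gradient $\text{SG}(\cdot)$, a single collaborative/text pair for the InfoNCE term, and cosine-normalized embeddings for the similarity. All function names and the choice of kernel are illustrative, not prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def mmd_rbf(x, y, sigma=1.0):
    """Squared MMD between two embedding batches with an RBF kernel (biased estimator)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def rq_vae_loss(residuals, selected_codes, alpha=0.25):
    """Per-level codebook and commitment terms with stop-gradient via detach()."""
    loss = 0.0
    for r, ce in zip(residuals, selected_codes):          # r_{l-1}, CE^l_{SID_j^(l)}
        loss = loss + ((r.detach() - ce) ** 2).sum(-1).mean() \
                    + alpha * ((r - ce.detach()) ** 2).sum(-1).mean()
    return loss

def info_nce(z_c, z_t, eps=0.07):
    """Cross-modal contrastive alignment between quantized modality embeddings."""
    logits = (F.normalize(z_c, dim=-1) @ F.normalize(z_t, dim=-1).T) / eps
    targets = torch.arange(z_c.size(0), device=z_c.device)
    return F.cross_entropy(logits, targets)

def total_loss(s, s_hat, residuals, codes, z_c, z_t, beta=1.0, gamma=1.0):
    """L_Recon + beta * L_Align + gamma * L_RQ-VAE for one modality pair."""
    return mmd_rbf(s.detach(), s_hat) + beta * info_nce(z_c, z_t) \
           + gamma * rq_vae_loss(residuals, codes)
```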

3. Modalities, Fusion, and Semantic Tokenization

  • The model accommodates collaborative (ID-based), text, and image features via separate encoders and codebooks.
  • Quantized embeddings (semantic IDs) encode hierarchical semantic relations, promoting flexible fusion and scalable tokenization.
  • Semantic ID embeddings are initialized from the pretrained MM-RQ-VAE code embeddings, which significantly mitigates catastrophic forgetting and preserves intra-modal relational metrics such as Kendall’s tau (Wang et al., 2 Sep 2025).
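A minimal sketch of this warm-start step, assuming the per-level codebooks are simply stacked into one flat ID space and projected to the LLM hidden width with a linear layer; the flat layout and the projection are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn

def init_semantic_id_embeddings(codebooks: list[torch.Tensor], llm_dim: int) -> nn.Embedding:
    """Build an embedding table for semantic-ID tokens, initialized from pretrained codewords."""
    codes = torch.cat(codebooks, dim=0)             # (L*K, d) pretrained code embeddings
    proj = nn.Linear(codes.size(-1), llm_dim)       # d -> LLM hidden size (assumed projection)
    table = nn.Embedding(codes.size(0), llm_dim)
    with torch.no_grad():
        table.weight.copy_(proj(codes))             # each semantic-ID row starts from its codeword
    return table
```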

4. Integration with LLMs

MM-RQ-VAE outputs are interfaced with LLMs by remapping quantized multimodal features and semantic IDs into the high-dimensional LLM token space. This integration addresses embedding collapse by retaining the rank and diversity of input embeddings. Fine-tuning (e.g., via LoRA) proceeds with frequency-aware modal fusion, supporting efficient inference and adaptive recombination of modality channels.
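One plausible way to wire this up, using Hugging Face transformers and peft as assumed tooling (the paper does not specify libraries): register one new token per (level, codeword) semantic ID, resize the LLM embedding table, copy in the projected code embeddings from the previous sketch, and attach LoRA adapters. The token format, base model name, and adapter targets below are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"             # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register one token per (level, codeword) pair, e.g. "<sid_l_k>" (illustrative format)
new_tokens = [f"<sid_{l}_{k}>" for l in range(3) for k in range(256)]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
# The new embedding rows can then be overwritten with the projected code embeddings
# produced by init_semantic_id_embeddings() before fine-tuning.

# LoRA adapters for parameter-efficient fine-tuning of the fused model
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
```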

5. Theoretical and Empirical Properties

  • Maximum Mean Discrepancy (MMD) as the reconstruction loss provides robustness in aligning sample distributions and maintaining meaningful feature distances.
  • Hierarchical quantization avoids codebook collapse by distributing residual information across levels, comparable with HQ-VAE's Bayesian self-annealing mechanism (Takida et al., 2023).
  • Cross-modal contrastive losses (InfoNCE) align quantized modalities, facilitating semantic generalization and retrieval accuracy.
  • Benchmarks demonstrate superior preservation of distance metrics, expanded embedding rank, and improved sequential recommendation measures (e.g., Hit Ratio, nDCG) compared to prior approaches using raw embeddings or non-quantized semantic IDs (Wang et al., 2 Sep 2025).
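As an illustration of how intra-modal distance preservation can be checked, the sketch below computes Kendall's tau between the pairwise distances of original and reconstructed embeddings; this mirrors the metric named above, but the exact evaluation protocol is assumed rather than taken from the paper.

```python
import numpy as np
from scipy.stats import kendalltau
from scipy.spatial.distance import pdist

def distance_preservation(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Rank correlation of pairwise distances before and after quantization/decoding."""
    tau, _ = kendalltau(pdist(original), pdist(reconstructed))
    return tau

rng = np.random.default_rng(0)
s = rng.normal(size=(64, 128))                      # stand-in for original embeddings
s_hat = s + 0.05 * rng.normal(size=s.shape)         # stand-in for decoded embeddings
print(f"Kendall's tau: {distance_preservation(s, s_hat):.3f}")
```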

6. Generalization to Broader Multimodal Tasks

The MM-RQ-VAE design philosophy translates to a variety of multimodal generative settings, including recommendation, retrieval, and cross-modal generation.

7. Future Directions and Challenges

  • Expanding MM-RQ-VAE with additional modalities (e.g., audio, structured metadata) may further improve semantic discrimination and robustness.
  • Adaptive codebook strategies, self-supervised contrastive alignment, and fine-grained semantic residual extraction (disentangling general and specific components) could encourage richer representation learning in next-generation multimodal VAEs.
  • A plausible implication is that MM-RQ-VAE models can be deployed in recommendation, retrieval, and generative systems where cross-modal distance preservation and semantic alignment are critical, potentially extending to LLM-enhanced conversational search and cross-modal generation.

Summary Table: MM-RQ-VAE Key Features

| Feature | Mechanism | Impact |
|---|---|---|
| Hierarchical residual quantization | Multi-level codebooks, residual updates | Discrete semantic tokenization |
| MMD reconstruction loss | Kernel mean alignment of original and decoded embeddings | Distance preservation, anti-collapse |
| Cross-modal contrastive loss | InfoNCE between quantized modalities | Alignment, inter-modal correlation |
| Semantic ID initialization | Pretrained code embedding transfer | Mitigates catastrophic forgetting |
| Multimodal fusion | Adaptively fused channels, LLM integration | Scalable cross-domain recommendation |

MM-RQ-VAE thus provides a principled, scalable, and semantically robust approach for unified multimodal representation and cross-modal interaction, synthesizing hierarchical quantization, kernel-based reconstruction, and contrastive fusion within contemporary deep generative frameworks.