Vision-Aligned Residual Quantization (VRQ)

Updated 31 October 2025
  • The paper introduces VRQ, which uses supervised, hierarchical residual quantization to enforce multi-view consistency and fine-grained discrimination in visual retrieval.
  • It integrates contrastive and margin losses with business attribute augmentation to produce robust semantic IDs for unified e-commerce search.
  • Empirical evaluations show significant improvements in HR, CTR, and personalization over traditional methods, validating VRQ’s effectiveness in generative retrieval frameworks.

Vision-aligned Residual Quantization (VRQ) is a supervised, hierarchical representation learning and quantization technique for multi-view visual retrieval. It aligns feature embeddings of the same object captured under varied perspectives while enabling fine-grained discrimination among products and efficient integration of business-relevant attributes. VRQ serves as the encoding backbone in end-to-end generative retrieval frameworks, such as OneVision, providing discrete semantic IDs (SIDs) for unified vision search, personalization, and catalog management in e-commerce environments (Zheng et al., 7 Oct 2025).

1. Conceptual Motivation and Multi-View Alignment

VRQ addresses the problem of discrepant representations in multi-view e-commerce vision search, wherein a product may be depicted in various images reflecting different backgrounds, orientations, or acquisition conditions. Conventional quantization methods (e.g., VQ-VAE, RQ-KMeans, OPQ) often yield inconsistent codes for the same object, impeding reliable recall and ranking.

The essential goals for VRQ are:

  • Multi-view Consistency: Ensuring that different views/images of the same product map to the same codes in the shallow quantization levels.
  • Discrimination: Maintaining the separability of distinct products, especially at deeper quantization levels.
  • Supervised Codebook Training: Exploiting annotated product metadata (category, behavioral signals) to enhance the training of codebooks for residual quantization.

VRQ integrates multi-view contrastive learning objectives and supervised category alignment to enforce these properties, directly mitigating representation drift and reducing mismatches in generative retrieval.

2. Mathematical Formulation and Encoding Pipeline

The VRQ workflow is characterized by hierarchical, trainable residual quantization with the following key steps:

A. Feature Fusion

Let an image $x$ and its category label $y$ be encoded as:

  • Visual feature: $v := \mathcal{E}_v(x)$
  • Category feature: $t := \mathcal{E}_t(y)$

A dynamic fusion produces the final representation $f = (1-\alpha) \cdot v + \alpha \cdot t + f_{\text{cat}}$, where $\alpha$ is computed by an MLP-sigmoid gate over the concatenated features $(v \Vert t)$, and $f_{\text{cat}}$ is an additional MLP mapping of image-text embeddings.
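
A minimal sketch of this fusion step is given below, assuming precomputed visual and category embeddings of equal dimension; the layer sizes are illustrative, and the image-text embedding feeding $f_{\text{cat}}$ is approximated here by the concatenated $(v \Vert t)$ pair rather than a separate encoder.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Sketch of VRQ's dynamic feature fusion (dimensions are illustrative)."""

    def __init__(self, d: int):
        super().__init__()
        # MLP-sigmoid gate computing alpha from the concatenated features (v || t)
        self.gate = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                  nn.Linear(d, 1), nn.Sigmoid())
        # Additional MLP producing f_cat; here it reads (v || t) as a stand-in
        # for the image-text embedding described in the paper (an assumption).
        self.f_cat = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, v: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        vt = torch.cat([v, t], dim=-1)
        alpha = self.gate(vt)                     # shape (N, 1), in (0, 1)
        return (1 - alpha) * v + alpha * t + self.f_cat(vt)

# Usage: f = DynamicFusion(d=512)(v, t) for batches v, t of shape (N, 512).
```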

B. Multi-View Contrastive and Category Loss

Let $f_i^{(1)}$ and $f_i^{(2)}$ be the fused features from two views of product $i$. The multi-view contrastive loss is $\mathcal{L}_{\mathrm{cons}} = -\frac{1}{N}\sum_{i=1}^N \Big[ \log \frac{\exp(f_i^{(1)} \cdot f_i^{(2)} / \tau)}{\sum_{j}\exp(f_i^{(1)} \cdot f_j^{(2)} / \tau)} + \log \frac{\exp(v_i^{(1)} \cdot f_i^{(2)} / \tau)}{\sum_{j}\exp(v_i^{(1)} \cdot f_j^{(2)} / \tau)} \Big]$. A margin loss further promotes category-level separation: $\mathcal{L}_{\mathrm{mar}} = \frac{1}{N} \sum_{i,j} \max\left(0,\ -\gamma + s_{ij}^{f2f} - s_{ii}^{v2f}\right)$, where $s_{ij}^{f2f}$ is the similarity between the $i$-th and $j$-th fused features and $s_{ii}^{v2f}$ is the similarity between the $i$-th visual feature and its fused counterpart.
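
The sketch below computes both losses for a batch, assuming L2-normalized fused features from two views (f1, f2) and the corresponding visual features (v1); the temperature, margin, and the exclusion of $i = j$ pairs from the margin sum are illustrative choices, not confirmed details of the paper.

```python
import torch
import torch.nn.functional as F

def vrq_consistency_losses(f1, f2, v1, tau: float = 0.07, gamma: float = 0.2):
    """f1, f2: fused features of two views, (N, d); v1: visual features, (N, d)."""
    n = f1.size(0)
    labels = torch.arange(n, device=f1.device)

    # InfoNCE over fused-fused and visual-fused similarities (positive = same product)
    logits_ff = f1 @ f2.t() / tau
    logits_vf = v1 @ f2.t() / tau
    l_cons = F.cross_entropy(logits_ff, labels) + F.cross_entropy(logits_vf, labels)

    # Margin loss: cross-item fused similarity s_ij^{f2f} should stay at least
    # gamma below the same-item visual-to-fused similarity s_ii^{v2f}.
    s_f2f = f1 @ f2.t()                          # (N, N)
    s_v2f = (v1 * f2).sum(dim=-1)                # (N,)
    viol = F.relu(-gamma + s_f2f - s_v2f.unsqueeze(1))
    off_diag = ~torch.eye(n, dtype=torch.bool, device=f1.device)  # drop i == j (assumption)
    l_mar = viol[off_diag].sum() / n

    return l_cons, l_mar
```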

C. Hierarchical Residual Quantization

The fused feature $f$ is decomposed via $L$ codebook layers:

  • For $l = 0, \ldots, L-1$ (with $r_0 = f$):

    1. Select code $c_l = \arg\min_k \| r_l - e_k^{(l)} \|$
    2. Residual update: $r_{l+1} = r_l - e_{c_l}^{(l)}$
    3. Quantized feature at level $l$: $\hat{f}^{(l)} = \sum_{i=0}^{l} e_{c_i}^{(i)}$
  • Commitment loss ensures assignment proximity: $\mathcal{L}_{\text{commit}} = \sum_{l=0}^{L-1} \| r_{l} - \operatorname{sg}(e_{c_l}^{(l)}) \|^{2}$, where $\operatorname{sg}$ denotes the stop-gradient operator.

  • Hierarchical consistency loss for multi-view pairs: $\mathcal{L}_{\mathrm{hc}} = \sum_{l=0}^{L-1} \| \hat{f}_i^{(l)} - \hat{f}_j^{(l)} \|^{2}$

Codebook centroids are initialized with RQ-KMeans, followed by full VRQ training; a minimal sketch of the encoding pass appears below.
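
The following sketch runs the hierarchical quantization pass and the two quantization losses, assuming $L$ learned codebooks of shape (K, d), e.g. seeded from RQ-KMeans centroids; the straight-through estimator and the full training loop are omitted for brevity.

```python
import torch

def residual_quantize(f, codebooks):
    """f: (N, d) fused features; codebooks: list of L tensors, each of shape (K, d)."""
    residual = f                                   # r_0 = f
    partial_sum = torch.zeros_like(f)
    codes, level_quantized = [], []
    commit = f.new_zeros(())
    for e in codebooks:                            # levels l = 0 .. L-1
        dists = torch.cdist(residual, e)           # ||r_l - e_k^{(l)}|| for all k
        c = dists.argmin(dim=-1)                   # code c_l
        q = e[c]                                   # selected centroid e_{c_l}^{(l)}
        commit = commit + ((residual - q.detach()) ** 2).sum(dim=-1).mean()
        partial_sum = partial_sum + q              # \hat{f}^{(l)} = sum of centroids so far
        residual = residual - q                    # r_{l+1} = r_l - e_{c_l}^{(l)}
        codes.append(c)
        level_quantized.append(partial_sum)
    return codes, level_quantized, commit          # commit is averaged over the batch

def hierarchical_consistency(levels_i, levels_j):
    # L_hc: match per-level quantized features of two views of the same product
    return sum(((qi - qj) ** 2).sum(dim=-1).mean() for qi, qj in zip(levels_i, levels_j))
```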

D. Residual OPQ and Business Attribute Augmentation

Deep-level (final) residuals are further augmented with business-related statistics (e.g., clicks, conversion, price) and quantized via OPQ for additional discrimination. This permits differentiation among visually similar—but semantically or commercially distinct—items.
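
A simplified sketch of this augmentation step follows: the final-level residual is concatenated with normalized business statistics and quantized per subspace. Plain product quantization with per-subspace k-means stands in for OPQ here (a full OPQ would also learn a rotation of the feature space); the statistic names and subspace sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize_residual_with_business(residual, business_stats, n_subspaces=8, n_codes=256):
    """residual: (N, d) final-level residuals; business_stats: (N, b), e.g. clicks, CVR, price."""
    # z-score the business statistics so they are commensurate with the residual scale
    stats = (business_stats - business_stats.mean(0)) / (business_stats.std(0) + 1e-6)
    x = np.concatenate([residual, stats], axis=1)

    # Split dimensions into subspaces and learn one small codebook per subspace
    # (requires N >= n_codes samples per k-means fit).
    codes = np.empty((x.shape[0], n_subspaces), dtype=np.int64)
    for m, xm in enumerate(np.array_split(x, n_subspaces, axis=1)):
        km = KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(xm)
        codes[:, m] = km.labels_                   # per-subspace code assignments
    return codes                                   # (N, n_subspaces) discrete codes
```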

E. Full Objective

The VRQ loss aggregates all objectives: $\mathcal{L}_{\mathrm{rq}} = \beta_1\mathcal{L}_{\mathrm{cons}} + \beta_2\mathcal{L}_{\mathrm{mar}} + \beta_3\mathcal{L}_{\mathrm{commit}} + \beta_4\mathcal{L}_{\mathrm{hc}}$. This suggests direct tunability for different application domains by varying the weights $\beta_i$.
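
Continuing the sketches above, one training step might aggregate the objectives as follows; the beta weights are placeholders, not values reported in the paper, and f1, f2, v1 are assumed to come from a trainable encoder.

```python
# Illustrative aggregation of the full VRQ objective using the earlier sketches.
l_cons, l_mar = vrq_consistency_losses(f1, f2, v1)
codes1, levels1, l_commit = residual_quantize(f1, codebooks)
_, levels2, _ = residual_quantize(f2, codebooks)
l_hc = hierarchical_consistency(levels1, levels2)

loss = 1.0 * l_cons + 0.5 * l_mar + 0.25 * l_commit + 0.5 * l_hc  # beta_1..beta_4 (placeholders)
loss.backward()
```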

3. Properties: Alignment, Discrimination, and Personalization

By leveraging multi-view pairs, VRQ enforces codebook assignments that yield "identical or closely related codes" for the same product's varied images at shallow quantization levels. Category and business attribute augmentation ensure preserved discriminative capacity for item uniqueness at deeper levels.

In practice, this architecture supports:

  • Unification of query and catalog product images under consistent SIDs.
  • Robustness to viewpoint variation and background shifts.
  • Extension to personalized search via fused behavioral codes.

4. Integration into Generative Vision Retrieval Paths

Within the OneVision framework, VRQ serves as the encoding mechanism for SIDs, structuring the pipeline as follows:

  • SID Encoding Backbone: Query and candidate items are encoded using VRQ to generate discrete codes.
  • Unified Cross-modal Alignment: VRQ-encoded SIDs align images, textual product descriptions, category metadata, and behavioral/business signals.
  • Fine-Tuning for Collaborative Learning: Supervised training over VRQ-SID pairs enhances retrieval connections, especially between view-varying queries and catalog entries.
  • Personalization and Validity: The generative decoder produces SIDs restricted to the VRQ code space (sketched below), enhancing retrieval precision and quality assurance.
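
One way to realize this restriction is to mask decoder logits so that, at each SID position, only the tokens corresponding to that quantization level's codebook can be emitted. The sketch below assumes a token-id layout with one contiguous vocabulary block per level; this layout and greedy decoding are illustrative assumptions, not the OneVision implementation.

```python
import torch

def constrained_sid_step(logits, level, codebook_size, vocab_offset=0):
    """logits: (N, V) decoder logits for the current step; level: current SID position."""
    start = vocab_offset + level * codebook_size   # first valid token id for this level (assumed layout)
    end = start + codebook_size
    mask = torch.full_like(logits, float("-inf"))
    mask[:, start:end] = 0.0                       # keep only this level's code tokens
    return (logits + mask).argmax(dim=-1)          # greedy pick of a valid code token
```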

5. Comparative Advantages over Traditional Quantization Methods

VRQ improves upon standard methods such as RQ-KMeans, RQ-VAE, and OPQ by:

| Method    | Multi-view Consistency | Discriminative Power | Category Consistency | Personalization/Business Fusion | Generative Search Optimization |
|-----------|------------------------|----------------------|----------------------|---------------------------------|--------------------------------|
| RQ-KMeans | —                      | —                    | —                    | —                               | —                              |
| RQ-VAE    | Partial                | Partial              | Partial              | —                               | —                              |
| VRQ       | Yes                    | Yes                  | Yes                  | Yes                             | Yes                            |

Advantages include:

  • Superior consistency of multi-view codes, crucial for e-commerce retrieval.
  • Augmented discrimination via hierarchical structure and business attribute fusion.
  • Explicit mitigation of category-level confusion.
  • Optimized encoding for generative models (sequence decoding), supporting both retrieval and personalization.

6. Empirical Validation and Ablation Findings

VRQ demonstrates empirical improvements validated through multiple ablations and benchmark comparisons:

  • Quantitative Uplift: OneVision with VRQ achieves +8.64% HR@1 and +11.07% HR@4 over RQ-KMeans (Table 1).
  • Business Feature Impact: Fusion with behavioral and commercial features yields an additional +3.22 HR and +3.55 MRR points.
  • Model Depth and Codebook Size: Larger codebook depth/size improves retrieval quality up to saturation points (Fig. 4).
  • Category-level Generalization: Multi-view alignment via VRQ lifts CTR in both frequent and rare product categories.
  • Offline/Online Trials: Real-world A/B tests demonstrate significant gains (+2.15% item CTR, +2.27% CVR, +3.12% orders, +3.60% OPM; Table 5) over MCA baselines.

Ablations reveal VRQ's dependency on multi-view contrastive objectives and business attribute fusion; omission degrades performance. FSQ achieves high quantization assignment scores but fails in generative retrieval, indicating VRQ’s critical role in semantic alignment.

7. Significance and Prospective Implications

VRQ advances the state-of-the-art for representation quantization in vision search, directly addressing the challenge of cross-view consistency, semantic discrimination, and personalization at scale. Its hierarchical, supervised training, coupled with integration of auxiliary (business, behavioral) signals, positions VRQ as the encoding backbone for end-to-end generative frameworks in complex catalog environments.

A plausible implication is the extensibility of VRQ beyond e-commerce retrieval to any domain requiring robust, consistent, and discriminative quantized representations across varied views, such as multi-modal medical imaging or video surveillance. The modularity of its loss functions enables further tuning for specialized objectives, subject to future empirical study.

References: VRQ is introduced and evaluated within the OneVision framework (Zheng et al., 7 Oct 2025); comparative quantization approaches are discussed in the context of e-commerce retrieval; implementation details and all empirical findings are traceable to the cited sections, equations, and tables of the referenced work.
