Vision-Aligned Residual Quantization (VRQ)
- The paper introduces VRQ, which uses supervised, hierarchical residual quantization to enforce multi-view consistency and fine-grained discrimination in visual retrieval.
 - It integrates contrastive and margin losses with business attribute augmentation to produce robust semantic IDs for unified e-commerce search.
 - Empirical evaluations show significant improvements in HR, CTR, and personalization over traditional methods, validating VRQ’s effectiveness in generative retrieval frameworks.
 
Vision-aligned Residual Quantization (VRQ) is a supervised, hierarchical representation learning and quantization technique for multi-view visual retrieval. It aligns feature embeddings of the same object captured under varied perspectives while enabling fine-grained discrimination among products and efficient integration of business-relevant attributes. VRQ serves as the encoding backbone in end-to-end generative retrieval frameworks, such as OneVision, providing discrete semantic IDs (SIDs) for unified vision search, personalization, and catalog management in e-commerce environments (Zheng et al., 7 Oct 2025).
1. Conceptual Motivation and Multi-View Alignment
VRQ addresses the problem of discrepant representations in multi-view e-commerce vision search, wherein a product may be depicted in various images reflecting different backgrounds, orientations, or acquisition conditions. Conventional quantization methods (e.g., VQ-VAE, RQ-KMeans, OPQ) often yield inconsistent codes for the same object, impeding reliable recall and ranking.
The essential goals for VRQ are:
- Multi-view Consistency: Ensuring that different views/images of the same product map to the same codes in the shallow quantization levels.
 - Discrimination: Maintaining the separability of distinct products, especially at deeper quantization levels.
 - Supervised Codebook Training: Exploiting annotated product metadata (category, behavioral signals) to enhance the training of codebooks for residual quantization.
 
VRQ integrates multi-view contrastive learning objectives and supervised category alignment to enforce these properties, directly mitigating representation drift and reducing mismatches in generative retrieval.
2. Mathematical Formulation and Encoding Pipeline
The VRQ workflow is characterized by hierarchical, trainable residual quantization with the following key steps:
A. Feature Fusion
Let an image and its category label be encoded as:
- Visual feature: $v \in \mathbb{R}^d$, produced by the image encoder.
- Category feature: $c \in \mathbb{R}^d$, an embedding of the product's category metadata.

A dynamic fusion produces the final representation

$$z = g \odot v + (1 - g) \odot \mathrm{MLP}(t),$$

where the gate $g = \sigma(\mathrm{MLP}([v; c]))$ is computed by an MLP with sigmoid activation over the concatenated features $[v; c]$, and $\mathrm{MLP}(t)$ is an additional MLP mapping of the image-text embedding $t$.
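A minimal sketch of this gated fusion is shown below, assuming PyTorch; the tensor names (`v`, `c`, `t`), layer widths, and the exact gating form are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Gated fusion of visual, category, and image-text features (illustrative sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        # MLP + sigmoid gate over the concatenated visual and category features
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim), nn.Sigmoid()
        )
        # additional MLP mapping of the image-text embedding
        self.text_proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, v: torch.Tensor, c: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([v, c], dim=-1))        # per-dimension gate in [0, 1]
        return g * v + (1.0 - g) * self.text_proj(t)    # fused representation z

# usage
fusion = DynamicFusion(dim=256)
v, c, t = (torch.randn(8, 256) for _ in range(3))
z = fusion(v, c, t)                                     # shape (8, 256)
```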
B. Multi-View Contrastive and Category Loss
Let $z_i^{(1)}$ and $z_i^{(2)}$ be fused features from two views of product $i$. A multi-view contrastive objective treats the two views as a positive pair against in-batch negatives:

$$\mathcal{L}_{\mathrm{con}} = -\sum_i \log \frac{\exp\!\big(\mathrm{sim}(z_i^{(1)}, z_i^{(2)})/\tau\big)}{\sum_j \exp\!\big(\mathrm{sim}(z_i^{(1)}, z_j^{(2)})/\tau\big)}.$$

A margin loss further promotes category-level separation:

$$\mathcal{L}_{\mathrm{margin}} = \sum_{i \neq j} \max\!\big(0,\; m - (s_{ii} - s_{ij})\big),$$

where $s_{ij}$ is the similarity between the $i$-th and $j$-th fused features and $m$ is the margin.
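The two objectives admit a straightforward sketch, assuming an InfoNCE-style contrastive term over in-batch negatives and a hinge-style margin on cosine similarities; the temperature `tau`, margin `m`, and masking scheme are assumptions, not reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(z1, z2, tau: float = 0.07):
    """InfoNCE-style loss: two views of the same product are positives, other items in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                              # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)

def category_margin_loss(z1, z2, labels, m: float = 0.2):
    """Hinge loss pushing cross-category similarities below same-item similarity by margin m."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.t()                                       # s_ij: view 1 of item i vs. view 2 of item j
    pos = sim.diag().unsqueeze(1)                           # s_ii
    neg_mask = labels.unsqueeze(0) != labels.unsqueeze(1)   # only penalize different-category pairs
    hinge = F.relu(m - (pos - sim)) * neg_mask
    return hinge.sum() / neg_mask.sum().clamp(min=1)
```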
C. Hierarchical Residual Quantization
The fused feature $z$ is decomposed via $L$ codebook layers, with initial residual $r_0 = z$. For $l = 1, \dots, L$:
- Select code: $k_l = \arg\min_k \lVert r_{l-1} - e^{(l)}_k \rVert_2$, where $e^{(l)}_k$ is the $k$-th centroid of the level-$l$ codebook.
- Residual update: $r_l = r_{l-1} - e^{(l)}_{k_l}$.
- Quantized feature at level $l$: $\hat{z}_l = \sum_{m=1}^{l} e^{(m)}_{k_m}$.

A commitment loss ensures assignment proximity:

$$\mathcal{L}_{\mathrm{commit}} = \sum_{l=1}^{L} \big\lVert r_{l-1} - \mathrm{sg}\big[e^{(l)}_{k_l}\big] \big\rVert_2^2,$$

where $\mathrm{sg}[\cdot]$ denotes stop-gradient. A hierarchical consistency loss for multi-view pairs additionally penalizes disagreement between the level-wise quantized features $\hat{z}^{(1)}_l$ and $\hat{z}^{(2)}_l$ of two views of the same product, with emphasis on the shallow levels.
Codebooks are initialized with RQ-KMeans centroids, followed by full VRQ training.
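A compact sketch of the residual quantization loop with a VQ-VAE-style commitment loss and straight-through gradients appears below; the codebook sizes, the commitment weighting `beta`, and the random initialization (rather than RQ-KMeans) are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualQuantizer(nn.Module):
    """L-level residual quantization with a commitment loss (illustrative, VQ-VAE-style)."""

    def __init__(self, num_levels: int, codebook_size: int, dim: int, beta: float = 0.25):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, dim) * 0.02) for _ in range(num_levels)]
        )
        self.beta = beta

    def forward(self, z):
        residual, quantized, codes, commit = z, torch.zeros_like(z), [], 0.0
        for codebook in self.codebooks:
            dist = torch.cdist(residual, codebook)           # (B, K) distances to centroids
            idx = dist.argmin(dim=-1)                        # nearest code at this level
            e = codebook[idx]
            commit = commit + F.mse_loss(residual, e.detach()) * self.beta \
                             + F.mse_loss(residual.detach(), e)
            quantized = quantized + e
            residual = residual - e.detach()                 # residual passed to the next level
            codes.append(idx)
        quantized = z + (quantized - z).detach()             # straight-through estimator
        return quantized, torch.stack(codes, dim=-1), commit

# usage
rq = ResidualQuantizer(num_levels=3, codebook_size=512, dim=256)
z_q, sids, commit_loss = rq(torch.randn(8, 256))             # sids: (8, 3) discrete semantic IDs
```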
D. Residual OPQ and Business Attribute Augmentation
Deep-level (final) residuals are further augmented with business-related statistics (e.g., clicks, conversion, price) and quantized via OPQ for additional discrimination. This permits differentiation among visually similar—but semantically or commercially distinct—items.
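One plausible rendering of this step uses the off-the-shelf faiss OPQ/PQ primitives and assumes the business statistics are z-normalized and concatenated to the final residual; the dimensions and sub-quantizer settings below are illustrative, not the paper's configuration.

```python
import numpy as np
import faiss  # pip install faiss-cpu

def quantize_deep_residuals(residuals: np.ndarray, business_stats: np.ndarray,
                            M: int = 8, nbits: int = 8) -> np.ndarray:
    """OPQ-rotate, then product-quantize, final residuals augmented with business statistics."""
    # z-normalize business features (e.g. clicks, conversions, price) and concatenate
    stats = (business_stats - business_stats.mean(0)) / (business_stats.std(0) + 1e-6)
    x = np.hstack([residuals, stats]).astype("float32")
    d = x.shape[1]                      # must be divisible by M for product quantization

    opq = faiss.OPQMatrix(d, M)         # learn a rotation that balances sub-space variance
    opq.train(x)
    x_rot = opq.apply_py(x)

    pq = faiss.ProductQuantizer(d, M, nbits)
    pq.train(x_rot)
    return pq.compute_codes(x_rot)      # (N, M) uint8 codes appended to the semantic ID

# usage: 61-dim deep residuals + 3 business statistics = 64 dims, divisible by M = 8
codes = quantize_deep_residuals(np.random.randn(10000, 61), np.random.rand(10000, 3))
```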
E. Full Objective
The VRQ loss aggregates all objectives:

$$\mathcal{L}_{\mathrm{VRQ}} = \lambda_1 \mathcal{L}_{\mathrm{con}} + \lambda_2 \mathcal{L}_{\mathrm{margin}} + \lambda_3 \mathcal{L}_{\mathrm{commit}} + \lambda_4 \mathcal{L}_{\mathrm{cons}},$$

where the weights $\lambda_i$ provide direct tunability for different application domains.
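Building on the sketches above (`multiview_contrastive_loss`, `category_margin_loss`, `ResidualQuantizer`), a training step might aggregate the terms as follows; the $\lambda$ weights and the MSE form of the consistency term are placeholders rather than reported settings.

```python
import torch.nn.functional as F

def vrq_loss(z1, z2, labels, rq, lambdas=(1.0, 0.5, 1.0, 0.5)):
    """Weighted sum of the VRQ objectives; reuses the helper sketches defined earlier."""
    l_con = multiview_contrastive_loss(z1, z2)          # multi-view alignment
    l_margin = category_margin_loss(z1, z2, labels)     # category-level separation
    zq1, codes1, l_commit1 = rq(z1)                     # residual quantization, view 1
    zq2, codes2, l_commit2 = rq(z2)                     # residual quantization, view 2
    l_cons = F.mse_loss(zq1, zq2)                       # assumed multi-view consistency term
    l1, l2, l3, l4 = lambdas
    return l1 * l_con + l2 * l_margin + l3 * (l_commit1 + l_commit2) + l4 * l_cons
```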
3. Properties: Alignment, Discrimination, and Personalization
By leveraging multi-view pairs, VRQ enforces codebook assignments that yield "identical or closely related codes" for a product's varied images at shallow quantization levels, while category and business-attribute augmentation preserve discriminative capacity at deeper levels so that distinct items remain separable.
In practice, this architecture supports:
- Unification of query and catalog product images under consistent SIDs.
 - Robustness to viewpoint variation and background shifts.
 - Extension to personalized search via fused behavioral codes.
 
4. Integration into Generative Vision Retrieval Paths
Within the OneVision framework, VRQ serves as the encoding mechanism for SIDs, structuring the pipeline as follows:
- SID Encoding Backbone: Query and candidate items are encoded using VRQ to generate discrete codes.
 - Unified Cross-modal Alignment: VRQ-encoded SIDs align images, textual product descriptions, category metadata, and behavioral/business signals.
 - Fine-Tuning for Collaborative Learning: Supervised training over VRQ-SID pairs enhances retrieval connections, especially between view-varying queries and catalog entries.
 - Personalization and Validity: The generative decoder is restricted to SIDs within the VRQ code space, enhancing retrieval precision and guaranteeing valid outputs (see the sketch after this list).
 
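Validity is typically enforced by constraining generation to code sequences that exist in the catalog; the sketch below illustrates one such mechanism with a prefix trie over VRQ SIDs, using hypothetical names rather than OneVision's actual decoder interface.

```python
from typing import Dict, List, Tuple

class SIDTrie:
    """Prefix trie over catalog semantic IDs; used to mask invalid codes during generative decoding."""

    def __init__(self, catalog_sids: List[Tuple[int, ...]]):
        self.root: Dict = {}
        for sid in catalog_sids:
            node = self.root
            for code in sid:
                node = node.setdefault(code, {})

    def allowed_next_codes(self, prefix: Tuple[int, ...]) -> List[int]:
        """Codes that keep the partially decoded SID inside the VRQ code space."""
        node = self.root
        for code in prefix:
            node = node.get(code, {})
        return list(node.keys())

# usage: at each decoding step, restrict the decoder's softmax to the allowed codes
trie = SIDTrie([(3, 17, 102), (3, 17, 250), (8, 4, 91)])
print(trie.allowed_next_codes((3, 17)))   # -> [102, 250]
```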
5. Comparative Advantages over Traditional Quantization Methods
VRQ improves upon standard methods such as RQ-KMeans, RQ-VAE, and OPQ, as summarized in the following comparison:
| Method | Multi-view Consistency | Discriminative Power | Category Consistency | Personalization/Business Fusion | Generative Search Optimization | 
|---|---|---|---|---|---|
| RQ-KMeans | ✗ | ✓ | ✗ | ✗ | ✗ | 
| RQ-VAE | Partial | Partial | ✗ | ✗ | Partial | 
| VRQ | ✓ | ✓ | ✓ | ✓ | ✓ | 
Advantages include:
- Superior consistency of multi-view codes, crucial for e-commerce retrieval.
 - Augmented discrimination via hierarchical structure and business attribute fusion.
 - Explicit mitigation of category-level confusion.
 - Optimized encoding for generative models (sequence decoding), supporting both retrieval and personalization.
 
6. Empirical Validation and Ablation Findings
VRQ demonstrates empirical improvements validated through multiple ablations and benchmark comparisons:
- Quantitative Uplift: OneVision with VRQ achieves +8.64% HR@1 and +11.07% HR@4 over RQ-KMeans (Table 1).
 - Business Feature Impact: Fusion with behavioral and commercial features yields an additional +3.22 HR and +3.55 MRR points.
 - Model Depth and Codebook Size: Larger codebook depth/size improves retrieval quality up to saturation points (Fig. 4).
 - Category-level Generalization: Multi-view alignment via VRQ lifts CTR in both frequent and rare product categories.
 - Offline/Online Trials: Real-world A/B tests demonstrate significant gains (+2.15% item CTR, +2.27% CVR, +3.12% orders, +3.60% OPM; Table 5) over MCA baselines.
 
Ablations reveal VRQ's dependence on the multi-view contrastive objectives and business-attribute fusion; omitting either degrades performance. FSQ attains high quantization assignment scores but fails in generative retrieval, underscoring the importance of VRQ's supervised semantic alignment.
7. Significance and Prospective Implications
VRQ advances the state-of-the-art for representation quantization in vision search, directly addressing the challenge of cross-view consistency, semantic discrimination, and personalization at scale. Its hierarchical, supervised training, coupled with integration of auxiliary (business, behavioral) signals, positions VRQ as the encoding backbone for end-to-end generative frameworks in complex catalog environments.
A plausible implication is the extensibility of VRQ beyond e-commerce retrieval to any domain requiring robust, consistent, and discriminative quantized representations across varied views, such as multi-modal medical imaging or video surveillance. The modularity of its loss functions enables further tuning for specialized objectives, subject to future empirical study.
References: VRQ is introduced and evaluated within the OneVision framework (Zheng et al., 7 Oct 2025); comparative quantization approaches are discussed in the context of e-commerce retrieval; implementation details and all empirical findings are traceable to the cited sections, equations, and tables of the referenced work.