
Vision-Aligned Residual Quantization (VRQ)

Updated 19 January 2026
  • Vision-Aligned Residual Quantization (VRQ) is a hierarchical encoding approach that fuses multi-view semantic alignment with fine-grained residual quantization for robust retrieval and deduplication.
  • It integrates shallow codebooks to capture shared semantic features with deeper layers that encode instance-specific residual details, ensuring consistency and category coherence.
  • VRQ employs multi-view contrastive losses and OPQ residual refinement, achieving improved retrieval performance and efficient mixed-precision quantization in empirical evaluations.

Vision-Aligned Residual Quantization (VRQ) is a hierarchical encoding and quantization framework designed to maintain semantic alignment across multi-view representations, while preserving fine-grained residual details for individual instances. VRQ is notable for its broad applicability in vision retrieval, generative architectures, and cross-modal models requiring consistent representation mapping. It combines deep residual quantization with multi-view semantic constraints and contrastive objectives to achieve highly performant and efficient encoding, particularly in retrieval and vision-language-action domains (Zheng et al., 7 Oct 2025, Jiang et al., 27 May 2025).

1. Formal Definition and Design Objectives

VRQ is a multi-level residual quantization approach for encoding visual (and optionally textual and categorical) embeddings. Given an image embedding $f$ of a product or object, VRQ deterministically produces a code sequence $(c_0, \dots, c_{L-1})$ referred to as the "Semantic ID" (SID) (Zheng et al., 7 Oct 2025). The encoding is organized hierarchically:

  • Shallow layers (small $\ell$): Capture coarse semantic features common to all views or instances from the same product, enforcing multi-view alignment.
  • Deeper layers (large $\ell$): Encode residual, product-specific information that differentiates near-duplicate but distinct instances.
  • Category-consistency: Achieved via explicit category-feature injection and margin-based losses, ensuring intra-category code coherence.

Design requirements include:

  • Consistency: Different viewpoints of the same product (e.g., front/side views) yield identical or proximate codes in shallow layers.
  • Uniqueness: Separable residual codes for distinct products, supporting robust deduplication.
  • Category-Code Coherence: Category-level clusters with suppression of off-category noise.

2. Encoder Architecture and Workflow

The VRQ encoder operates atop a visual backbone and executes in three stages (Zheng et al., 7 Oct 2025):

A) Feature Extraction & Fusion

  • Extract visual vector $v = \mathcal{E}_v(x)$ from image $x$.
  • Extract category/text vector $t = \mathcal{E}_t(y)$ from label $y$.
  • Fuse features:

$$f = (1-\alpha)\cdot v + \alpha\cdot t + f_{cat}, \qquad \alpha = \text{sigmoid}(\text{MLP}([v;t])), \quad f_{cat} = \text{MLP}([v;t])$$
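
The fusion equation above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the gate $\alpha$ is treated as a scalar, each MLP is replaced by a single affine layer, and the names `linear` and `fuse` are hypothetical.

```python
import math

def linear(weights, bias, x):
    """Apply one affine layer; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def fuse(v, t, gate_w, gate_b, cat_w, cat_b):
    """Gated fusion f = (1 - alpha) * v + alpha * t + f_cat."""
    vt = v + t  # list concatenation, i.e. [v; t]
    # Assumption: a single linear unit stands in for the gating MLP.
    gate_logit = linear([gate_w], [gate_b], vt)[0]
    alpha = 1.0 / (1.0 + math.exp(-gate_logit))
    # Assumption: a single linear layer stands in for the category MLP.
    f_cat = linear(cat_w, cat_b, vt)
    f = [(1.0 - alpha) * vi + alpha * ti + ci
         for vi, ti, ci in zip(v, t, f_cat)]
    return f, alpha

v = [1.0, 0.0]  # toy visual vector
t = [0.0, 1.0]  # toy category/text vector
# Zero gate weights give alpha = 0.5; zero category weights give f_cat = 0.
f, alpha = fuse(v, t, [0.0] * 4, 0.0, [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
print(alpha, f)
```

With a neutral gate the fused vector is the midpoint of $v$ and $t$; a trained gate would shift the mix toward whichever modality is more informative for the item.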

B) Hierarchical RQ-VAE Encoding ($\ell = 0, \dots, L_1 - 1$)

  • Initialize $r_0 = f$.
  • For each layer:
    • Assign code: $c_\ell = \arg\min_{k \in [K_\ell]} \|r_\ell - e_k^{(\ell)}\|_2$
    • Update residual: $r_{\ell+1} = r_\ell - e_{c_\ell}^{(\ell)}$
  • Shallow codebooks $E^{(0)}, \dots, E^{(L_1-1)}$ are trained with contrastive alignment objectives.
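
The assign-and-subtract loop above can be sketched as follows. The codebooks are toy values chosen for illustration, and `nearest_code` and `rq_encode` are hypothetical helper names; training of the codebooks is not shown.

```python
def nearest_code(residual, codebook):
    """Index of the codeword closest to `residual` in squared L2 distance."""
    def sq_dist(codeword):
        return sum((r - e) ** 2 for r, e in zip(residual, codeword))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def rq_encode(f, codebooks):
    """Greedy residual quantization; returns the codes and the final residual."""
    residual = list(f)                      # r_0 = f
    codes = []
    for codebook in codebooks:              # one codebook per layer
        c = nearest_code(residual, codebook)
        codes.append(c)
        # r_{l+1} = r_l - e_{c_l}
        residual = [r - e for r, e in zip(residual, codebook[c])]
    return codes, residual

codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],    # layer 0: coarse codewords
    [[0.0, 0.0], [1.0, -1.0]],   # layer 1: finer corrections
]
codes, residual = rq_encode([5.0, 3.0], codebooks)
print(codes, residual)
```

Each layer only has to explain what the previous layers left over, which is why early codes carry coarse structure and later codes carry detail.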

C) Residual OPQ Refinement ($\ell = L_1, \dots, L - 1$)

  • Fuse final residual $r_{L_1}$ with business statistics $b$ (e.g., clicks, price).
  • Quantize with Optimized Product Quantization (OPQ) to produce $c_{L_1}, \dots, c_{L-1}$.
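
A rough sketch of this stage, under several assumptions: fusion is done by simple concatenation (the text does not specify the operator), the rotation matrix is supplied rather than learned, and every sub-codebook is a toy value. Only the encoding step of a rotated product quantizer is shown; the optimization of the rotation that gives OPQ its name is omitted, and `opq_encode` is a hypothetical name.

```python
def nearest_code(x, codebook):
    """Index of the codeword closest to `x` in squared L2 distance."""
    def sq_dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(x, codeword))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def opq_encode(residual, business, rotation, sub_codebooks):
    """Rotate the fused vector, split it into chunks, quantize each chunk."""
    x = residual + business                 # assumption: fuse by concatenation
    rotated = [sum(r * xi for r, xi in zip(row, x)) for row in rotation]
    width = len(rotated) // len(sub_codebooks)   # assumes an even split
    codes = []
    for j, codebook in enumerate(sub_codebooks):
        chunk = rotated[j * width:(j + 1) * width]
        codes.append(nearest_code(chunk, codebook))
    return codes

# A permutation matrix is a valid (if trivial) orthogonal rotation.
rotation = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
sub_codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],   # codebook for the first chunk
    [[0.0, 2.0], [3.0, 3.0]],   # codebook for the second chunk
]
codes = opq_encode([0.9, 0.1], [0.0, 2.1], rotation, sub_codebooks)
print(codes)
```

Unlike the residual layers, the chunks here are quantized independently, so each extra code position covers a different slice of the rotated vector instead of refining the previous code.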

The full SID $(c_0, \dots, c_{L-1})$ can be decoded as $\hat{y} = \sum_{\ell=0}^{L-1} e_{c_\ell}^{(\ell)}$ or consumed as symbolic input to downstream modules.

3. Mathematical Formulation and Loss Functions

VRQ combines several loss terms to enforce multi-view and hierarchical consistency:

  • Multi-View Alignment Losses:

    • Pairwise contrastive alignment:

    $$L_{align} = \lambda_1 L_{cl} + \lambda_2 L_{circle}$$

    where $L_{cl}$ is the SimCLR contrastive loss and $L_{circle}$ is the circle loss.

    • Fused-feature consistency:

    $$L_{cons} = -\frac{1}{N} \sum_{i=1}^N \left[\log\frac{\exp(f_i^{(1)} \cdot f_i^{(2)}/\tau)}{\sum_{j=1}^N \exp(f_i^{(1)} \cdot f_j^{(2)}/\tau)} + \log\frac{\exp(v_i^{(1)} \cdot f_i^{(2)}/\tau)}{\sum_{j=1}^N \exp(v_i^{(1)} \cdot f_j^{(2)}/\tau)}\right]$$

    • Category margin loss:

    $$L_{mar} = \frac{1}{N} \sum_{i,j} \max(0, -\gamma + s_{ij}^{f2f} - s_{ii}^{v2f})$$

    with $s_{ij}^{f2f} = f_i \cdot f_j$ and $s_{ii}^{v2f} = v_i \cdot f_i$.
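
The consistency and margin terms can be evaluated directly from their definitions. The sketch below follows the formulas as written, including the sign of $\gamma$ and a sum over all pairs $(i, j)$ with $j = i$ included, which is an assumption since the text does not say whether the diagonal is excluded. Function names are hypothetical, and only loss values are computed, not gradients.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_diag(queries, keys, tau):
    """Log-probability that query i assigns to its own key i."""
    out = []
    for i, q in enumerate(queries):
        logits = [dot(q, k) / tau for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        out.append(logits[i] - log_z)
    return out

def consistency_loss(f1, f2, v1, tau):
    """L_cons: fused-to-fused plus visual-to-fused terms over two views."""
    ff = log_softmax_diag(f1, f2, tau)
    vf = log_softmax_diag(v1, f2, tau)
    return -sum(a + b for a, b in zip(ff, vf)) / len(f1)

def margin_loss(f, v, gamma):
    """L_mar as written: hinge on -gamma + s_ij^{f2f} - s_ii^{v2f}."""
    total = 0.0
    for i in range(len(f)):
        s_ii = dot(v[i], f[i])
        for j in range(len(f)):
            total += max(0.0, -gamma + dot(f[i], f[j]) - s_ii)
    return total / len(f)

views = [[1.0, 0.0], [0.0, 1.0]]           # two items, already aligned
aligned = consistency_loss(views, views, views, tau=0.1)
swapped = consistency_loss(views, views[::-1], views, tau=0.1)
print(aligned < swapped)
```

When the second view is paired with the wrong item (`swapped`), the loss rises sharply, which is the signal that pulls matching views together during training.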

  • Commitment Loss for RQ-VAE Codebooks:

    $$L_{commit} = \sum_{\ell=0}^{L_1-1} \| r_\ell - \text{sg}(e_{c_\ell}^{(\ell)}) \|^2$$

    using stop-gradient and EMA updates.
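
Because $r_\ell - e_{c_\ell}^{(\ell)}$ is exactly the next residual $r_{\ell+1}$, the value of the commitment loss is the sum of squared norms of the residuals left after each layer. The sketch below computes only that value with toy codebooks; the stop-gradient affects gradients under automatic differentiation and has no effect on the number computed here. Helper names are hypothetical.

```python
def nearest_code(residual, codebook):
    """Index of the codeword closest to `residual` in squared L2 distance."""
    def sq_dist(codeword):
        return sum((r - e) ** 2 for r, e in zip(residual, codeword))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def commitment_loss(f, codebooks):
    """Sum over layers of ||r_l - e_{c_l}||^2 (value only, no gradients)."""
    residual = list(f)
    loss = 0.0
    for codebook in codebooks:
        c = nearest_code(residual, codebook)
        residual = [r - e for r, e in zip(residual, codebook[c])]
        loss += sum(r * r for r in residual)   # ||r_{l+1}||^2
    return loss

codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],
    [[0.0, 0.0], [1.0, -1.0]],
]
loss = commitment_loss([5.0, 3.5], codebooks)
print(loss)
```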

  • Hierarchical Consistency:

    $$L_{hc} = \sum_{\ell=0}^{L_1-1} \| \hat{y}_i^{(\ell)} - \hat{y}_j^{(\ell)} \|^2, \quad \hat{y}^{(\ell)} = \sum_{k=0}^{\ell} e_{c_k}^{(k)}$$
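
    A small sketch of this term, assuming $i$ and $j$ index two views of the same product (the text does not define them explicitly). The codebooks are toy values and the function names are hypothetical.

```python
def prefix_reconstructions(codes, codebooks):
    """Partial sums y_hat^(l) = sum_{k <= l} e_{c_k}^{(k)} for every layer l."""
    running = [0.0] * len(codebooks[0][0])
    prefixes = []
    for c, codebook in zip(codes, codebooks):
        running = [y + e for y, e in zip(running, codebook[c])]
        prefixes.append(running)
    return prefixes

def hierarchical_consistency(codes_i, codes_j, codebooks):
    """Sum over layers of the squared distance between partial reconstructions."""
    pairs = zip(prefix_reconstructions(codes_i, codebooks),
                prefix_reconstructions(codes_j, codebooks))
    return sum(sum((a - b) ** 2 for a, b in zip(yi, yj)) for yi, yj in pairs)

codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],    # layer 0
    [[0.0, 0.0], [1.0, -1.0]],   # layer 1
]
deep_mismatch = hierarchical_consistency([1, 1], [1, 0], codebooks)
shallow_mismatch = hierarchical_consistency([1, 1], [0, 1], codebooks)
print(deep_mismatch, shallow_mismatch)
```

    In this toy setup a disagreement at layer 0 costs far more than one at layer 1, partly because the shallow error is carried into every later partial sum.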

  • Full VRQ Objective:

    $$L_{VRQ} = \beta_1 L_{cons} + \beta_2 L_{mar} + \beta_3 L_{commit} + \beta_4 L_{hc}$$

4. Training, Initialization, and Regularization

Codebooks are initialized via RQ-KMeans centroids and updated by backpropagation on $L_{VRQ}$, utilizing exponential moving average (EMA) for stability. The deeper OPQ layers jointly quantize the residual and auxiliary ("business") features. Multi-view contrastive batches of size 4096, with temperature $\tau$ and margin $\gamma$ tuned on validation, are standard. This maintains inter-layer and inter-instance differentiation even at scale (Zheng et al., 7 Oct 2025).
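
The text does not spell out the EMA rule, so the sketch below uses the update commonly applied to VQ-VAE codebooks: running cluster sizes and running sums of assigned vectors are smoothed with a decay factor, and each codeword is their ratio. Whether VRQ uses exactly this form is an assumption, and `ema_codebook_update` is a hypothetical name.

```python
def ema_codebook_update(cluster_size, embed_sum, vectors, codes, decay=0.99):
    """One EMA step; returns updated (cluster_size, embed_sum, codebook)."""
    k, dim = len(cluster_size), len(embed_sum[0])
    counts = [0.0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for x, c in zip(vectors, codes):        # batch statistics per codeword
        counts[c] += 1.0
        sums[c] = [s + xi for s, xi in zip(sums[c], x)]
    new_size = [decay * n + (1.0 - decay) * cnt
                for n, cnt in zip(cluster_size, counts)]
    new_sum = [[decay * m + (1.0 - decay) * s for m, s in zip(m_row, s_row)]
               for m_row, s_row in zip(embed_sum, sums)]
    # Guard against division by zero for codewords that were never used.
    codebook = [[m / max(n, 1e-12) for m in m_row]
                for m_row, n in zip(new_sum, new_size)]
    return new_size, new_sum, codebook

# One codeword at the origin, one assigned vector, decay 0.5:
# the codeword moves halfway toward the assigned vector.
size, total, codebook = ema_codebook_update(
    [1.0], [[0.0, 0.0]], [[2.0, 4.0]], [0], decay=0.5)
print(codebook)
```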

Low-rank residual quantization and dequantization, as described in EaqVLA (Jiang et al., 27 May 2025), can further reduce cross-modal misalignment:

  • Compute residual: $\Delta W_\ell = W_\ell^{FP16} - Q(W_\ell^{b_\ell})$
  • Encode as $U_\ell V_\ell^T$, with $U_\ell, V_\ell$ in INT8, and dequantize at inference via $Q^+(W_\ell) = Q(W_\ell^{b_\ell}) + \alpha (U_\ell V_\ell^T)$, $\alpha \in [0,1]$.
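
The two steps above can be illustrated with NumPy. This is a sketch under assumptions: a symmetric per-tensor uniform quantizer stands in for $Q$, the low-rank factors come from a truncated SVD, and the factors are kept in floating point for brevity where the text specifies INT8 storage. Function names are hypothetical.

```python
import numpy as np

def uniform_quantize(w, bits):
    """Symmetric per-tensor uniform quantizer (a stand-in for Q; assumption)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

def low_rank_residual(w, bits, rank):
    """Quantize w, then approximate the quantization error by U @ V.T."""
    w_q = uniform_quantize(w, bits)
    delta = w - w_q                                   # residual Delta W
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    root = np.sqrt(s[:rank])                          # split singular values evenly
    return w_q, u[:, :rank] * root, vt[:rank, :].T * root

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16))
w_q, u_r, v_r = low_rank_residual(w, bits=4, rank=4)

alpha = 1.0                                           # blend factor in [0, 1]
w_plus = w_q + alpha * (u_r @ v_r.T)                  # corrected dequantization
err_plain = np.linalg.norm(w - w_q)
err_corrected = np.linalg.norm(w - w_plus)
print(err_corrected < err_plain)
```

The truncated SVD is the best rank-limited approximation of the residual in Frobenius norm, so adding it back with $\alpha = 1$ cannot increase the reconstruction error.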

5. Empirical Evaluation and Ablation

In the reported results, VRQ improves retrieval quality and alignment over the RQ-KMeans baseline; entries marked "-" were not reported.

| Encoder Method | HR@10 (%) | MRR@10 (%) | HR@4 (%) | Code Occupancy (ICO) |
|---|---|---|---|---|
| RQ-KMeans | 77.4 | 58.6 | 89.98 | 4.84 |
| VRQ (no personalization) | 82.29 | 62.46 | 94.13 | 3.78 |
| Multi-stage cascade | 83.89 | 61.37 | - | - |
| GENIUS RQ-VAE | - | - | 92.34 | - |
| FSQ | - | - | 98.47 | - |
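
For readers unfamiliar with the metrics, the sketch below uses the standard definitions of hit rate and mean reciprocal rank at a cutoff $K$ with one relevant target per query. The exact evaluation protocol behind the reported numbers is not described in the text, so this is an assumption, and the item identifiers are invented for illustration.

```python
def hit_rate_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)

def mrr_at_k(ranked_lists, targets, k):
    """Mean of 1/rank of the target, counting 0 when it is outside the top-k."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)
    return total / len(targets)

ranked_lists = [
    ["a", "b", "c"],   # target at rank 1
    ["b", "a", "c"],   # target at rank 2
    ["c", "b", "d"],   # target missing
]
targets = ["a", "a", "a"]
hr = hit_rate_at_k(ranked_lists, targets, k=2)
mrr = mrr_at_k(ranked_lists, targets, k=2)
print(hr, mrr)
```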

Ablations reveal:

  • Incremental increases in depth $L$ and codebook size $K$ (e.g., $256 \to 2048$) sharply improve recall and mean reciprocal rank (MRR), saturating past $L=4$.
  • OPQ residual refinement yields HR@10 gains (e.g. 80.72% for RQ-OPQ, 82.29% for full VRQ).
  • VRQ achieves lower ICO (fewer collisions) and higher generative retrieval (GR) quality than unsupervised RQ-KMeans and vanilla RQ-VAE.

In the EaqVLA context, VRQ enables mixed-precision quantization levels (4/16/8/4 bits per module) to achieve >60% memory reduction and ×2.3–2.4 inference speedup, with task success rates within 1% of FP16 on LIBERO benchmarks (Jiang et al., 27 May 2025). Skipping projector quantization in VLA models is essential; otherwise, success rates drop by >30%.

6. Multi-View Alignment and Interpretability

A two-view example (e.g., shoe: front/side) illustrates VRQ's interpretability:

  • Extract separate visual embeddings, fuse with category text.
  • Shallow codebooks consistently assign general-product codes (e.g., “sneakers-general”, “white_upper”) across views.
  • At deeper (residual) layers, codes diverge to capture specific visual differences (e.g., “lace-curvature” versus “toe-shape”).
  • This demonstrates multi-view invariance in shallow codes and fine-grained detail in deeper residual codes, supporting both robust retrieval and deduplication.

This suggests VRQ's encoding is designed for cross-view symbolic alignment while retaining sufficient discriminative capacity for product-level identification.
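
One simple way to inspect this behaviour is to measure how many leading code positions two SIDs share. The SIDs below are invented for illustration and `shared_prefix_length` is a hypothetical helper.

```python
def shared_prefix_length(sid_a, sid_b):
    """Number of leading positions at which two SIDs carry the same code."""
    n = 0
    for a, b in zip(sid_a, sid_b):
        if a != b:
            break
        n += 1
    return n

front_view = (12, 7, 301, 45)   # hypothetical SID for the front view
side_view = (12, 7, 118, 90)    # hypothetical SID for the side view
other_item = (3, 7, 301, 45)    # hypothetical SID for an unrelated product

print(shared_prefix_length(front_view, side_view))
print(shared_prefix_length(front_view, other_item))
```

Comparing prefixes rather than counting matching positions reflects the hierarchy: a match at a deep layer means little if the shallow codes already disagree.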

7. Limitations and Future Prospects

Current implementations treat certain modules (e.g., cross-modal projectors) as atomic and do not quantize them, pointing to potential inefficiencies. The VRQ framework could be extended:

  • To include partial or structured quantization of currently indivisible blocks (e.g., projectors).
  • Via entropy-constrained residual (fusing) coding to further minimize memory overhead.
  • By incorporating higher-order gradient sensitivities or direct cross-attention alignment losses during training.
  • For broader embodied intelligence and generative retrieval settings, leveraging the semantic alignment and discriminative residuals characteristic of VRQ (Zheng et al., 7 Oct 2025, Jiang et al., 27 May 2025).

VRQ’s combination of multi-view contrastive alignment across shallow codebooks with OPQ for residuals produces semantically robust SIDs, enabling scalable, efficient, and interpretable encoding for vision-centric and cross-modal applications. Empirical benchmarking positions VRQ as competitive with both traditional and contemporary quantization techniques, approaching or surpassing industrial pipelines in efficiency and retrieval fidelity.
