- The paper presents an end-to-end generative framework that unifies retrieval and personalization by integrating vision-aligned residual quantization.

- It leverages a multi-stage pipeline with pretraining, supervised fine-tuning, and dynamic pruning to enhance accuracy and inference speed.

- Offline and online evaluations demonstrate significant gains in HitRate, Mean Reciprocal Rank, CTR, and order volume compared to traditional methods.

      OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search
The paper "OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search" presents a novel generative approach to e-commerce vision search, addressing limitations inherent to traditional multi-stage cascading architectures (MCA) used in vision search systems. The approach uniquely employs an end-to-end framework to enhance both retrieval and personalization by effectively integrating the stages of feature extraction, recall, pre-ranking, and ranking.
Vision-Aligned Residual Quantization
Hierarchical Encoding Challenges
The OneVision framework introduces Vision-aligned Residual Quantization (VRQ) to address shortcomings of the encoding methods currently used in generative retrieval. Existing quantizers such as FSQ and OPQ, and more general schemes such as VQ-VAE and RQ-KMeans, either lack hierarchical coherence when capturing attributes shared among similar items or fail to keep encodings unique. Both shortcomings hamper effective generative training and precise retrieval.
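To make the hierarchical-coding idea concrete, here is a minimal sketch of plain residual quantization in the spirit of RQ-KMeans: each level quantizes the residual left by the previous one, so items that are coarsely similar share code prefixes while deeper levels push toward a unique identifier. The codebook sizes, embedding dimension, and random codebook initialization are illustrative assumptions, not the paper's configuration.

```python
import torch

def residual_quantize(x, codebooks):
    """Map an embedding to one codeword index per level (a semantic ID).

    x: (d,) item embedding; codebooks: list of (K, d) tensors.
    Each level quantizes the residual left by the previous level, so early
    codes capture coarse attributes shared by similar items and later
    codes refine the encoding toward uniqueness.
    """
    residual = x.clone()
    sid = []
    for cb in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), cb)  # (1, K) distances to codewords
        idx = int(dists.argmin())
        sid.append(idx)
        residual = residual - cb[idx]  # pass what this level missed to the next
    return tuple(sid)

# Illustrative setup: a 3-level code with 256 codewords per level over 128-d embeddings.
torch.manual_seed(0)
codebooks = [torch.randn(256, 128) for _ in range(3)]
print(residual_quantize(torch.randn(128), codebooks))
```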
Implementation of VRQ
VRQ combines multi-view contrastive learning with residual encoding and leverages category information to improve encoding consistency and uniqueness. Contrastive alignment pulls together diverse visual representations of the same product across viewpoints (as sketched below), residual encoding preserves discriminative fine-grained features, and category supervision keeps encodings consistent within a category. Generative training then proceeds over these codes in a multi-stage pipeline.
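As a rough illustration of the multi-view alignment component, the sketch below implements a symmetric InfoNCE loss between two views of the same batch of products: embeddings of the same product from different viewpoints are pulled together, while other products in the batch are pushed apart. The two-view setup, temperature, and dimensions are assumptions for clarity; the paper's exact contrastive objective may differ.

```python
import torch
import torch.nn.functional as F

def multiview_infonce(view_a, view_b, temperature=0.07):
    """view_a, view_b: (B, d) embeddings of the same B products from two viewpoints."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature   # (B, B) cosine-similarity matrix
    targets = torch.arange(len(a))     # matching views sit on the diagonal
    # Symmetric loss: classify the correct partner in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = multiview_infonce(torch.randn(8, 128), torch.randn(8, 128))
```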
Multi-stage Generative Pipeline
Pretraining and Semantic Alignment
The framework begins with pretraining that aligns product images, categories, and semantic IDs (SIDs). The central task predicts an item's SID from its images, while additional modalities such as titles and categories reinforce the semantic alignment; these objectives are optimized with cross-entropy loss, as illustrated below.
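The sketch below shows the general shape of such a pretraining objective: a small transformer decoder predicts SID tokens autoregressively from visual features under a cross-entropy loss. The layer sizes, BOS handling, and three-level SID layout are placeholder assumptions, not the paper's actual backbone.

```python
import torch
import torch.nn as nn

class SIDDecoder(nn.Module):
    """Predicts a product's SID tokens autoregressively from visual features."""
    def __init__(self, vocab_size=256, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, image_tokens, sid_inputs):
        # image_tokens: (B, T, d) visual features, used as cross-attention memory;
        # sid_inputs: (B, L) teacher-forced SID codes.
        mask = nn.Transformer.generate_square_subsequent_mask(sid_inputs.size(1))
        h = self.decoder(self.embed(sid_inputs), memory=image_tokens, tgt_mask=mask)
        return self.head(h)  # (B, L, vocab_size) logits, one distribution per level

model = SIDDecoder()
image_tokens = torch.randn(2, 16, 128)                 # features from a vision encoder
sid_targets = torch.randint(0, 256, (2, 3))            # ground-truth 3-level SIDs
bos = torch.zeros(2, 1, dtype=torch.long)              # reuse code 0 as a BOS marker
sid_inputs = torch.cat([bos, sid_targets[:, :-1]], 1)  # shift right for teacher forcing
logits = model(image_tokens, sid_inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), sid_targets.reshape(-1))
```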
Supervised Fine-Tuning and Personalized Modeling
After pretraining, supervised fine-tuning (SFT) enables collaborative feature learning that maps multi-view images to the corresponding product SIDs. For personalized modeling, Direct Preference Optimization (DPO) leverages user behavior sequences to tailor retrieval results to user preferences; a list-wise DPO variant refines the candidate ranking based on user interaction data (see the sketch below).
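A minimal sketch of a list-wise DPO-style objective follows: a ranked candidate list is decomposed into preferred/dispreferred pairs, each scored by the gap between the policy's and a frozen reference model's log-probabilities. The pairwise decomposition and the `beta` value are assumptions; the paper's exact list-wise formulation may differ.

```python
import torch
import torch.nn.functional as F

def listwise_dpo_loss(logp_policy, logp_ref, beta=0.1):
    """logp_policy, logp_ref: (L,) sequence log-probs of L candidates,
    ordered from most to least preferred by user interaction data."""
    advantage = logp_policy - logp_ref  # implicit reward relative to the reference
    loss, n_pairs = 0.0, 0
    for i in range(len(advantage)):
        for j in range(i + 1, len(advantage)):
            # Candidate i is preferred over candidate j in the ranked list.
            loss = loss - F.logsigmoid(beta * (advantage[i] - advantage[j]))
            n_pairs += 1
    return loss / n_pairs

# Toy example with 3 candidates ranked by user interactions.
print(listwise_dpo_loss(torch.tensor([-3.1, -4.0, -5.2]),
                        torch.tensor([-3.5, -3.9, -5.0])))
```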
Dynamic Pruning and Efficient Inference
Pruning Techniques
To improve inference efficiency, OneVision applies dynamic pruning, selectively retaining the most informative visual tokens to reduce latency. K-means clustering compresses the visual tokens, and a distillation framework keeps the pruned model consistent with the original one.
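The sketch below shows a token-compression step under the assumption that scikit-learn's KMeans is an acceptable stand-in and that cluster centroids serve as the retained tokens. The distillation loss that keeps the pruned model consistent with the original (e.g., a KL term between their output distributions) is omitted here.

```python
import torch
from sklearn.cluster import KMeans

def compress_tokens(tokens, n_keep=16):
    """Compress (T, d) visual tokens into (n_keep, d) cluster centroids."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=0)
    km.fit(tokens.numpy())
    return torch.from_numpy(km.cluster_centers_).float()

tokens = torch.randn(256, 128)         # e.g. 256 patch tokens from a vision encoder
compact = compress_tokens(tokens, 16)  # 16x fewer tokens enter the decoder
print(compact.shape)                   # torch.Size([16, 128])
```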
End-to-End Inference Process
With VRQ encoding and pruning in place, OneVision runs beam search to produce ranked candidate sets, using a Trie to constrain decoding so that only valid SID sequences are generated. This yields a streamlined, efficient retrieval process.
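A minimal sketch of the Trie constraint: catalog SID sequences are inserted into a prefix tree, and at each decoding step only tokens that extend some valid sequence are allowed, so every finished beam decodes to a real product. The Trie API here is a hypothetical minimal version, not the paper's implementation.

```python
class Trie:
    """Prefix tree over valid SID sequences for constrained decoding."""
    def __init__(self):
        self.children = {}

    def insert(self, seq):
        node = self.children
        for tok in seq:
            node = node.setdefault(tok, {})

    def valid_next(self, prefix):
        """Return the set of tokens that legally extend `prefix`."""
        node = self.children
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node.keys())

trie = Trie()
trie.insert((17, 203, 88))  # SIDs of items actually in the catalog
trie.insert((17, 45, 6))

# During beam search, mask logits outside trie.valid_next(beam_prefix) to -inf.
print(trie.valid_next((17,)))  # {203, 45}
```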
Experimental Evaluation
Offline and Online Testing
In offline evaluations, OneVision matches or exceeds traditional MCA systems on both accuracy and efficiency benchmarks, with improvements in HitRate (HR) and Mean Reciprocal Rank (MRR) across multiple datasets. Online A/B tests on the Kuaishou platform confirm significant gains in business-critical metrics such as CTR, CVR, and order volume, underscoring the effectiveness of the unified retrieval and personalization approach.
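For reference, the two offline metrics named above can be computed as follows; the cutoff K and toy data are purely illustrative.

```python
def hit_rate_at_k(ranked_lists, targets, k=10):
    """Fraction of queries whose ground-truth item appears in the top-K results."""
    hits = sum(t in r[:k] for r, t in zip(ranked_lists, targets))
    return hits / len(targets)

def mrr(ranked_lists, targets):
    """Mean of 1/rank of the ground-truth item (0 when it is missing)."""
    total = 0.0
    for r, t in zip(ranked_lists, targets):
        if t in r:
            total += 1.0 / (r.index(t) + 1)
    return total / len(targets)

ranked = [["a", "b", "c"], ["x", "y", "z"]]
truth = ["b", "z"]
print(hit_rate_at_k(ranked, truth, k=2), mrr(ranked, truth))  # 0.5 0.4166...
```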
Ablation Studies
Ablation studies quantify the contribution of each VRQ component and of each stage of the generative pipeline, highlighting the framework's ability to handle multi-view discrepancies and to align semantic objectives effectively.
Conclusion
OneVision presents a compelling generative framework for vision search in e-commerce, offering a unified approach to retrieval and personalization that simplifies the traditional pipeline. The integration of VRQ and dynamic pruning delivers significant gains in retrieval efficiency and effectiveness, with implications for broader e-commerce search and recommendation systems. Future work may further optimize the integration with transformer-based architectures and explore additional user-behavior adaptations.