- The paper presents an end-to-end generative framework that unifies retrieval and personalization by integrating vision-aligned residual quantization.

- It leverages a multi-stage pipeline with pretraining, supervised fine-tuning, and dynamic pruning to enhance accuracy and inference speed.

- Offline and online evaluations demonstrate significant gains in HitRate, Mean Reciprocal Rank, CTR, and order volume compared to traditional methods.

      OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search
The paper "OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search" presents a novel generative approach to e-commerce vision search, addressing limitations inherent to traditional multi-stage cascading architectures (MCA) used in vision search systems. The approach uniquely employs an end-to-end framework to enhance both retrieval and personalization by effectively integrating the stages of feature extraction, recall, pre-ranking, and ranking.
Vision-Aligned Residual Quantization
Hierarchical Encoding Challenges
The OneVision framework introduces Vision-aligned Residual Quantization (VRQ) to address shortcomings of the encoding methods currently used in generative retrieval. Existing quantizers such as FSQ and OPQ, and more general schemes such as VQ-VAE and RQ-KMeans, either lack hierarchical coherence when capturing attributes shared among similar items or fail to keep encodings unique. Both shortcomings hamper effective generative training and precise retrieval.
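To make the hierarchical-coding idea concrete, here is a minimal sketch of plain residual quantization in the spirit of RQ-KMeans: each level quantizes the residual left by the previous one, so items that are coarsely similar share code prefixes while deeper levels push toward a unique identifier. The codebook sizes, embedding dimension, and random codebook initialization are illustrative assumptions, not the paper's configuration.

```python
import torch

def residual_quantize(x, codebooks):
    """Map an embedding to one codeword index per level (a semantic ID).

    x: (d,) item embedding; codebooks: list of (K, d) tensors.
    Each level quantizes the residual left by the previous level, so early
    codes capture coarse attributes shared by similar items and later
    codes refine the encoding toward uniqueness.
    """
    residual = x.clone()
    sid = []
    for cb in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), cb)  # (1, K) distances to codewords
        idx = int(dists.argmin())
        sid.append(idx)
        residual = residual - cb[idx]  # pass what this level missed to the next
    return tuple(sid)

# Illustrative setup: a 3-level code with 256 codewords per level over 128-d embeddings.
torch.manual_seed(0)
codebooks = [torch.randn(256, 128) for _ in range(3)]
print(residual_quantize(torch.randn(128), codebooks))
```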
Implementation of VRQ
VRQ combines multi-view contrastive learning with residual encoding and leverages category information to improve encoding consistency and uniqueness. Contrastive alignment pulls together diverse visual representations of the same product across viewpoints (as sketched below), residual encoding preserves discriminative fine-grained features, and category supervision keeps encodings consistent within a category. Generative training then proceeds over these codes in a multi-stage pipeline.
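As a rough illustration of the multi-view alignment component, the sketch below implements a symmetric InfoNCE loss between two views of the same batch of products: embeddings of the same product from different viewpoints are pulled together, while other products in the batch are pushed apart. The two-view setup, temperature, and dimensions are assumptions for clarity; the paper's exact contrastive objective may differ.

```python
import torch
import torch.nn.functional as F

def multiview_infonce(view_a, view_b, temperature=0.07):
    """view_a, view_b: (B, d) embeddings of the same B products from two viewpoints."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature   # (B, B) cosine-similarity matrix
    targets = torch.arange(len(a))     # matching views sit on the diagonal
    # Symmetric loss: classify the correct partner in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = multiview_infonce(torch.randn(8, 128), torch.randn(8, 128))
```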
Multi-stage Generative Pipeline
Pretraining and Semantic Alignment
The framework begins with pretraining that aligns product images, categories, and semantic IDs (SIDs). The central task predicts an item's SID from its images, while additional modalities such as titles and categories reinforce the semantic alignment; these objectives are optimized with cross-entropy loss, as illustrated below.
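The sketch below shows the general shape of such a pretraining objective: a small transformer decoder predicts SID tokens autoregressively from visual features under a cross-entropy loss. The layer sizes, BOS handling, and three-level SID layout are placeholder assumptions, not the paper's actual backbone.

```python
import torch
import torch.nn as nn

class SIDDecoder(nn.Module):
    """Predicts a product's SID tokens autoregressively from visual features."""
    def __init__(self, vocab_size=256, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, image_tokens, sid_inputs):
        # image_tokens: (B, T, d) visual features, used as cross-attention memory;
        # sid_inputs: (B, L) teacher-forced SID codes.
        mask = nn.Transformer.generate_square_subsequent_mask(sid_inputs.size(1))
        h = self.decoder(self.embed(sid_inputs), memory=image_tokens, tgt_mask=mask)
        return self.head(h)  # (B, L, vocab_size) logits, one distribution per level

model = SIDDecoder()
image_tokens = torch.randn(2, 16, 128)                 # features from a vision encoder
sid_targets = torch.randint(0, 256, (2, 3))            # ground-truth 3-level SIDs
bos = torch.zeros(2, 1, dtype=torch.long)              # reuse code 0 as a BOS marker
sid_inputs = torch.cat([bos, sid_targets[:, :-1]], 1)  # shift right for teacher forcing
logits = model(image_tokens, sid_inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), sid_targets.reshape(-1))
```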
Supervised Fine-Tuning and Personalized Modeling
After pretraining, supervised fine-tuning (SFT) enables collaborative feature learning that maps multi-view images to the corresponding product SIDs. For personalized modeling, Direct Preference Optimization (DPO) leverages user behavior sequences to tailor retrieval results to user preferences; a list-wise DPO variant refines the candidate ranking based on user interaction data (see the sketch below).
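A minimal sketch of a list-wise DPO-style objective follows: a ranked candidate list is decomposed into preferred/dispreferred pairs, each scored by the gap between the policy's and a frozen reference model's log-probabilities. The pairwise decomposition and the `beta` value are assumptions; the paper's exact list-wise formulation may differ.

```python
import torch
import torch.nn.functional as F

def listwise_dpo_loss(logp_policy, logp_ref, beta=0.1):
    """logp_policy, logp_ref: (L,) sequence log-probs of L candidates,
    ordered from most to least preferred by user interaction data."""
    advantage = logp_policy - logp_ref  # implicit reward relative to the reference
    loss, n_pairs = 0.0, 0
    for i in range(len(advantage)):
        for j in range(i + 1, len(advantage)):
            # Candidate i is preferred over candidate j in the ranked list.
            loss = loss - F.logsigmoid(beta * (advantage[i] - advantage[j]))
            n_pairs += 1
    return loss / n_pairs

# Toy example with 3 candidates ranked by user interactions.
print(listwise_dpo_loss(torch.tensor([-3.1, -4.0, -5.2]),
                        torch.tensor([-3.5, -3.9, -5.0])))
```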
Dynamic Pruning and Efficient Inference
Pruning Techniques
To improve inference efficiency, OneVision applies dynamic pruning, selectively retaining the most informative visual tokens to reduce latency. K-means clustering compresses the visual tokens, and a distillation framework keeps the pruned model consistent with the original one.
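The sketch below shows a token-compression step under the assumption that scikit-learn's KMeans is an acceptable stand-in and that cluster centroids serve as the retained tokens. The distillation loss that keeps the pruned model consistent with the original (e.g., a KL term between their output distributions) is omitted here.

```python
import torch
from sklearn.cluster import KMeans

def compress_tokens(tokens, n_keep=16):
    """Compress (T, d) visual tokens into (n_keep, d) cluster centroids."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=0)
    km.fit(tokens.numpy())
    return torch.from_numpy(km.cluster_centers_).float()

tokens = torch.randn(256, 128)         # e.g. 256 patch tokens from a vision encoder
compact = compress_tokens(tokens, 16)  # 16x fewer tokens enter the decoder
print(compact.shape)                   # torch.Size([16, 128])
```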
End-to-End Inference Process
With VRQ encoding and pruning in place, OneVision runs beam search to produce ranked candidate sets, using a Trie to constrain decoding so that only valid SID sequences are generated. This yields a streamlined, efficient retrieval process.
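A minimal sketch of the Trie constraint: catalog SID sequences are inserted into a prefix tree, and at each decoding step only tokens that extend some valid sequence are allowed, so every finished beam decodes to a real product. The Trie API here is a hypothetical minimal version, not the paper's implementation.

```python
class Trie:
    """Prefix tree over valid SID sequences for constrained decoding."""
    def __init__(self):
        self.children = {}

    def insert(self, seq):
        node = self.children
        for tok in seq:
            node = node.setdefault(tok, {})

    def valid_next(self, prefix):
        """Return the set of tokens that legally extend `prefix`."""
        node = self.children
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node.keys())

trie = Trie()
trie.insert((17, 203, 88))  # SIDs of items actually in the catalog
trie.insert((17, 45, 6))

# During beam search, mask logits outside trie.valid_next(beam_prefix) to -inf.
print(trie.valid_next((17,)))  # {203, 45}
```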
Experimental Evaluation
Offline and Online Testing
In offline evaluations, OneVision matches or exceeds traditional MCA systems on both accuracy and efficiency benchmarks, with improvements in HitRate (HR) and Mean Reciprocal Rank (MRR) across multiple datasets. Online A/B tests on the Kuaishou platform confirm significant gains in business-critical metrics such as CTR, CVR, and order volume, underscoring the effectiveness of the unified retrieval and personalization approach.
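For reference, the two offline metrics named above can be computed as follows; the cutoff K and toy data are purely illustrative.

```python
def hit_rate_at_k(ranked_lists, targets, k=10):
    """Fraction of queries whose ground-truth item appears in the top-K results."""
    hits = sum(t in r[:k] for r, t in zip(ranked_lists, targets))
    return hits / len(targets)

def mrr(ranked_lists, targets):
    """Mean of 1/rank of the ground-truth item (0 when it is missing)."""
    total = 0.0
    for r, t in zip(ranked_lists, targets):
        if t in r:
            total += 1.0 / (r.index(t) + 1)
    return total / len(targets)

ranked = [["a", "b", "c"], ["x", "y", "z"]]
truth = ["b", "z"]
print(hit_rate_at_k(ranked, truth, k=2), mrr(ranked, truth))  # 0.5 0.4166...
```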
Ablation Studies
Ablation studies quantify the contribution of each VRQ component and of each stage of the generative pipeline, highlighting the framework's ability to handle multi-view discrepancies and to align semantic objectives effectively.
Conclusion
OneVision presents a compelling generative framework for vision search in e-commerce, offering a unified approach to retrieval and personalization that simplifies the traditional pipeline. The integration of VRQ and dynamic pruning delivers significant gains in retrieval efficiency and effectiveness, with implications for broader e-commerce search and recommendation systems. Future work may further optimize the integration with transformer-based architectures and explore additional user-behavior adaptations.