LEAPS: LLM-Empowered Plugin for Taobao AI Search
- LEAPS is a modular architecture that enhances e-commerce search using LLMs to expand queries and verify results for improved recall and precision.
- It integrates a Query Expander and a Relevance Verifier to broaden search coverage and filter results without altering underlying indices.
- Operational deployment in Taobao AI Search has shown reduced zero-result cases and increased click-through rates, demonstrating its real-world efficacy.
LEAPS (LLM-Empowered Adaptive Plugin for Taobao AI Search) is a modular architecture designed to transform traditional e-commerce search systems into conversational, high-recall, and high-precision platforms suitable for handling complex, natural-language user queries. LEAPS deploys two lightweight, non-invasive plugins, an upstream Query Expander and a downstream Relevance Verifier, that wrap around existing retrieval engines without any modification to the underlying indices or ranking layers. The system has been fully deployed in Taobao AI Search, serving hundreds of millions of users monthly since August 2025 (Wang et al., 9 Jan 2026).
1. "Broaden-and-Refine" Plugin Paradigm
LEAPS operates by augmenting the standard e-commerce search pipeline, which typically processes a short query through a retrieval engine to obtain top-$k$ results. The architecture integrates two plugins:
- Query Expander ($\mathfrak{E}$): Upstream, generates a set of complementary rewrites $\{q'_1, \dots, q'_n\} = \mathfrak{E}(q)$, broadening the search boundaries to maximize candidate coverage and address zero-result scenarios. The expanded result set is $I_b = \bigcup_i S(q'_i)$, where $S$ denotes the retrieval engine.
- Relevance Verifier ($\mathfrak{V}$): Downstream, assesses each candidate $i \in I_b$, synthesizing heterogeneous context and performing chain-of-thought (CoT) reasoning to filter for precise, relevant matches: $I_r = \{\, i \in I_b : \mathfrak{V}(q, i, u) = \text{relevant} \,\}$, where $u$ denotes user profile data.
This paradigm explicitly bridges the semantic gap between high-dimensional conversational intent and short-text retrieval optimization, operating with minimal latency overhead.
2. Query Expander: Multi-Stage LLM Fine-Tuning
The Query Expander's objective is to output a diverse set of rewrites that jointly maximize recall and minimize semantic redundancy. Its training consists of three key stages:
- Inverse Data Augmentation (SFT Warm-up): Using approximately 2 million high-quality product titles, an LLM (DeepSeek-R1) generates system-style rewrites, and colloquial user queries are reverse-generated to form (query, rewrite) pairs; the expander is trained to minimize the standard negative log-likelihood over this aligned dataset.
- Posterior-Knowledge Supervised Fine-Tuning: Real user logs are leveraged to optimize for downstream retrieval effectiveness. For each query $q$, its core product term and attribute set are enumerated to obtain candidate rewrites. Each rewrite is evaluated in the retrieval engine $S$, and only those yielding the highest relevance density (as measured by $\mathfrak{V}$) are selected to fine-tune $\mathfrak{E}$. This procedure ensures attribute relaxations are empirically justified.
- Diversity-Aware Reinforcement Learning: To avoid mode collapse in the generated rewrites, set-level policy optimization is used. Three custom reward signals are defined over the rewrite set $R = \{q'_1, \dots, q'_n\}$ (a minimal computational sketch follows this list):
- Hybrid Relevance (HR): Balances per-rewrite precision and unique contribution.
- Global Relevance (GR): Fraction of final relevant items among all candidates.
- Effective Relevance (ER): Relevance ratio with respect to pre-deduplication candidates.
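The exact reward formulas are not reproduced in this summary; the Python sketch below is one plausible instantiation of the stated intuitions, with all names and the 50/50 HR weighting invented for illustration.

```python
from typing import Dict, Set

def set_level_rewards(retrieved: Dict[str, Set[str]],
                      relevant: Set[str]) -> Dict[str, float]:
    """Illustrative HR/GR/ER rewards over a rewrite set.

    retrieved: rewrite -> items returned for it by the engine S
    relevant:  items the verifier 𝔙 accepts for the original query
    """
    all_items = set().union(*retrieved.values()) if retrieved else set()
    total_raw = sum(len(items) for items in retrieved.values())  # pre-dedup count

    # Hybrid Relevance (HR): blend per-rewrite precision with the share of
    # relevant items that only this rewrite contributes.
    hr_terms = []
    for q_prime, items in retrieved.items():
        others = set().union(*(v for k, v in retrieved.items() if k != q_prime))
        precision = len(items & relevant) / len(items) if items else 0.0
        unique = len((items - others) & relevant) / max(len(relevant), 1)
        hr_terms.append(0.5 * precision + 0.5 * unique)
    hr = sum(hr_terms) / max(len(hr_terms), 1)

    # Global Relevance (GR): relevant fraction of the deduplicated pool.
    gr = len(all_items & relevant) / max(len(all_items), 1)

    # Effective Relevance (ER): relevant count relative to raw, pre-dedup totals.
    er = len(all_items & relevant) / max(total_raw, 1)

    return {"HR": hr, "GR": gr, "ER": er}
```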
The training employs Group Sequence Policy Optimization (GSPO), which extends PPO to operate on sequence-set feedback, optimizing for both relevance and diversity under explicit constraints.
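For reference, the standard GSPO objective (stated here from the GSPO literature; the paper's exact variant is not restated in this summary) clips a length-normalized, sequence-level importance ratio against group-normalized rewards:

$$
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|},
\qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})},
$$

$$
\mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( s_i(\theta)\,\hat{A}_i,\; \operatorname{clip}\!\left(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_i \right) \right].
$$

In LEAPS, the group reward $r_i$ would be instantiated with the set-level HR/GR/ER signals above.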
3. Relevance Verifier: Semantic Filtering with Multi-Source Data and Chain-of-Thought
The Relevance Verifier is designed to provide robust, low-latency filtering for candidate items. It incorporates:
- Multi-Source Data Integration: Input context includes item metadata (title, attributes), OCR text from images, customer reviews, price, transaction velocity, and shop/brand reputation, as well as user context like geolocation and recent interaction history. These are concatenated into prompts up to ~1,500 tokens (a prompt-assembly sketch follows this list).
- Chain-of-Thought (CoT) Reasoning: ~800,000 manually annotated query–item pairs supply both binary relevance labels and rationales, enabling multi-instruction SFT where the verifier jointly outputs the relevance label and a concise justification. This enhances semantic robustness and interpretability.
- Deployment and Latency Optimizations: Candidate batching, feedback-driven slot allocation, and adaptive pagination are employed to balance recall, throughput, and response time, ensuring operational feasibility at Taobao's large scale.
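As a concrete illustration of the multi-source prompt construction, here is a minimal sketch; every field name and the truncation heuristic are assumptions, since the paper's schema is not given in this summary.

```python
def build_verifier_prompt(query: str, item: dict, user: dict,
                          max_tokens: int = 1500) -> str:
    # Hypothetical field names; the production schema is not published here.
    parts = [
        f"Query: {query}",
        f"Title: {item['title']}",
        f"Attributes: {item['attributes']}",
        f"Image OCR: {item['ocr_text']}",
        f"Review snippets: {item['reviews']}",
        f"Price: {item['price']} | Sales velocity: {item['sales_velocity']}",
        f"Shop/brand reputation: {item['shop_reputation']}",
        f"User geo: {user['geo']} | Recent items: {user['recent_items']}",
        "Task: is the item relevant to the query? "
        "Answer with a label and a one-sentence rationale.",
    ]
    prompt = "\n".join(parts)
    # Crude character-based stand-in for a tokenizer-enforced ~1,500-token budget.
    return prompt[: max_tokens * 4]
```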
4. Integration and Operational Workflow
LEAPS is architected for seamless integration in black-box search systems. Both plugins interact with the main retrieval engine solely through its public query and ingestion APIs. The high-level workflow is formalized as:
```python
def LEAPS_Search(q, u):
    # Upstream: broaden
    rewrites = Expander.generate(q)                    # 𝔈
    I_b = set()
    for q_prime in rewrites:
        I_b |= SearchEngine.submit(q_prime)            # S
    # Downstream: refine
    I_r = []
    for i in I_b:
        label, rationale = Verifier.predict(q, i, u)   # 𝔙
        if label == "relevant":
            I_r.append((i, rationale))
    return I_r
```
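To make the black-box contract concrete, a toy invocation with stubbed components might look as follows; all stub behaviors are invented for illustration.

```python
class Expander:
    @staticmethod
    def generate(q):
        # Stub: the real expander is an LLM producing complementary rewrites.
        return [q, q + " waterproof", q + " lightweight"]

class SearchEngine:
    @staticmethod
    def submit(q_prime):
        # Stub: the real engine is reached through its public query API.
        return {f"item_{hash(q_prime) % 97}", f"item_{hash(q_prime) % 7}"}

class Verifier:
    @staticmethod
    def predict(q, i, u):
        # Stub: the real verifier is an LLM returning a label and rationale.
        return "relevant", "title matches the query intent"

print(LEAPS_Search("hiking backpack", {"geo": "Hangzhou"}))
```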
This setup permits low-cost deployment in varied settings, with no dependency on internal search engine mechanics.
5. Empirical Results and Comparative Evaluation
Offline Evaluation
Relevance Verifier Performance
The Verifier achieves high precision and recall across query frequencies; in the reported comparison, Qwen3-14B attains the highest F1 scores, with Tbstar2.5-16B-A2B performing comparably:
| Model | Head F1 | Torso F1 | Tail F1 | All F1 |
|---|---|---|---|---|
| Qwen3-14B | 91.64 | 87.89 | 76.06 | 85.19 |
| Tbstar2.5-16B-A2B | 91.51 | 87.38 | 75.33 | 84.85 |
Query Expander Performance
LEAPS optimization outperforms production and SFT-only baselines on all major reward metrics (HR, GR, ER):
| Method | All HR | All GR | All ER |
|---|---|---|---|
| Production | 0.415 | 0.552 | 0.476 |
| SFT | 0.432 | 0.575 | 0.495 |
| LEAPS-HR | 0.555 | 0.670 | 0.580 |
| LEAPS-GR | 0.520 | 0.735 | 0.600 |
| LEAPS-ER | 0.540 | 0.720 | 0.620 |
Ablations indicate GSPO yields superior training stability and final reward compared to REINFORCE++ and GRPO. Chain-of-Thought integration further stabilizes RL training trajectories.
Online A/B Testing
At scale in Taobao AI Search, LEAPS provided:
- Low-Result Rate: Reduced from 24.88% to 16.98% (–31.7% relative).
- Click-Through Rate: Increased from 9.39% to 10.93% (+16.4% relative).
These measurements demonstrate that the expander ($\mathfrak{E}$) successfully recovers zero-result cases, while the verifier ($\mathfrak{V}$) maintains or improves precision (Wang et al., 9 Jan 2026).
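As a quick arithmetic check, the relative changes follow directly from the absolute rates:

$$
\frac{24.88 - 16.98}{24.88} \approx 31.7\%, \qquad \frac{10.93 - 9.39}{9.39} \approx 16.4\%.
$$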
6. Deployment Architecture, Latency, and Integration
- Latency: Expander (190 ms for ~180 input/20 output tokens); Verifier (150 ms for ~1,500 input/12 output tokens). Total overhead is approximately 340 ms per conversational query, aligning with Taobao's high-value conversational search requirements.
- Cost and Complexity: LEAPS employs batch inference and adaptive pagination, minimizing peak QPS and infrastructure overhead (a batching sketch follows this list). Black-box integration ensures no changes to indices or ranking stacks.
- Retrieval Preservation: For head queries, the expander can emit the original query, with the verifier providing highly precise gating, ensuring no degradation for standard short-text search flows.
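A minimal sketch of the batched verification loop, assuming a hypothetical `Verifier.predict_batch` endpoint (not an API documented in the paper):

```python
def verify_in_batches(q, u, candidates, batch_size=16):
    # batch_size is illustrative; production slot allocation is feedback-driven.
    kept = []
    for start in range(0, len(candidates), batch_size):
        batch = candidates[start:start + batch_size]
        # One model call scores the whole batch, amortizing per-call latency.
        labels = Verifier.predict_batch(q, batch, u)   # hypothetical endpoint
        kept.extend(i for i, lab in zip(batch, labels) if lab == "relevant")
    return kept
```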
7. Limitations, Ablations, and Future Directions
Noted limitations include uniform search budget allocation per rewrite (suggesting future work on adaptive weighting using a secondary RL phase), dependence on an OCR → text pipeline in the verifier (motivating native multimodal vision-language approaches and teacher-student distillation), and open questions regarding end-to-end conversion rate and longitudinal user engagement effects. Further, while CoT improves training and robustness, real-world deployment may benefit from scalable alternatives for rationale generation.
LEAPS demonstrates the viability of a modular, reinforcement-learning-enhanced "Broaden-and-Refine" plugin architecture for real-world, conversational e-commerce search, setting a template for non-disruptive LLM augmentation of production AI search engines (Wang et al., 9 Jan 2026).