- The paper establishes a unified supervision framework that prioritizes semantic relevance over raw engagement to enhance sponsored search retrieval.
- It integrates graded relevance labels, multi-channel retrieval priors, and debiased user engagement to fine-tune bi-encoder training.
- Empirical evaluations demonstrate significant gains in metrics like P@25 and NDCG@25, validating the method’s practical impact on both offline and online systems.
Motivation and Problem Statement
Deployed e-commerce retrieval systems frequently rely on user engagement signals (e.g., clicks, orders) as supervision for large-scale bi-encoder retriever training. Engagement is cheap to collect at scale, but it is a noisy and biased proxy for semantic relevance, especially in sponsored ad search. Engagement signals capture influences beyond semantic match, such as item popularity, promotions, presentation, and pricing, and they can be structurally sparse for candidate ads because of auction mechanics, advertiser budget constraints, and limited ad slots. This sparsity and bias create two problems: retrievers supervised on engagement may favor popular but less semantically relevant items, and they generalize poorly to cold-start and long-tail inventory. Prior frameworks have filtered engagement supervision with relevance-based heuristics but typically retain engagement as the dominant training signal. This paper instead makes semantic relevance the primary supervision and uses engagement strictly as a preference signal among already-relevant items (2604.07930).
Unified Supervision Framework
The central contribution is a unified supervision paradigm for bi-encoder ad retrieval, which synthesizes training targets from three heterogeneous but complementary sources:
- Graded Relevance Labels: Relevance is rated on a 5-point ordinal scale using a cascade of cross-encoder teacher models (including LLMs like Gemma and LLaMA-3) and available human annotations. Early exit based on confidence ensures computational efficiency; final labels are rescaled to [0,1].
- Multi-Channel Retrieval Prior: Rank positions from production channels are captured, normalized, and aggregated to yield a prior score and channel consensus that reflect system-level confidence and identify hard negatives (highly ranked but semantically irrelevant).
- User Engagement: Historical engagement (orders, add-to-carts, clicks, views) is aggregated, debiased, and combined via learnable weights and non-linear transformations. Critically, engagement is applied only among semantically relevant pairs to refine preferences, preventing popularity-driven irrelevance.
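As a rough illustration of the first two signals in the list above, the sketch below shows a confidence-based early-exit cascade and a rank-based prior with channel consensus. It is not the paper's implementation: the `Teacher` interface, the `min_confidence` threshold, the linear `(grade - 1) / 4` rescaling, and the linear rank normalization with `max_rank` are all assumptions chosen for readability.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical teacher interface: (query, item) -> (grade in 1..5, confidence in 0..1).
Teacher = Callable[[str, str], Tuple[int, float]]


def cascade_label(query: str, item: str, teachers: Sequence[Teacher],
                  min_confidence: float = 0.9) -> float:
    """Run teachers cheapest-first and stop at the first confident one.

    Returns the grade rescaled to [0, 1]. If no teacher is confident,
    the last teacher's grade is used (an assumption of this sketch).
    """
    if not teachers:
        raise ValueError("at least one teacher is required")
    grade = 1
    for teacher in teachers:
        grade, confidence = teacher(query, item)
        if confidence >= min_confidence:
            break  # early exit: skip the more expensive teachers
    return (grade - 1) / 4.0  # assumed linear map from the 5-point scale to [0, 1]


def rank_prior(channel_ranks: Dict[str, Optional[int]],
               max_rank: int = 100) -> Tuple[float, float]:
    """Aggregate per-channel ranks into (prior score, channel consensus).

    channel_ranks maps a channel name to a 1-based rank, or None when the
    channel did not retrieve the item. Missing channels contribute 0.
    """
    if not channel_ranks:
        return 0.0, 0.0
    scores = [max(0.0, 1.0 - (rank - 1) / max_rank)
              for rank in channel_ranks.values() if rank is not None]
    prior = sum(scores) / len(channel_ranks)
    consensus = len(scores) / len(channel_ranks)
    return prior, consensus
```

Under these assumptions, an item that several channels rank highly but that the cascade grades near 0 is the kind of hard negative the list describes.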
A formalized target is constructed, where positive and negative query-item pairs are scored with weighted combinations of these signals. Positives utilize relevance, rank prior, channel consensus, and a gated engagement boost. Negatives are prioritized using hard-negative mining—focusing on both deceptive (highly ranked) and lexically similar but irrelevant items.
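One possible reading of this target construction is sketched below. The summary does not give the actual formula, so every weight, the `gate` threshold, the `log1p`/`tanh` transforms, and the function names are hypothetical; in the paper the engagement weights are learnable, whereas here they are fixed constants, and the counts are assumed to be already debiased.

```python
import math


def engagement_score(orders: float, carts: float, clicks: float, views: float,
                     weights=(1.0, 0.5, 0.2, 0.05)) -> float:
    """Combine (already debiased) engagement counts into a score in [0, 1).

    log1p dampens heavy-tailed counts and tanh squashes the weighted sum;
    both are illustrative choices, not taken from the paper.
    """
    counts = (orders, carts, clicks, views)
    raw = sum(w * math.log1p(c) for w, c in zip(weights, counts))
    return math.tanh(raw)


def positive_target(relevance: float, prior: float, consensus: float,
                    engagement: float, *, w_rel: float = 0.6,
                    w_prior: float = 0.15, w_cons: float = 0.05,
                    w_eng: float = 0.2, gate: float = 0.75) -> float:
    """Score a positive pair; engagement counts only when relevance passes the gate."""
    boost = w_eng * engagement if relevance >= gate else 0.0
    score = w_rel * relevance + w_prior * prior + w_cons * consensus + boost
    return min(1.0, score)


def hard_negative_priority(relevance: float, prior: float, lexical_sim: float,
                           *, irrelevant_below: float = 0.5) -> float:
    """Prioritize irrelevant items that are highly ranked or lexically similar."""
    if relevance >= irrelevant_below:
        return 0.0  # not a negative at all
    return (1.0 - relevance) * max(prior, lexical_sim)
```

The gate is what keeps engagement from promoting popular but irrelevant items: with these defaults, a pair with relevance 0.5 receives no boost regardless of how many orders it has.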
Model Training and Losses
A MiniLM-based bi-encoder architecture is fine-tuned under this unified supervision. Contrastive losses (via Cached Multiple Negatives Ranking) and cosine similarity loss are both employed. Hard-negative mining is supported by curriculum-based sampling, emphasizing challenging negatives as determined by the unified scoring function.
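To make the two loss families concrete, here is a dependency-free sketch of an in-batch softmax ranking loss (the objective that Multiple Negatives Ranking builds on) and a cosine-similarity regression loss against a graded target. The gradient caching that gives the cached variant its name is a memory optimization and is omitted; the `scale` value of 20 and the plain-list embeddings are assumptions of this sketch, not details from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(queries: List[Vector], items: List[Vector],
                          scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine scores.

    items[i] is the positive for queries[i]; every other item in the
    batch acts as a negative for that query.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, item) for item in items]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)


def cosine_target_loss(queries: List[Vector], items: List[Vector],
                       targets: Sequence[float]) -> float:
    """Mean squared error between cosine similarity and a target in [0, 1]."""
    errors = [(cosine(q, d) - t) ** 2
              for q, d, t in zip(queries, items, targets)]
    return sum(errors) / len(errors)
```

In this framing, the unified score would plausibly supply the `targets` for the regression loss and decide which hard negatives join each batch for the ranking loss.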
Empirical Evaluation
Quantitative Results
Offline evaluation covers 30,303 queries spanning both head and tail segments, with the top-25 retrieved results assessed for sponsored ad slot relevance. Key findings:
- Relevance-Only Supervision: Substantially outperforms production, improving P@25 from 0.794 to 0.873 (+10.0%), NDCG@25 from 0.867 to 0.913 (+5.4%), and average relevance from 3.040 to 3.263 (+7.3%).
- Unified Supervision (Relevance+Engagement): Additional gains are realized: P@25 to 0.877 (+10.5%), NDCG@25 to 0.916 (+5.7%), and average relevance to 3.277 (+7.8%). This shows that incorporating engagement preferentially among relevant candidates improves ordering without diluting semantic quality.
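For readers who want to reproduce this style of evaluation, the following sketch computes precision and NDCG at a cutoff from a ranked list of relevance grades, plus the relative lift between two metric values. The summary does not state the paper's exact definitions, so the relevance threshold `relevant_from`, the exponential gain `2**grade - 1`, and computing the ideal ordering from the retrieved list alone are assumptions.

```python
import math
from typing import Sequence


def precision_at_k(grades: Sequence[int], k: int = 25,
                   relevant_from: int = 3) -> float:
    """Fraction of the top-k slots filled by items graded >= relevant_from."""
    return sum(1 for g in grades[:k] if g >= relevant_from) / k


def dcg_at_k(grades: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2 ** g - 1) / math.log2(rank + 1)
               for rank, g in enumerate(grades[:k], start=1))


def ndcg_at_k(grades: Sequence[int], k: int = 25) -> float:
    """DCG normalized by the DCG of the same grades in ideal (sorted) order."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


def relative_lift(baseline: float, treatment: float) -> float:
    """Relative change of treatment over baseline, in percent."""
    return 100.0 * (treatment - baseline) / baseline
```

Lifts recomputed from the rounded metric values quoted above can differ from the reported percentages in the last digit, since the reported figures were presumably derived from unrounded metrics.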
Online A/B testing on live traffic shows statistically significant lifts in business and engagement metrics, including impressions (+0.60%, p=0.03) and add-to-cart rate (+0.99%, p=0.009).
Figure 1: Engagement supervision increases the share of highly engaged items retrieved in the Top 25 without relevance degradation.
Figure 1 demonstrates the percentage point gains for highly engaged items in the Top-25 set, underscoring that engagement-aware supervision reliably concentrates preferred items at operational positions while maintaining strict relevance filters.
Qualitative Results
Analysis of sampled queries shows that the unified supervision approach both resolves off-intent retrievals and promotes highly engaged options among plausible matches. For the query “cetaphil baby lotion,” the production system returns an irrelevant (“wash”) product, while relevance+engagement surfaces the intended (“lotion”) item with maximal engagement. The model also disambiguates brands and dietary preferences, and surfaces items favored by real user behavior.
Figure 2: Green highlights: phrases aligned with intent; red highlights: failures. Unified supervision retrieves both relevant and highly-engaged items; Relevance-only retrieves relevant but unpopular items.
Figure 2 illustrates that the unified approach balances semantic alignment and empirical preference, rectifying both relevance and engagement deficiencies.
Implications and Future Directions
This work indicates that, for sponsored search retrieval, relevance-centric supervision with engagement as a secondary, within-relevance discriminator yields retrieval models that scale and generalize under sparsity and exposure bias. Production-channel priors serve two roles: they reinforce positives and, more importantly, they supply hard negatives efficiently, steering the model toward failure modes that the production system already exhibits.
Practically, these results indicate that production-scale business metrics in high-volume e-commerce can be improved by decoupling engagement from relevance in model supervision, rather than conflating the two. This paradigm enables retrieval systems to align with downstream ranking and business objectives without the detrimental effect of promoting purely popular but semantically irrelevant items.
Theoretically, the unified framework opens up avenues for curriculum learning, distillation from richer teacher architectures, and targeted intervention with new forms of engagement (e.g., dwell time, post-click satisfaction, or cross-session behavior). Ongoing research may extend these principles to unified retrieval-ranking cascades, multi-modal supervision, or even zero-shot/adaptive retrieval systems leveraging foundation models.
Conclusion
This paper establishes unified, relevance-primary supervision as an effective solution for sponsored search retrieval under sparse and biased engagement. Composite targets integrating relevance, retrieval priors, and engagement yield retrieval models that are both semantically robust and preference-aligned, with consistent improvements over the production baseline in both offline and online evaluations. The framework should extend to other large-scale retrieval and recommendation settings where engagement is sparse or biased.