Counterfactual Evaluation of Ads Ranking Models through Domain Adaptation (2409.19824v1)

Published 29 Sep 2024 in cs.IR and cs.AI

Abstract: We propose a domain-adapted reward model that works alongside an Offline A/B testing system for evaluating ranking models. This approach effectively measures reward for ranking model changes in large-scale Ads recommender systems, where model-free methods like IPS are not feasible. Our experiments demonstrate that the proposed technique outperforms both the vanilla IPS method and approaches using non-generalized reward models.

Summary

  • The paper introduces a domain-adapted reward model for robust counterfactual evaluation of ads ranking models, addressing limitations of traditional methods like IPS in complex scenarios.
  • The method trains the reward model using a weighted loss function that emphasizes differences between ranking policies, allowing it to generalize and estimate lift accurately across various policy domains.
  • Experimental results on synthetic and real-world data demonstrate that the proposed domain-adapted model significantly outperforms baseline methods and vanilla IPS in evaluating ranking policies.

This paper introduces a domain-adapted reward model to enhance counterfactual evaluation of ads ranking models, particularly in scenarios where traditional model-free methods like IPS are impractical (2409.19824). The core innovation lies in training a reward model that generalizes across different ranking policies, facilitating accurate lift estimation within an offline A/B testing framework.

Domain-Adapted Reward Model and Offline A/B Testing System

The paper addresses the problem of selection bias inherent in large-scale recommender systems by proposing a domain-adapted reward model, h(x, a), that estimates the reward y given context x and ad a. This reward model is trained to function effectively across multiple domains, where each domain corresponds to a specific ranking policy. The methodology leverages an offline A/B testing system, which simulates ad recommendations for each target domain using historical data. Each target domain, denoted T_k, contains the ads recommended by a specific ranking model.
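To make the setup concrete, the following is a minimal sketch of the two pieces this description implies: a reward model h(x, a) and per-policy simulated logs produced by replaying historical requests. All class and function names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch, not the paper's code: a reward model h(x, a)
# plus a helper that replays historical traffic through a candidate policy.

class RewardModel:
    """Estimates reward y (e.g., click or conversion) from context x and ad a."""
    def __init__(self, weights):
        self.weights = weights  # learned parameters

    def predict(self, x, a):
        # Simple linear scorer over concatenated context/ad features (assumed form).
        features = np.concatenate([x, a])
        return float(features @ self.weights)

def simulate_target_domain(ranking_policy, historical_requests):
    """Replay historical requests through a candidate ranking policy.

    Returns the (context, ad) pairs the policy would have shown, i.e. the
    simulated logs for target domain T_k. `ranking_policy.rank` is an assumed interface.
    """
    return [(x, ranking_policy.rank(x, candidates)[0])
            for x, candidates in historical_requests]
```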

Lift Estimation

The reward model facilitates the estimation of lift between a target domain (T_k) and a source domain (S). Lift is quantified as the difference in expected reward, as estimated by the reward model, between the target and source domains. This is a critical step in assessing the impact of transitioning from one ranking policy to another.
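In symbols, writing expectations over the simulated logs of each domain, the lift for target policy T_k can be restated as follows (a restatement of the description above, not a formula quoted from the paper):

```latex
\mathrm{Lift}_k \;=\; \mathbb{E}_{(x,a)\sim T_k}\big[\,h(x,a)\,\big] \;-\; \mathbb{E}_{(x,a)\sim S}\big[\,h(x,a)\,\big]
```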

Weighted Loss Function

Training the reward model involves minimizing a weighted loss function on labeled data from the source domain, D_S. The weighting scheme is designed to emphasize non-overlapping regions between target and source domains, thereby improving the model's ability to generalize across different policies. The weight w^k_a is defined as the ratio of the probability of observing ad a under context x with target policy T_k to the probability under source policy S. The loss function incorporates two key terms (a code sketch follows this list):

  • |w^k_a - 1|: This term focuses on the discrepancies between target and source domains, ensuring that the reward model is sensitive to policy changes.
  • β Σ_{k'} |w^k_a - w^{k'}_a|: This term, regulated by the hyperparameter β, minimizes the deviation in weighting across different target domains. It promotes consistent performance of the reward model across all domains.
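A minimal sketch of how such a weighted per-example loss could be assembled is shown below, assuming the per-domain ratios w^k_a have already been estimated. The base squared-error loss, the way the two terms are combined, and all names are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def weighted_loss(y_true, y_pred, w, beta=0.1):
    """Weighted regression loss on source-domain data D_S (illustrative sketch).

    y_true, y_pred : arrays of shape (n,) with observed and predicted rewards.
    w              : array of shape (n, K); w[i, k] is the ratio
                     w^k_a = P(a | x, T_k) / P(a | x, S) for example i.
    beta           : hyperparameter balancing the cross-domain consistency term.
    """
    base = (y_true - y_pred) ** 2                        # base per-example loss
    domain_shift = np.abs(w - 1.0).sum(axis=1)           # |w^k_a - 1|: emphasize non-overlap with S
    consistency = np.abs(w[:, :, None] - w[:, None, :]).sum(axis=(1, 2))  # |w^k_a - w^{k'}_a| over domain pairs
    weights = domain_shift + beta * consistency
    return float(np.mean(weights * base))
```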

Implementation and Evaluation

The implementation integrates the domain-adapted reward model into an offline A/B testing system, allowing for a structured evaluation of different ranking policies. The process involves simulating ad recommendations for each target domain, predicting rewards using the trained reward model, calculating lift between target and source domains, and ranking policies based on their estimated lifts.
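Put together, the evaluation loop might look like the sketch below, reusing the illustrative RewardModel and simulate_target_domain helpers from above; this is an assumed workflow consistent with the description, not the authors' code.

```python
import numpy as np

def offline_ab_evaluation(reward_model, source_logs, target_policies, historical_requests):
    """Rank candidate policies by estimated lift over the source policy S.

    source_logs        : list of (x, a) pairs actually served by the source policy.
    target_policies    : dict mapping policy name -> ranking policy object.
    historical_requests: list of (x, candidates) used to replay each target policy.
    """
    source_reward = np.mean([reward_model.predict(x, a) for x, a in source_logs])

    lifts = {}
    for name, policy in target_policies.items():
        target_logs = simulate_target_domain(policy, historical_requests)
        target_reward = np.mean([reward_model.predict(x, a) for x, a in target_logs])
        lifts[name] = target_reward - source_reward   # estimated lift of T_k over S

    # Rank policies from highest to lowest estimated lift.
    return sorted(lifts.items(), key=lambda kv: kv[1], reverse=True)
```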

Experimental Results

The paper substantiates its claims with experimental results derived from both synthetic and real-world data.

  • Synthetic Data: In a controlled synthetic environment, the proposed reward model demonstrated superior performance compared to a baseline model (trained solely on source-domain data) and the vanilla IPS method. The performance metric used was Rec_cv (coefficient of variation of recovery), which measures the accuracy of recovery.
  • Online Experiment (CTR Prediction): Using data from a completed A/B test for a CTR prediction model, the proposed reward model achieved a 17.6% improvement on the Rec_cv metric compared to a baseline model. Because the propensity score weight cannot be calculated directly in complex recommendation systems, the weights were estimated by training an impression probability estimator per target domain (see the sketch after this list).
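One way such per-domain propensity ratios could be estimated is with a per-target-domain impression-probability classifier combined with the standard density-ratio trick, as in the following sketch; the choice of logistic regression and all names here are illustrative assumptions rather than the paper's specific estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_propensity_weights(source_features, target_features_by_domain):
    """Estimate w^k_a = P(a | x, T_k) / P(a | x, S) per source example (illustrative).

    source_features          : array (n_s, d) of (context, ad) features served by S.
    target_features_by_domain: dict name -> array (n_k, d) of features served by T_k.
    """
    weights = {}
    for name, target_features in target_features_by_domain.items():
        # Train a classifier to distinguish T_k impressions from S impressions.
        X = np.vstack([source_features, target_features])
        y = np.concatenate([np.zeros(len(source_features)), np.ones(len(target_features))])
        clf = LogisticRegression(max_iter=1000).fit(X, y)

        # Density-ratio trick: convert class probabilities into a ratio,
        # rescaled by the class priors implied by the sample sizes.
        p_target = clf.predict_proba(source_features)[:, 1]
        prior_ratio = len(source_features) / len(target_features)
        weights[name] = prior_ratio * p_target / np.clip(1.0 - p_target, 1e-6, None)
    return weights
```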

In summary, the paper introduces a domain-adapted reward model for counterfactual evaluation of ads ranking models, particularly in scenarios where traditional model-free methods like IPS are not feasible. The reward model is trained using a weighted loss function that emphasizes the differences between the current (source) policy and the new (target) policies. Experimental results using both synthetic and real-world data demonstrate that the proposed reward model outperforms both a baseline model and the vanilla IPS method.
