- The paper introduces a domain-adapted reward model for robust counterfactual evaluation of ads ranking models, addressing limitations of traditional model-free methods such as IPS in settings where they are impractical.
- The method trains the reward model using a weighted loss function that emphasizes differences between ranking policies, allowing it to generalize and estimate lift accurately across various policy domains.
- Experimental results on synthetic and real-world data demonstrate that the proposed domain-adapted model significantly outperforms baseline methods and vanilla IPS in evaluating ranking policies.
This paper introduces a domain-adapted reward model to enhance counterfactual evaluation of ads ranking models, particularly in scenarios where traditional model-free methods like IPS are impractical (2409.19824). The core innovation lies in training a reward model that generalizes across different ranking policies, facilitating accurate lift estimation within an offline A/B testing framework.
Domain-Adapted Reward Model and Offline A/B Testing System
The paper addresses the problem of selection bias inherent in large-scale recommender systems by proposing a domain-adapted reward model, h(x, a), that estimates the reward y given context x and ad a. This reward model is trained to function effectively across multiple domains, where each domain represents a specific ranking policy. The methodology leverages an offline A/B testing system, which simulates ad recommendations for each target domain using historical data. Each target domain, denoted T_k, represents the ads recommended by a specific ranking model.
Lift Estimation
The reward model facilitates the estimation of lift between a target domain T_k and a source domain S. Lift is quantified as the difference in expected reward, as estimated by the reward model, between the target and source domains, i.e., lift(T_k) = E_{T_k}[h(x, a)] − E_S[h(x, a)]. This is a critical step in assessing the impact of transitioning from one ranking policy to another.
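A minimal sketch of this lift computation, assuming a trained reward model exposing a hypothetical `predict(context, ad)` interface and replayed logs for the target domain; the names and data layout are illustrative, not from the paper:

```python
import numpy as np

def estimate_lift(reward_model, source_logs, target_logs):
    """Estimate the lift of a target policy T_k over the source policy S as the
    difference in mean predicted reward (hypothetical data layout)."""
    # Predicted reward for the ads the source policy actually served.
    y_source = reward_model.predict(source_logs["context"], source_logs["ad"])
    # Predicted reward for the ads the target policy would have served,
    # obtained by replaying the target ranker on the same historical traffic.
    y_target = reward_model.predict(target_logs["context"], target_logs["ad"])
    return float(np.mean(y_target) - np.mean(y_source))
```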
Weighted Loss Function
Training the reward model involves minimizing a weighted loss function on labeled data from the source domain, D_S. The weighting scheme is designed to emphasize non-overlapping regions between target and source domains, thereby improving the model's ability to generalize across different policies. The weight w_a^k is defined as the ratio of the probability of observing ad a under context x with target policy T_k to the probability under source policy S. The loss function incorporates two key terms:
- |w_a^k − 1|: This term focuses on the discrepancies between target and source domains, ensuring that the reward model is sensitive to policy changes.
- β·Σ_{k'} |w_a^k − w_a^{k'}|: This term, regulated by the hyperparameter β, reduces the deviation in reward-model performance across different target domains, promoting consistent behavior of the reward model across all of them (a code sketch of this weighted loss follows the list).
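A sketch of the weighted training objective under explicit assumptions: the ratios w_a^k are precomputed for every target domain, the base loss is squared error, and the per-example weight is taken as |w_a^k − 1| + β·Σ_{k'} |w_a^k − w_a^{k'}|; the exact functional form in the paper may differ:

```python
import numpy as np

def example_weights(w, k, beta=0.1):
    """Per-example weight for target domain k (illustrative form):
    |w_a^k - 1| + beta * sum_{k'} |w_a^k - w_a^{k'}|.

    w : array of shape (n, K), where w[i, j] = P_{T_j}(a_i | x_i) / P_S(a_i | x_i)
    """
    divergence = np.abs(w[:, k] - 1.0)                       # emphasize non-overlap with S
    consistency = beta * np.abs(w[:, [k]] - w).sum(axis=1)   # tie target domains together
    return divergence + consistency

def weighted_reward_loss(y_true, y_pred, w, k, beta=0.1):
    """Weighted squared-error loss on source-domain data D_S for target domain k."""
    return float(np.mean(example_weights(w, k, beta) * (y_true - y_pred) ** 2))
```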
Implementation and Evaluation
The implementation integrates the domain-adapted reward model into an offline A/B testing system, allowing for a structured evaluation of different ranking policies. The process involves simulating ad recommendations for each target domain, predicting rewards using the trained reward model, calculating lift between target and source domains, and ranking policies based on their estimated lifts.
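A high-level sketch of that loop; `simulate_recommendations` is a hypothetical helper standing in for the offline A/B testing system's replay component, and the reward-model interface matches the earlier sketch:

```python
def rank_policies_offline(target_rankers, source_logs, reward_model):
    """Rank candidate ranking policies by their estimated lift over the source policy S."""
    # Expected reward under the deployed (source) policy.
    y_source = reward_model.predict(source_logs["context"], source_logs["ad"])
    baseline = float(y_source.mean())
    lifts = {}
    for name, ranker in target_rankers.items():
        # Replay historical traffic through the candidate ranker to build target domain T_k.
        target_logs = simulate_recommendations(ranker, source_logs)  # hypothetical helper
        y_target = reward_model.predict(target_logs["context"], target_logs["ad"])
        lifts[name] = float(y_target.mean()) - baseline
    # Highest estimated lift first.
    return sorted(lifts.items(), key=lambda kv: kv[1], reverse=True)
```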
Experimental Results
The paper substantiates its claims with experimental results derived from both synthetic and real-world data.
- Synthetic Data: In a controlled synthetic environment, the proposed reward model demonstrated superior performance compared to a baseline model (trained solely on source-domain data) and the vanilla IPS method. The performance metric was Rec_cv (coefficient of variation of recovery), which measures how accurately each method recovers the ground-truth lift.
- Online Experiment (CTR Prediction): Using data from a completed A/B test for a CTR prediction model, the proposed reward model achieved a 17.6% improvement on the Rec_cv metric compared to a baseline model. Because the propensity score weight is intractable to compute directly in a complex recommendation system, the weights were estimated by training an impression probability estimator per target domain (a sketch of this estimation follows the list).
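A minimal sketch of that estimation, assuming a logistic-regression impression model per domain is an acceptable stand-in for the paper's estimator; the feature construction and clipping threshold are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_impression_model(features, impressed):
    """Fit P(impression | x, a) for one domain from logged (context, ad) features."""
    model = LogisticRegression(max_iter=1000)
    model.fit(features, impressed)
    return model

def propensity_ratios(source_model, target_model, features):
    """Approximate w_a^k = P_{T_k}(a | x) / P_S(a | x) from the two impression models."""
    p_source = source_model.predict_proba(features)[:, 1]
    p_target = target_model.predict_proba(features)[:, 1]
    return p_target / np.clip(p_source, 1e-6, None)  # clip to avoid division by zero
```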
In summary, the paper introduces a domain-adapted reward model for counterfactual evaluation of ads ranking models, particularly in scenarios where traditional model-free methods like IPS are not feasible. The reward model is trained using a weighted loss function that emphasizes the differences between the current (source) policy and the new (target) policies. Experimental results using both synthetic and real-world data demonstrate that the proposed reward model outperforms both a baseline model and the vanilla IPS method.