AIGB-Pearl: Generative Auto-Bidding

Updated 26 September 2025
  • The paper introduces a non-bootstrapped trajectory evaluator that replaces traditional bootstrapped critics with a supervised model leveraging LLM-derived embeddings.
  • It employs a conditional diffusion model with policy gradients to optimize bidding trajectories while remaining regularized to offline data.
  • Empirical results demonstrate significant improvements in GMV, ROI, and stability in large-scale advertising systems during both simulation and online tests.

AIGB-Pearl is a framework for generative auto-bidding, developed to overcome two limitations of prior methods: the difficulty of evaluating the quality of generated trajectories and the inability to explore beyond static offline datasets. Building on the conditional diffusion modeling paradigm introduced in prior AIGB work, AIGB-Pearl integrates a non-bootstrapped trajectory evaluator with generative planning and policy optimization to deliver enhanced stability and performance in large-scale advertising systems (Mou et al., 19 Sep 2025).

1. Motivation and Overview

Standard AIGB methods model auto-bidding as a trajectory generation problem, training a conditional diffusion planner to imitate historical bidding data. While effective, these methods are constrained by two bottlenecks: the inability to systematically explore beyond the support of the offline dataset, and the absence of fine-grained evaluator signals for generation quality. This often impedes the discovery and optimization of higher-quality bidding strategies, particularly when off-policy RL critics are used, as they introduce instability due to bootstrapping and off-policy learning.

AIGB-Pearl ("Planning with EvAluator via RL") addresses this challenge by explicitly constructing a supervised, non-bootstrapped trajectory evaluator to assign quality scores to generated trajectories. The planner is then trained via policy search, directly maximizing the evaluator's score while remaining regularized toward the offline data distribution.

2. Non-Bootstrapped Trajectory Evaluator

The trajectory evaluator replaces the bootstrapped value function commonly used in offline RL. It is trained via supervised learning on offline data, mapping generated bidding trajectories $\tau$ to a scalar quality score $\hat{y}_\phi(\tau)$ (e.g., reflecting total reward or a domain-specific utility metric).

The evaluator differs fundamentally from bootstrapped critics:

  • Parameter Fixing: Its parameters are frozen during planner training, so the planner's policy update loop avoids the shifting target problems that plague iterative RL.
  • Supervised Losses: Quality signals are learned using ground-truth (or proxy) rewards rather than target value functions.
  • Architectural Innovations: The evaluator employs a Causal Transformer to model tokenized bidding sequences, enhanced with LLM-derived contextual embeddings that incorporate side information (product titles, categories, reviews) to more accurately represent campaign properties relevant to bidding effectiveness.
  • Hybrid Losses: Evaluation combines a point-wise (MSE) regression loss for absolute score accuracy with a pair-wise loss (via the Bradley–Terry model) for accurate ranking between bidding trajectories:

$$\mathcal{L}_{\text{point}} = \mathbb{E}_{\tau \sim \mathcal{D}} \left[ \left( \hat{y}^{\text{org}}_{\phi_1}(\tau) - y(\tau) \right)^2 \right]$$

$$\mathcal{L}_{\text{pair}} = \mathbb{E}_{(\tau_w, \tau_l) \sim \mathcal{D}_p} \left[ -\log \sigma \left( \hat{y}^{\text{org}}_{\phi_1}(\tau_w) - \hat{y}^{\text{org}}_{\phi_1}(\tau_l) \right) \right]$$

  • Expert Feedback Integration: A parallel expert score head $\hat{y}^{\text{exp}}_{\phi_2}(\tau)$ is trained to penalize "bad" trajectories based on additional expert labels, accounting for cost distribution, pacing, and domain-specific constraints. The final evaluator score is multiplicative: $\hat{y}_\phi(\tau) = \hat{y}^{\text{org}}_{\phi_1}(\tau) \times \hat{y}^{\text{exp}}_{\phi_2}(\tau)$ (see the sketch after this list).
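The sketch below illustrates, in PyTorch, how the point-wise, pair-wise, and expert terms could be combined and how the multiplicative score might be computed. The `scorer` and `expert_head` modules, the batch layout, and the unit weight on the expert term are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def evaluator_loss(scorer, expert_head, traj, y_true,
                   traj_win, traj_lose, expert_labels, beta1=1.0):
    """Hybrid evaluator loss: point-wise MSE + pair-wise Bradley-Terry
    + cross-entropy on binary expert labels (all module names hypothetical)."""
    # Point-wise regression: match the predicted score to the observed return y(tau).
    y_hat = scorer(traj)                                  # shape [B]
    loss_point = F.mse_loss(y_hat, y_true)

    # Pair-wise ranking: prefer the "winning" trajectory under the Bradley-Terry model.
    s_w, s_l = scorer(traj_win), scorer(traj_lose)        # shapes [B], [B]
    loss_pair = -F.logsigmoid(s_w - s_l).mean()

    # Expert head: binary labels flag pathological trajectories (cost, pacing, etc.).
    p_exp = expert_head(traj)                             # shape [B], values in (0, 1)
    loss_expert = F.binary_cross_entropy(p_exp, expert_labels)

    return loss_point + beta1 * loss_pair + loss_expert

def evaluator_score(scorer, expert_head, traj):
    """Final score is multiplicative: y_org(tau) * y_exp(tau)."""
    return scorer(traj) * expert_head(traj)
```

In this sketch the pair-wise term is weighted by $\beta_1$ as described in Section 4, while the expert term is left unweighted for simplicity.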

3. Generative Planner and Policy Optimization

The planner is a conditional diffusion model, as in the previously established DiffBid (Guo et al., 25 May 2024), generating trajectory samples $\tau$ conditioned on global constraints or desired returns $y^*$. Unlike imitation-based AIGB, AIGB-Pearl uses policy gradients to optimize the planner with respect to the frozen evaluator's estimates.
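For intuition only, a generic DDPM-style reverse process conditioned on a desired return $y^*$ is sketched below; the denoiser interface `eps_model(tau, k, y_star)`, the noise schedule, and the trajectory tensor shape are assumptions rather than the paper's exact sampler.

```python
import torch

@torch.no_grad()
def sample_trajectory(eps_model, y_star, traj_shape, betas):
    """Reverse diffusion: start from noise and iteratively denoise,
    conditioning each step on the desired return y*."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    tau = torch.randn(traj_shape)                 # start from pure noise
    for k in reversed(range(len(betas))):
        eps = eps_model(tau, k, y_star)           # predicted noise (hypothetical API)
        # Standard DDPM posterior mean for the reverse step.
        tau = (tau - betas[k] / torch.sqrt(1 - alpha_bars[k]) * eps) \
              / torch.sqrt(alphas[k])
        if k > 0:
            tau = tau + torch.sqrt(betas[k]) * torch.randn_like(tau)
    return tau                                    # denoised bidding trajectory
```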

The optimization objective is:

$$\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau \mid y^*)} \left[ \hat{y}_\phi(\tau) \right]$$

During training of the diffusion planner, the policy gradient can be written as:

$$\nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau \mid y^*)} \left[ \hat{y}_\phi(\tau) \right] = \mathbb{E}_{\tau_{1:K} \sim p_\theta(\tau_{1:K} \mid y^*)} \left[ \sum_{k} \nabla_\theta \log p_\theta(\tau_k \mid \tau_{k-1}, y^*) \cdot \hat{y}_\phi(\tau_K) \right]$$

A conservative regularization term enforces proximity to offline data:

$$\max_\theta \left[ \mathbb{E}_{\tau \sim p_\theta(\tau \mid y^*)} \left[ \hat{y}_\phi(\tau) \right] + \beta_2 \cdot \mathbb{E}_{(\tau, y(\tau)) \sim \mathcal{D}} \left[ \log p_\theta(\tau \mid y(\tau)) \right] \right]$$

This mitigates extrapolation risk, ensuring the planner does not diverge excessively from historical data.
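A schematic training step combining the score-function gradient with the conservative regularization term might look as follows; `sample_with_log_probs` and `log_prob` are hypothetical planner methods, and the evaluator is assumed to be a frozen callable returning one score per trajectory.

```python
import torch

def planner_update(planner, evaluator, offline_batch, y_star, optimizer, beta2=1.0):
    """One policy-search step: maximize the frozen evaluator's score on sampled
    trajectories while staying close to the offline data distribution."""
    # Sample trajectories through the diffusion chain, keeping per-step
    # log-probabilities log p_theta(tau_k | tau_{k-1}, y*); shape [B, K].
    traj, step_log_probs = planner.sample_with_log_probs(y_star)

    with torch.no_grad():                         # evaluator parameters are frozen
        score = evaluator(traj)                   # \hat{y}_phi(tau_K), shape [B]

    # Score-function (REINFORCE-style) estimator: sum of step log-probs, weighted by the score.
    pg_objective = (step_log_probs.sum(dim=1) * score).mean()

    # Conservative term: log-likelihood of offline trajectories under the planner.
    offline_traj, offline_return = offline_batch
    bc_objective = planner.log_prob(offline_traj, offline_return).mean()

    loss = -(pg_objective + beta2 * bc_objective)  # maximize objective => minimize negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```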

4. Enhancements to Evaluator Reliability

To increase evaluator generalization and accuracy:

  • LLM Embeddings: Prompts containing advertiser or campaign details are encoded with a pre-trained LLM and supplied as embeddings to the Causal Transformer, enabling multimodal context fusion (a minimal sketch follows at the end of this section).
  • Hybrid Loss Balancing: Point- and pair-wise losses are combined with a weighting parameter $\beta_1$, balancing absolute value estimation and ranking.
  • Expert Supervision: Binary expert labels are ingested via cross-entropy loss, reducing overestimation on out-of-distribution or pathological trajectories.

These techniques jointly enhance the evaluator's ability to assign meaningful scores both on the support of $\mathcal{D}$ and in moderately out-of-distribution regions reached by planner exploration.
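To illustrate the LLM-embedding idea from the first bullet above, the snippet below encodes campaign side information with a pre-trained Hugging Face encoder and mean-pools it into a context vector. The model choice (`bert-base-uncased` as a small stand-in for the LLM), the pooling scheme, and the fusion step are assumptions, not the paper's pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder encoder
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def campaign_context_embedding(prompt: str) -> torch.Tensor:
    """Encode a textual prompt (product title, category, reviews) into a
    fixed-size context vector via mean pooling of the last hidden states."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state      # [1, seq_len, hidden]
    return hidden.mean(dim=1).squeeze(0)              # [hidden]

# The context vector can then be projected and prepended to the tokenized
# bidding sequence consumed by the Causal Transformer evaluator, e.g.:
# tokens = torch.cat([proj(context).unsqueeze(0), trajectory_tokens], dim=0)
```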

5. Empirical Results and Impact

Experiments demonstrate that AIGB-Pearl achieves substantial improvements over both conventional offline RL (USCB, BCQ, CQL, IQL) and prior AIGB (DiffBid, DT) methods:

  • Simulated Environment: In offline controlled settings, AIGB-Pearl yields higher GMV and better generalization to advertisers not seen in the offline dataset, showing robust policy learning.
  • Online Deployment: In A/B tests on Taobao, observable improvements in GMV, purchase count (BuyCnt), and ROI are realized, with cost rates kept stable (within a 2% deviation). Online-rate and bad-case-rate metrics further validate enhanced stability and a lower risk of failure modes.

The planner-evaluator loop at the core of AIGB-Pearl consistently produces higher-quality bidding plans than purely imitative prior approaches, demonstrating state-of-the-art performance.

6. Practical and Methodological Significance

AIGB-Pearl delivers several methodological advances:

  • Non-bootstrapped Supervised Evaluation: Decouples reward estimation from policy iteration instability, avoiding value function drift and bootstrapping pathologies.
  • Stable Policy Search for Generative Planning: Policy gradients delivered by the frozen evaluator allow reliable improvement of the conditional diffusion planner, supporting both exploitation (score maximization) and conservation (offline regularization).
  • Enhanced Representational Power: LLM-derived context embeddings and hybrid objective losses yield an evaluator that reflects both intrinsic and domain-specific utility.
  • Domain Adaptability: The modular evaluator architecture can incorporate various forms of domain feedback (including human expertise), making it extensible to other high-stakes domains beyond advertising.

A plausible implication is that frameworks following this pattern—offline-trained deterministic evaluators coupled with generative policy optimization—may become foundational in high-consequence, data-rich domains where static imitation is insufficient.

7. Conclusion

AIGB-Pearl systematically advances the practical deployment of generative auto-bidding. By fusing conditional diffusion-based planning with an accurate, domain-informed evaluator and stable policy search, it demonstrates superior outcomes both offline and online, marking a new methodological standard for reward-guided, stable learning in large-scale advertising systems (Mou et al., 19 Sep 2025).
