AIGB-Pearl: Generative Auto-Bidding

Updated 26 September 2025
  • The paper introduces a non-bootstrapped trajectory evaluator that replaces traditional bootstrapped critics with a supervised model leveraging LLM-derived embeddings.
  • It employs a conditional diffusion model with policy gradients to optimize bidding trajectories while remaining regularized to offline data.
  • Empirical results demonstrate significant improvements in GMV, ROI, and stability in large-scale advertising systems during both simulation and online tests.

AIGB-Pearl is a framework for generative auto-bidding, developed to overcome two limitations of prior methods: the difficulty of evaluating the quality of generated trajectories and the inability to explore beyond static offline datasets. Building on the conditional diffusion modeling paradigm introduced in prior AIGB work, AIGB-Pearl integrates a non-bootstrapped trajectory evaluator with generative planning and policy optimization to deliver enhanced stability and performance in large-scale advertising systems (Mou et al., 19 Sep 2025).

1. Motivation and Overview

Standard AIGB methods model auto-bidding as a trajectory generation problem, training a conditional diffusion planner to imitate historical bidding data. While effective, these methods are constrained by two bottlenecks: the inability to systematically explore beyond the support of the offline dataset, and the absence of fine-grained evaluator signals for generation quality. This often impedes the discovery and optimization of higher-quality bidding strategies, particularly when off-policy RL critics are used, as they introduce instability due to bootstrapping and off-policy learning.

AIGB-Pearl ("Planning with EvAluator via RL") addresses this challenge by explicitly constructing a supervised, non-bootstrapped trajectory evaluator to assign quality scores to generated trajectories. The planner is then trained via policy search, directly maximizing the evaluator's score while remaining regularized toward the offline data distribution.

2. Non-Bootstrapped Trajectory Evaluator

The trajectory evaluator replaces the bootstrapped value function commonly used in offline RL. It is trained via supervised learning on offline data, mapping generated bidding trajectories $\tau$ to a scalar quality score $\hat{y}_\phi(\tau)$ (e.g., reflecting total reward or a domain-specific utility metric).

The evaluator differs fundamentally from bootstrapped critics:

  • Parameter Fixing: Its parameters are frozen during planner training, so the planner's policy update loop avoids the shifting target problems that plague iterative RL.
  • Supervised Losses: Quality signals are learned using ground-truth (or proxy) rewards rather than target value functions.
  • Architectural Innovations: The evaluator employs a Causal Transformer to model tokenized bidding sequences, enhanced with LLM-derived contextual embeddings that incorporate side information (product titles, categories, reviews) to more accurately represent campaign properties relevant to bidding effectiveness.
  • Hybrid Losses: Evaluation combines a point-wise (MSE) regression loss for absolute score accuracy with a pair-wise loss (via the Bradley–Terry model) for accurate ranking between bidding trajectories:

$$\mathcal{L}_{\text{point}} = \mathbb{E}_{\tau \sim \mathcal{D}} \left[ \left( \hat{y}^{\text{org}}_{\phi_1}(\tau) - y(\tau) \right)^2 \right]$$

$$\mathcal{L}_{\text{pair}} = \mathbb{E}_{(\tau_w, \tau_l) \sim \mathcal{D}_p} \left[ -\log \sigma \left( \hat{y}^{\text{org}}_{\phi_1}(\tau_w) - \hat{y}^{\text{org}}_{\phi_1}(\tau_l) \right) \right]$$

  • Expert Feedback Integration: A parallel expert score head $\hat{y}^{\text{exp}}_{\phi_2}(\tau)$ is trained to penalize "bad" trajectories based on additional expert labels, accounting for cost distribution, pacing, and domain-specific constraints. The final evaluator score is multiplicative: $\hat{y}_\phi(\tau) = \hat{y}^{\text{org}}_{\phi_1}(\tau) \times \hat{y}^{\text{exp}}_{\phi_2}(\tau)$ (see the sketch after this list).
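The sketch below illustrates, in PyTorch, how the point-wise, pair-wise, and expert terms could be combined and how the multiplicative score might be computed. The `scorer` and `expert_head` modules, the batch layout, and the unit weight on the expert term are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def evaluator_loss(scorer, expert_head, traj, y_true,
                   traj_win, traj_lose, expert_labels, beta1=1.0):
    """Hybrid evaluator loss: point-wise MSE + pair-wise Bradley-Terry
    + cross-entropy on binary expert labels (all module names hypothetical)."""
    # Point-wise regression: match the predicted score to the observed return y(tau).
    y_hat = scorer(traj)                                  # shape [B]
    loss_point = F.mse_loss(y_hat, y_true)

    # Pair-wise ranking: prefer the "winning" trajectory under the Bradley-Terry model.
    s_w, s_l = scorer(traj_win), scorer(traj_lose)        # shapes [B], [B]
    loss_pair = -F.logsigmoid(s_w - s_l).mean()

    # Expert head: binary labels flag pathological trajectories (cost, pacing, etc.).
    p_exp = expert_head(traj)                             # shape [B], values in (0, 1)
    loss_expert = F.binary_cross_entropy(p_exp, expert_labels)

    return loss_point + beta1 * loss_pair + loss_expert

def evaluator_score(scorer, expert_head, traj):
    """Final score is multiplicative: y_org(tau) * y_exp(tau)."""
    return scorer(traj) * expert_head(traj)
```

In this sketch the pair-wise term is weighted by $\beta_1$ as described in Section 4, while the expert term is left unweighted for simplicity.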

3. Generative Planner and Policy Optimization

The planner is a conditional diffusion model, as in the previously established DiffBid (Guo et al., 25 May 2024), generating trajectory samples $\tau$ conditioned on global constraints or desired returns $y^*$. Unlike imitation-based AIGB, AIGB-Pearl uses policy gradients to optimize the planner with respect to the frozen evaluator's estimates.
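For intuition only, a generic DDPM-style reverse process conditioned on a desired return $y^*$ is sketched below; the denoiser interface `eps_model(tau, k, y_star)`, the noise schedule, and the trajectory tensor shape are assumptions rather than the paper's exact sampler.

```python
import torch

@torch.no_grad()
def sample_trajectory(eps_model, y_star, traj_shape, betas):
    """Reverse diffusion: start from noise and iteratively denoise,
    conditioning each step on the desired return y*."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    tau = torch.randn(traj_shape)                 # start from pure noise
    for k in reversed(range(len(betas))):
        eps = eps_model(tau, k, y_star)           # predicted noise (hypothetical API)
        # Standard DDPM posterior mean for the reverse step.
        tau = (tau - betas[k] / torch.sqrt(1 - alpha_bars[k]) * eps) \
              / torch.sqrt(alphas[k])
        if k > 0:
            tau = tau + torch.sqrt(betas[k]) * torch.randn_like(tau)
    return tau                                    # denoised bidding trajectory
```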

The optimization objective is:

$$\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau \mid y^*)} \left[ \hat{y}_\phi(\tau) \right]$$

During training of the diffusion planner, the policy gradient can be written as:

$$\nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau \mid y^*)} \left[ \hat{y}_\phi(\tau) \right] = \mathbb{E}_{\tau_{1:K} \sim p_\theta(\tau_{1:K} \mid y^*)} \left[ \sum_{k} \nabla_\theta \log p_\theta(\tau_k \mid \tau_{k-1}, y^*) \cdot \hat{y}_\phi(\tau_K) \right]$$

A conservative regularization term enforces proximity to offline data:

$$\max_\theta \left[ \mathbb{E}_{\tau \sim p_\theta(\tau \mid y^*)} \left[ \hat{y}_\phi(\tau) \right] + \beta_2 \cdot \mathbb{E}_{(\tau, y(\tau)) \sim \mathcal{D}} \left[ \log p_\theta(\tau \mid y(\tau)) \right] \right]$$

This mitigates extrapolation risk, ensuring the planner does not diverge excessively from historical data.
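A schematic training step combining the score-function gradient with the conservative regularization term might look as follows; `sample_with_log_probs` and `log_prob` are hypothetical planner methods, and the evaluator is assumed to be a frozen callable returning one score per trajectory.

```python
import torch

def planner_update(planner, evaluator, offline_batch, y_star, optimizer, beta2=1.0):
    """One policy-search step: maximize the frozen evaluator's score on sampled
    trajectories while staying close to the offline data distribution."""
    # Sample trajectories through the diffusion chain, keeping per-step
    # log-probabilities log p_theta(tau_k | tau_{k-1}, y*); shape [B, K].
    traj, step_log_probs = planner.sample_with_log_probs(y_star)

    with torch.no_grad():                         # evaluator parameters are frozen
        score = evaluator(traj)                   # \hat{y}_phi(tau_K), shape [B]

    # Score-function (REINFORCE-style) estimator: sum of step log-probs, weighted by the score.
    pg_objective = (step_log_probs.sum(dim=1) * score).mean()

    # Conservative term: log-likelihood of offline trajectories under the planner.
    offline_traj, offline_return = offline_batch
    bc_objective = planner.log_prob(offline_traj, offline_return).mean()

    loss = -(pg_objective + beta2 * bc_objective)  # maximize objective => minimize negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```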

4. Enhancements to Evaluator Reliability

To increase evaluator generalization and accuracy:

  • LLM Embeddings: Prompts containing advertiser or campaign details are encoded with a pre-trained LLM and supplied as embeddings to the Causal Transformer, enabling multimodal context fusion (a minimal sketch follows at the end of this section).
  • Hybrid Loss Balancing: Point- and pair-wise losses are combined with a weighting parameter $\beta_1$, balancing absolute value estimation and ranking.
  • Expert Supervision: Binary expert labels are ingested via cross-entropy loss, reducing overestimation on out-of-distribution or pathological trajectories.

These techniques jointly enhance the evaluator's ability to assign meaningful scores both on the support of $\mathcal{D}$ and in moderately out-of-distribution regions reached by planner exploration.
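To illustrate the LLM-embedding idea from the first bullet above, the snippet below encodes campaign side information with a pre-trained Hugging Face encoder and mean-pools it into a context vector. The model choice (`bert-base-uncased` as a small stand-in for the LLM), the pooling scheme, and the fusion step are assumptions, not the paper's pipeline.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder encoder
encoder = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def campaign_context_embedding(prompt: str) -> torch.Tensor:
    """Encode a textual prompt (product title, category, reviews) into a
    fixed-size context vector via mean pooling of the last hidden states."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    hidden = encoder(**inputs).last_hidden_state      # [1, seq_len, hidden]
    return hidden.mean(dim=1).squeeze(0)              # [hidden]

# The context vector can then be projected and prepended to the tokenized
# bidding sequence consumed by the Causal Transformer evaluator, e.g.:
# tokens = torch.cat([proj(context).unsqueeze(0), trajectory_tokens], dim=0)
```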

5. Empirical Results and Impact

Experiments demonstrate that AIGB-Pearl achieves substantial improvements over both conventional offline RL (USCB, BCQ, CQL, IQL) and prior AIGB (DiffBid, DT) methods:

  • Simulated Environment: In offline controlled settings, AIGB-Pearl yields higher GMV and better generalization to advertisers not seen in the offline dataset, showing robust policy learning.
  • Online Deployment: In A/B tests on Taobao, observable improvements in GMV, purchase count (BuyCnt), and ROI are realized, with cost rates kept stable (within a 2% deviation). Online-rate and bad-case-rate metrics further validate enhanced stability and a lower risk of failure modes.

The planner-evaluator loop at the core of AIGB-Pearl consistently produces higher-quality bidding plans than purely imitative prior approaches, demonstrating state-of-the-art performance.

6. Practical and Methodological Significance

AIGB-Pearl delivers several methodological advances:

  • Non-bootstrapped Supervised Evaluation: Decouples reward estimation from policy iteration instability, avoiding value function drift and bootstrapping pathologies.
  • Stable Policy Search for Generative Planning: Policy gradients delivered by the frozen evaluator allow reliable improvement of the conditional diffusion planner, supporting both exploitation (score maximization) and conservation (offline regularization).
  • Enhanced Representational Power: LLM-derived context embeddings and hybrid objective losses yield an evaluator that reflects both intrinsic and domain-specific utility.
  • Domain Adaptability: The modular evaluator architecture can incorporate various forms of domain feedback (including human expertise), making it extensible to other high-stakes domains beyond advertising.

A plausible implication is that frameworks following this pattern—offline-trained deterministic evaluators coupled with generative policy optimization—may become foundational in high-consequence, data-rich domains where static imitation is insufficient.

7. Conclusion

AIGB-Pearl systematically advances the practical deployment of generative auto-bidding. By fusing conditional diffusion-based planning with an accurate, domain-informed evaluator and stable policy search, it demonstrates superior outcomes both offline and online, marking a new methodological standard for reward-guided, stable learning in large-scale advertising systems (Mou et al., 19 Sep 2025).
