DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs

Published 21 Jan 2026 in cs.AI and cs.LG | (2601.14711v1)

Abstract: Optimizing the advertiser's cumulative value of winning impressions under budget constraints poses a complex challenge in online advertising, under the paradigm of AI-Generated Bidding (AIGB). Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. LLMs offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data. However, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that enhances both reasoning and numerical precision by dynamically updating the reference policy during training. Built upon this foundation, we further propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans using feedback-driven reasoning. This separation allows DARA to combine LLMs' in-context learning strengths with precise adaptability required by AIGB tasks. Extensive experiments on both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in terms of cumulative advertiser value under budget constraints.


Authors (8)

Summary

The paper introduces DARA, a dual-phase framework that combines few-shot in-context learning with RL fine-tuning to optimize sequential budget allocations in online advertising.
It decomposes decision-making into a heuristic initialization phase and a fine-grained optimizer phase, leading to significantly reduced marginal ROI variance compared to baseline models.
The GRPO-Adaptive algorithm enhances numerical precision and stability, ensuring robust performance in real-time bidding environments with limited data.

Dual-Phase RL-Finetuned LLMs for Few-Shot Budget Allocation in Online Advertising

Introduction

Optimally allocating an advertising budget across multiple time periods in real-time bidding (RTB) environments means maximizing cumulative advertiser value subject to a budget constraint, under highly dynamic and data-limited conditions. Existing approaches based on RL or LLMs are limited by sample inefficiency, poor adaptability in few-shot settings, or a lack of numerical precision in structured optimization. The paper "DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs" (2601.14711) introduces DARA, a Dual-phase Adaptive Reasoning and Allocation framework, which addresses these limitations by decomposing the planning process and combining few-shot in-context learning with RL-driven optimization.

Problem Formulation and Environment Modeling

The core challenge is formulated as a sequential decision problem with delayed rewards and non-stationary dynamics. The per-period budgets $b_t$ across $T$ periods are chosen to maximize the sum of expected returns $v_t(b_t)$, subject to the constraint $\sum_t b_t = B$, with each $v_t$ assumed strictly increasing and concave. At the optimum, the marginal ROI is equal across all periods. In practice, this condition is approximated by minimizing the variance of marginal ROI across periods.
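The equal-marginal-ROI condition can be recovered with a standard Lagrangian argument. The sketch below is not taken from the paper: it assumes each $v_t$ is differentiable, that the optimum is interior ($b_t^* > 0$), and that marginal ROI is read as the derivative $v_t'(b_t)$. The multiplier $\lambda$ is a symbol introduced here for illustration.

```latex
\begin{aligned}
\mathcal{L}(b, \lambda) &= \sum_{t=1}^{T} v_t(b_t) - \lambda \Big( \sum_{t=1}^{T} b_t - B \Big), \\
\frac{\partial \mathcal{L}}{\partial b_t} &= v_t'(b_t^*) - \lambda = 0
  \quad \Longrightarrow \quad v_1'(b_1^*) = \cdots = v_T'(b_T^*) = \lambda, \\
\operatorname{Var}\big[ v_t'(b_t) \big] &= \frac{1}{T} \sum_{t=1}^{T} \Big( v_t'(b_t) - \frac{1}{T} \sum_{s=1}^{T} v_s'(b_s) \Big)^{2} \ge 0 .
\end{aligned}
```

The variance in the last line is zero exactly when all marginals coincide, which is why driving it toward zero serves as a practical proxy for the optimality condition. Concavity of each $v_t$ ensures the stationary point is a maximum.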

Figure 1: Empirical marginal ROI curves in the real-world data environment characterize the allocation-reward tradeoff inherent in online advertising platforms.

The study uses both enterprise-scale, real-world environments and synthetic simulators built from polynomial functions, which supports benchmarking and systematic tests of generalization across different regimes.
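To make the synthetic setting concrete, here is a minimal sketch of a polynomial simulator and the marginal ROI variance metric. The quadratic form $v_t(b) = a_t b - c_t b^2$, the coefficients, and the function names are all assumptions chosen for illustration; the paper's actual simulators may differ.

```python
from statistics import pvariance

def make_period(a, c):
    """Build one period with v(b) = a*b - c*b**2 (concave for c > 0,
    increasing while b < a / (2*c)). Returns (value_fn, marginal_fn)."""
    def value(b):
        return a * b - c * b * b
    def marginal(b):
        # Derivative of v with respect to budget, read here as marginal ROI.
        return a - 2.0 * c * b
    return value, marginal

# Hypothetical coefficients for three periods.
PERIODS = [make_period(10.0, 0.5), make_period(8.0, 0.25), make_period(6.0, 0.125)]

def total_value(alloc, periods=PERIODS):
    """Cumulative value of an allocation across all periods."""
    return sum(v(b) for (v, _), b in zip(periods, alloc))

def marginal_roi_variance(alloc, periods=PERIODS):
    """Population variance of per-period marginal ROI."""
    return pvariance([m(b) for (_, m), b in zip(periods, alloc)])

uniform = [5.0, 5.0, 5.0]    # naive even split of B = 15
balanced = [5.0, 6.0, 4.0]   # equalizes marginal ROI for these coefficients

print(marginal_roi_variance(uniform), total_value(uniform))
print(marginal_roi_variance(balanced), total_value(balanced))
```

For these particular coefficients, the allocation that equalizes marginals also yields the higher total value, matching the optimality condition described earlier.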

Dual-Phase Cooperative Agent Architectural Design

Single-stage LLM architectures are empirically shown to be insufficient for tasks requiring simultaneous generalization and fine-grained adaptation. DARA introduces a dual-phase paradigm:

Few-shot Reasoner: Utilizes structured in-context prompts encoding historical episode data to generate initial allocation heuristics. This agent leverages the LLM’s inductive priors and compositional reasoning for rapid few-shot adaptation.
Fine-grained Optimizer: Refines the preliminary allocation iteratively, exploiting feedback-driven optimization and marginal ROI signals from recent episodes. This agent updates allocations within a sliding window, focusing on local numerical optimization and reward balancing.

Figure 2: The DARA training workflow illustrates the separation of strategic initialization and subsequent fine-grained feedback-based adaptation.

This separation of concerns enables both rapid cold-start generalization and precise, numerically sensitive adjustment in evolving online markets.
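The control flow of the two phases can be sketched without any LLM. In the hedged example below, both agents are replaced by simple stand-in functions: `few_shot_reasoner` reuses the proportions of the best past episode, and `fine_grained_optimizer` shifts budget from the lowest-marginal period to the highest. These heuristics, the quadratic environment, and the halving step size are all assumptions for illustration and are not the paper's prompts or models.

```python
from statistics import pvariance

# Hypothetical concave environment: v_t(b) = A[t]*b - C[t]*b**2.
A = [10.0, 8.0, 6.0]
C = [0.5, 0.25, 0.125]

def value(alloc):
    return sum(a * b - c * b * b for a, c, b in zip(A, C, alloc))

def marginals(alloc):
    return [a - 2.0 * c * b for a, c, b in zip(A, C, alloc)]

def few_shot_reasoner(history, budget):
    """Phase 1 stand-in: rescale the best past episode's allocation to the budget."""
    best_alloc, _ = max(history, key=lambda episode: episode[1])
    total = sum(best_alloc)
    return [budget * b / total for b in best_alloc]

def fine_grained_optimizer(alloc, marg, step):
    """Phase 2 stand-in: move up to `step` budget from the period with the
    lowest marginal ROI to the period with the highest."""
    hi = max(range(len(alloc)), key=lambda t: marg[t])
    lo = min(range(len(alloc)), key=lambda t: marg[t])
    delta = min(step, alloc[lo])  # never drive a period's budget negative
    refined = list(alloc)
    refined[lo] -= delta
    refined[hi] += delta
    return refined

def run_dara_sketch(history, budget, rounds=10, step=0.5):
    alloc = few_shot_reasoner(history, budget)
    initial = list(alloc)
    for _ in range(rounds):
        alloc = fine_grained_optimizer(alloc, marginals(alloc), step)
        step *= 0.5  # assumed schedule: shrink the adjustment each round
    return initial, alloc

past = [[4.0, 4.0, 4.0], [6.0, 3.0, 3.0]]
history = [(alloc, value(alloc)) for alloc in past]
initial, final = run_dara_sketch(history, budget=15.0)
print(pvariance(marginals(initial)), pvariance(marginals(final)))
```

The point of the sketch is the division of labor: the first function only needs a few past episodes to produce a reasonable starting plan, while the second uses marginal ROI feedback to make small, budget-preserving corrections.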

RL Fine-tuning: GRPO-Adaptive Algorithm

To address LLMs' limitations in numerical precision and reasoning, the paper introduces GRPO-Adaptive, an extension of GRPO that periodically refreshes the reference policy used in KL regularization. Instead of serving as a static anchor, the reference policy is updated every $M$ steps, so the KL term keeps constraining each update while the anchor itself follows the improving policy.

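The loss function combines a clipped, advantage-weighted PPO-style objective $\mathcal{L}_{\text{adv}}$ with a KL penalty, weighted by $\beta$, against the periodically refreshed reference policy $\pi_{\text{ref}}$. This mechanism suppresses policy drift while allowing progressive improvement: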

$\mathcal{L}_{\text{GRPO-A}}(\theta) = -\mathcal{L}_{\text{adv}}(\theta) + \beta\, \mathbb{D}_{\text{KL}}[\pi_{\theta} \| \pi_{\text{ref}}]$
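A toy numerical reading of this objective is sketched below for a small categorical policy. It is an illustration under stated assumptions rather than the paper's implementation: the policy is a plain softmax over logits, the KL term is computed exactly instead of with a sampled estimator, advantages are group-normalized rewards, and the values of `beta`, `eps`, and the helper names are hypothetical.

```python
import math
from statistics import mean, pstdev

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q):
    """Exact KL[p || q] for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def group_advantages(rewards):
    """Group-relative advantages: rewards standardized within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def grpo_adaptive_loss(logits, old_logits, ref_logits, actions, rewards,
                       beta=0.04, eps=0.2):
    """-L_adv + beta * KL[pi_theta || pi_ref], with a clipped surrogate for L_adv."""
    pi, pi_old, pi_ref = softmax(logits), softmax(old_logits), softmax(ref_logits)
    surrogate = []
    for a, adv in zip(actions, group_advantages(rewards)):
        ratio = pi[a] / pi_old[a]
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        surrogate.append(min(ratio * adv, clipped * adv))
    return -mean(surrogate) + beta * kl_divergence(pi, pi_ref)

def maybe_refresh_reference(step, logits, ref_logits, refresh_every):
    """Copy the current policy into the reference every `refresh_every` steps (M)."""
    if step > 0 and step % refresh_every == 0:
        return list(logits)
    return ref_logits

logits = [0.0, 1.0, 2.0]
loss = grpo_adaptive_loss(logits, logits, logits, actions=[0, 1, 2], rewards=[1.0, 2.0, 3.0])
print(loss)
```

With a static reference, `maybe_refresh_reference` would simply never copy; the adaptive variant differs only in that periodic copy, which resets the KL term to zero at each refresh.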

Reward design incorporates environment reward for marginal ROI balancing, constraint penalties for action validity, and bonus shaping for adaptive refinement based on prior performance. Controlled multi-environment sampling procedures in training prevent overfitting and promote robust generalization.
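One way the three reward components could fit together is sketched below. The functional form, the weights `penalty` and `bonus`, and the tolerance are assumptions made for illustration; the summary does not specify how the paper combines or scales these terms.

```python
from statistics import pvariance

def shaped_reward(alloc, marginals, budget, prev_variance=None,
                  penalty=1.0, bonus=0.5, tol=1e-6):
    """Hypothetical reward: constraint penalty, environment term, improvement bonus."""
    # Constraint penalty: the action must be non-negative and spend exactly the budget.
    valid = all(b >= 0 for b in alloc) and abs(sum(alloc) - budget) <= tol
    if not valid:
        return -penalty
    # Environment reward: lower marginal ROI variance is better.
    variance = pvariance(marginals)
    reward = -variance
    # Bonus shaping: reward refinements that improve on the previous attempt.
    if prev_variance is not None and variance < prev_variance:
        reward += bonus
    return reward

print(shaped_reward([5.0, 5.0, 5.0], [5.0, 5.5, 4.75], budget=15.0))
print(shaped_reward([9.0, 9.0, 9.0], [1.0, 3.5, 3.75], budget=15.0))
```

Returning early on invalid actions keeps the penalty from being offset by an otherwise attractive variance, which is one simple way to prioritize action validity.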

Experimental Results and Performance Evaluation

The framework is evaluated against several baselines, including ABPlanner', HiBid', Q-MCKP, DPO, and GPT-4o. Across all experiments, DARA demonstrates consistently lower marginal ROI variance than alternatives, indicating superior allocation stability and balanced exploitation of budget resources.

Figure 3: DARA achieves substantial reductions in marginal ROI variance, outperforming all baseline algorithms in dynamic online budget allocation.

Ablation studies decompose performance gains:

Single LLM baseline exhibits poor adaptation.
RL fine-tuning improves precision but remains constrained by architectural bottlenecks.
Dual-phase architecture boosts consistency.
Full DARA (dual-phase + RL fine-tuning) achieves the lowest variance, confirming the advantage of synergistic decomposition and RL-enabled specialization.

Figure 4: The dual-phase architecture combined with RL fine-tuning yields the lowest observed marginal ROI variance, indicating strong specialization and feedback integration.

Sensitivity analyses show robustness across temporal granularities (number of budget periods) and demonstrate the critical role of the KL reference update frequency: intermediate refresh intervals (e.g., $M=60$) maximize stability and convergence rates, while a static reference or excessively frequent updates lead to degraded or unstable performance.

Figure 5: DARA consistently outperforms the strongest baseline across different temporal granularity settings, with improvements observed in all configurations.

Figure 6: KL reference refresh frequency modulates training stability; periodic updates enable more adaptive and aligned policy optimization.

Implications and Future Directions

The DARA framework offers a modular approach for online budget allocation in scenarios dominated by few-shot conditions, diverse advertiser goals, and volatile environments. The explicit task decomposition into reasoning and adaptation agents aligns well with heterogeneous cognitive demands. RL fine-tuning with dynamic baseline regularization provides a path for overcoming LLM limitations in structured optimization and feedback integration.

Practically, DARA’s architecture is deployable in large-scale online advertising systems, enabling interpretability, sample-efficient adaptation, and stability under evolving auction conditions. Theoretically, the framework opens inquiry into broader multi-agent LLM ensembles, reinforcement learning for decision-theoretic symbolic reasoning, and advanced regularization schemes for policy alignment.

Future work may extend these paradigms to cross-channel, multi-agent budget optimization, adversarial scenarios, and increased interaction between human-in-the-loop feedback and automated model refinement, as RLHF and LLM capabilities further mature.

Conclusion

"DARA: Few-shot Budget Allocation in Online Advertising via In-Context Decision Making with RL-Finetuned LLMs" presents a dual-phase hybrid architecture that balances few-shot generalization with fine-grained numerical optimization. The GRPO-Adaptive RL fine-tuning strategy stabilizes learning trajectories while driving continual policy improvement. The empirical results support the paper's claims of reduced marginal ROI variance and improved adaptation relative to the baselines. This work is a step toward integrating structured RL objectives with LLM reasoning for decision optimization in realistic, low-data industrial domains.
