
ABC-RL: Adaptive Behavioral Costs in RL

Updated 20 November 2025
  • ABC-RL denotes two distinct lines of work: adaptive behavioral cost shaping to curb non-naturalistic behaviors in deep RL agents, and Approximate Bayesian Computation for likelihood-free, simulation-based Bayesian RL.
  • The behavioral-cost variant employs dynamic penalty weighting and a surrogate objective to balance reward maximization against human-like policy constraints.
  • It demonstrates robust performance in environments such as Unity ML-Agents and DMLab-30, achieving near-baseline returns with greatly reduced behavioral costs.

Adaptive Behavioral Costs in Reinforcement Learning (ABC-RL) refers to a family of methodologies designed to address non-naturalistic behavior in reinforcement learning (RL) agents or to enable likelihood-free, simulation-based Bayesian RL. The nomenclature ABC-RL is associated with two distinct but prominent lines of research: (1) adaptive behavioral cost shaping to produce human-like policies in deep RL (Ho et al., 2023), and (2) Approximate Bayesian Computation for reinforcement learning, which enables effective policy learning with black-box dynamics simulators (Dimitrakakis et al., 2013). Both approaches provide novel solutions to longstanding challenges in reinforcement learning, either behavioral or algorithmic.

1. Formalism and Problem Statement

Adaptive Behavioral Cost RL

The primary objective in adaptive behavioral cost RL is to train agents that not only maximize standard return but also behave in ways that are human-like or otherwise desirable in terms of motion or strategy (Ho et al., 2023). The core problem is formulated as:

$$\min_{\theta} \; J_c(\pi_\theta) \quad \text{subject to} \quad J_v(\pi_\theta) \ge V_{\rm th}$$

where:

  • $J_v(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^\infty \gamma^t r_t\right]$ is the conventional discounted return,
  • $J_c(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^\infty \gamma^t C(t)\right]$ denotes the discounted behavioral cost,
  • $V_{\rm th}$ is a chosen threshold, typically set as a proportion of the unconstrained optimum (e.g., 80%).

The approach employs an augmented Lagrangian to solve the constrained problem, incorporating multipliers and penalty coefficients, and ultimately yielding a surrogate objective that allows for standard RL optimization with dynamic regularization.
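One standard augmented Lagrangian for this inequality constraint is sketched below; this is a generic construction, and the exact surrogate used by Ho et al. (2023) may differ in its multiplier updates and penalty schedule:

$$\mathcal{L}(\theta, \lambda) = J_c(\pi_\theta) + \lambda \left( V_{\rm th} - J_v(\pi_\theta) \right) + \frac{\mu}{2} \max\!\left(0,\, V_{\rm th} - J_v(\pi_\theta)\right)^2$$

where $\lambda \ge 0$ is the Lagrange multiplier and $\mu > 0$ the penalty coefficient. Minimizing $\mathcal{L}$ over $\theta$ while periodically updating $\lambda \leftarrow \max\!\left(0,\, \lambda + \mu\,(V_{\rm th} - J_v(\pi_\theta))\right)$ reduces each inner step to ordinary policy optimization with a dynamically weighted behavioral regularizer.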

Approximate Bayesian Computation RL

ABC-RL in the Bayesian context seeks robust policy learning without explicit likelihoods or tractable transition models (Dimitrakakis et al., 2013). Given a parameterized family of environment simulators $\mathcal{M}_\theta$ with prior $\beta(\theta)$, the Bayesian-RL objective is expressed as:

$$\pi^*_{\beta} = \arg\max_\pi \; \mathbb{E}_{\mathcal{M} \sim \beta} \, \mathbb{E}_{\mathbb{P}_{\mathcal{M}}^{\pi}} \left[ U \right]$$

where $U$ is the discounted utility, and learning proceeds by constructing an approximate posterior over $\mathcal{M}$ using ABC sampling: models that generate trajectories sufficiently similar (in terms of summary statistics $f$) to the observed history are retained.
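Concretely, rejection ABC places posterior mass on models whose simulated summaries fall within a tolerance $\epsilon$ of the observed ones; a standard way to write the resulting indirect posterior (a sketch consistent with generic rejection ABC, not necessarily the paper's exact notation) is:

$$\beta_\epsilon(\mathcal{M} \mid h) \;\propto\; \beta(\mathcal{M}) \, \Pr_{h' \sim \mathbb{P}_{\mathcal{M}}^{\pi}}\!\left( \|f(h') - f(h)\| \le \epsilon \right)$$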

2. Methodologies

a. Behavioral Cost Penalization and Adaptive Weighting

ABC-RL penalizes undesirable behaviors by assigning interpretable costs to specific agent actions—chiefly, “shaking” and “spinning” in 3D video game environments:

  • Shaking cost $C_{\rm sh}(t)$ is the normalized count of left-right reversals in a sliding window.
  • Spinning cost $C_{\rm sp}(t)$ is the count of completed 360° rotations.

The total behavioral cost per step is $C(t) = C_{\rm sh}(t) + \alpha C_{\rm sp}(t)$, with $\alpha$ adjustable.
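The sketch below illustrates how such costs could be computed from a buffer of recent yaw (turning) actions; the window length, normalization, and action encoding are assumptions made for exposition, not the exact definitions of Ho et al. (2023):

```python
from collections import deque
import numpy as np

def behavioral_cost(yaw_history, window=32, alpha=1.0):
    """Illustrative per-step cost C(t) = C_sh(t) + alpha * C_sp(t).

    yaw_history holds recent per-step yaw changes in degrees
    (positive = right turn, negative = left turn).
    """
    recent = np.asarray(list(yaw_history)[-window:], dtype=float)
    signs = np.sign(recent)

    # Shaking: normalized count of left/right reversals within the window.
    reversals = sum(
        1 for a, b in zip(signs[:-1], signs[1:]) if a != 0 and b != 0 and a != b
    )
    c_shake = reversals / max(len(recent) - 1, 1)

    # Spinning: completed 360-degree rotations of the cumulative yaw.
    c_spin = abs(recent.sum()) // 360

    return c_shake + alpha * c_spin

# Usage: keep a bounded buffer of yaw deltas during the rollout.
yaw_buffer = deque([15, -15, 15, -15, 30, 30], maxlen=128)
cost = behavioral_cost(yaw_buffer, window=32, alpha=0.5)
```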

Adaptive weighting is central: the penalty applied to the behavioral cost is not static. In the sigmoid-variant ABC-RL, the penalty $\Lambda_t$ is set online via a smooth sigmoid function of the difference between the recent average return $V_{\rm avg}$ and the threshold $V_{\rm th}$:

$$\Lambda_t = W \, \sigma\!\left(\frac{V_{\rm avg} - V_{\rm th}}{h}\right)$$

where $W$ is the maximal penalty and $h$ the "temperature." This ensures that as performance approaches or exceeds the threshold, behavioral constraints are more stringently enforced, but if the agent's performance dips, penalties are relaxed automatically.
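A minimal sketch of this weighting rule, assuming a running average of recent episodic returns is maintained alongside training; the shaped-reward form and variable names are illustrative rather than the paper's exact implementation:

```python
import math

def adaptive_penalty(v_avg, v_th, w_max, h):
    """Lambda_t = W * sigmoid((V_avg - V_th) / h): the penalty grows as the
    recent average return approaches or exceeds the threshold v_th."""
    return w_max / (1.0 + math.exp(-(v_avg - v_th) / h))

def shaped_reward(r_t, c_t, v_avg, v_th, w_max=1.0, h=0.1):
    """Per-step reward handed to the underlying optimizer (e.g., PPO):
    task reward minus the adaptively weighted behavioral cost."""
    return r_t - adaptive_penalty(v_avg, v_th, w_max, h) * c_t
```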

b. ABC Posterior Sampling and Likelihood-Free RL

In likelihood-free Bayesian RL, the acceptance of a simulator model $\mathcal{M}^{(k)}$ is determined by comparing summary statistics $f(h)$ of real and simulated trajectories and accepting if $\|f(h) - f(h^{(k)})\| \le \epsilon$. Posterior sampling is straightforward and implementation-agnostic, admitting modular integration with standard Bayesian RL planning frameworks such as Thompson sampling or rollout policy improvement.

Algorithmically, once a set of acceptable models $\widehat{\mathcal{M}}$ is identified, the agent either samples a model from this set for planning or averages utilities across the set:

$$V_{\beta_\epsilon}(\pi) = \mathbb{E}_{\mathcal{M} \sim \beta_\epsilon} \; \mathbb{E}_{\mathbb{P}_{\mathcal{M}}^\pi}\left[U\right]$$

with $\beta_\epsilon$ the ABC indirect posterior.
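A schematic rejection-ABC loop for constructing the approximate model posterior and scoring a policy against it; `prior_sample`, `simulate`, `summary`, and `utility` are placeholders standing in for $\beta(\theta)$, the black-box simulator $\mathcal{M}_\theta$, the statistic $f$, and the discounted utility $U$:

```python
import numpy as np

def abc_posterior_sample(prior_sample, simulate, summary, observed_history,
                         policy, epsilon, n_candidates=1000):
    """Rejection ABC: keep simulator parameters whose rollouts under `policy`
    yield summary statistics within epsilon of the observed history."""
    f_obs = np.asarray(summary(observed_history))
    accepted = []
    for _ in range(n_candidates):
        theta = prior_sample()           # theta ~ beta(.)
        h_sim = simulate(theta, policy)  # trajectory from M_theta
        if np.linalg.norm(np.asarray(summary(h_sim)) - f_obs) <= epsilon:
            accepted.append(theta)
    return accepted  # empirical approximation of beta_epsilon

def expected_utility(accepted, simulate, policy, utility, n_rollouts=10):
    """Average discounted utility of `policy` across the accepted models."""
    returns = [
        utility(simulate(theta, policy))
        for theta in accepted
        for _ in range(n_rollouts)
    ]
    return float(np.mean(returns)) if returns else float("nan")
```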

3. Experimental Protocol and Metrics

Adaptive Behavioral Cost Experiments

Experiments focus on deep RL in 3D environments:

  • Unity ML-Agents “Banana Collector” and four DMLab-30 tasks (e.g., rooms_keys_doors_puzzle).
  • Policies and value functions implemented as CNN-MLP hybrids; optimization is via PPO.
  • Metrics: episodic returns, per-step shaking/spinning costs, with human performance as baseline.
  • Comparative setups: unconstrained PPO, fixed-penalty (“Const”), dynamic dual update (AB-CPO), and the sigmoid-surrogate ABC-RL.

ABC-RL Bayesian Experiments

Demonstrations use standard continuous-state, discrete-action domains:

  • Mountain Car and Pendulum with unknown parameters and uniform priors.
  • Methods compared: LSPI (direct batch RL) versus ABC-LSPI (LSPI with ABC-sampled models).
  • Sampling protocols: rejection-ABC construction of the approximate posterior with acceptance threshold $\epsilon$, followed by policy improvement/rollouts.
  • Metrics: average returns per episode across $10^3$ episodes for $100$ independent runs.

| Agent | Return | Shaking Cost | Spinning Cost |
|---|---|---|---|
| PPO Baseline | $+8.5$ | $0.26$ | $0.12$ |
| ABC-RL / AB-CPO | $+8.2$ | $<0.05$ | $<0.01$ |
| Const (fixed penalty) | $+6.0$ | $0.02$ | $0.10$ |
| Human | $+6.5$ | $0.05$ | $0.01$ |

Interpretation: ABC-RL approaches human-level behavioral cost at super-human performance levels, in contrast to unconstrained or static-penalty methods (Ho et al., 2023).

4. Results, Analysis, and Theoretical Guarantees

Empirical Findings

  • ABC-RL and AB-CPO maintain $\approx 97\%$ of the unconstrained baseline return while nearly eliminating shaking and spinning (to $<0.05$ and $<0.01$ per step, respectively).
  • Static penalty methods either heavily degrade return or fail to prevent non-naturalistic motions.
  • In DMLab-30, ABC-RL attains $90$–$100\%$ of the baseline return with negligible behavioral costs.
  • Notably, in exploratory games (e.g., rooms_watermaze), transient performance drops show that adaptive reweighting effectively mediates exploration-exploitation trade-offs by temporarily altering permitted behavioral flexibility.
  • Trajectories produced exhibit qualitatively smoother, more consistent motion, aligning more closely with informal human perception criteria.

Theoretical Properties

The ABC-RL Bayesian framework admits formal guarantees: under mild Lipschitz conditions on the log-likelihoods with respect to the statistics $f$, the KL divergence between the true posterior and the ABC posterior is bounded:

$$D_{\mathrm{KL}}\left(\beta(\cdot \mid h) \,\|\, \beta_\epsilon(\cdot \mid h)\right) \leq \ln|\mathcal{A}_\epsilon| + 2L\epsilon$$

As $\epsilon \to 0$, the approximation converges in KL divergence (Dimitrakakis et al., 2013).

5. Limitations and Future Directions

Adaptivity and Robustness

A key advantage of ABC-RL as a behavioral-shaping method is the elimination of hand-crafted, fixed penalties; however, the behavioral cost function $C(t)$ must still be chosen to reflect the desired constraints accurately. The sigmoid-based weighting is empirically more stable than dual-gradient variants, though inherent trade-offs remain when behavioral constraints interact adversarially with task optimality.

For ABC-RL in the Bayesian sense, the expressivity and sufficiency of the chosen summary statistic $f$ determine both statistical efficiency and policy optimality. Naïve rejection-based ABC sampling can be computationally expensive for small acceptance radii $\epsilon$, necessitating future work on more efficient posterior construction (e.g., MCMC-ABC or SMC-ABC).

Research Directions

  • Design of near-sufficient statistics for RL environments, including trajectory-level feature expectations and conditional utilities.
  • Application to high-dimensional domains, partial observability, or multi-agent systems.
  • Exploration of more sophisticated adaptive cost mechanisms incorporating human-in-the-loop feedback or richer perceptual metrics.

6. Significance and Context

Adaptive Behavioral Costs in RL and Approximate Bayesian Computation RL occupy crucial but distinct positions in the development of reinforcement learning methodologies. The former advances the practical deployment of agents in settings where interpretability and alignment with human behavioral priors are critical, enabling near state-of-the-art task proficiency without robotic, unnatural action paths (Ho et al., 2023). The latter offers a pragmatic avenue for likelihood-free, simulation-based Bayesian RL across complex or unmodeled domains, leveraging simulation when direct model likelihoods are intractable (Dimitrakakis et al., 2013).

A plausible implication is that the unifying principle of adaptively regularizing policy learning—either to enforce behavioral norms or to propagate model uncertainty—will continue to underpin advances in both technical performance and the naturalism of RL algorithms.
