Smooth Log-Barrier Penalty

Updated 4 July 2026

The smooth log-barrier penalty is a constraint-handling device that replaces hard inequality constraints with logarithmic terms that become singular at the boundaries.
It is applied across fields such as policy optimization, reinforcement learning, and safe control to maintain a safe margin to the constraints while preserving differentiability.
Extensions like full-domain smoothing and hybrid penalty-barrier designs improve numerical conditioning and expand applicability to stochastic and nonconvex optimization problems.

A smooth log-barrier penalty is a constraint-handling device that replaces a hard inequality constraint with a logarithmic term that is smooth on the strict interior of the feasible region and becomes singular at, or is deliberately regularized near, the constraint boundary. In the literature represented here, the term covers both the classical interior-point construction

$-\mu\sum_i \log c_i(x)$

for constraints $c_i(x)>0$, and several extensions that retain logarithmic barrier geometry while modifying domain behavior, numerical conditioning, or compatibility with stochastic and nonconvex optimization. Across policy optimization, constrained reinforcement learning, nonlinear programming, bound-constrained Newton methods, and safe control, the common objective is to encode “margin to the boundary” into a differentiable surrogate so that optimization is guided by a repulsive force before constraint violation occurs (Zeng et al., 2018, O'Neill et al., 2019, Zhang et al., 2024, Marchi et al., 2024).

1. Classical interior-point structure

The generic inequality-constrained problem used to motivate smooth log-barrier penalties is

$\min_x f(x)\qquad \text{subject to } c_i(x)\ge 0,\quad i\in\mathcal I,$

with strict feasible region

$\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$

The classical logarithmic barrier is

$-\sum_{i\in\mathcal I}\log c_i(x),$

and the associated barrier objective is

$L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$

where $\mu$ is the barrier parameter (Zeng et al., 2018). In the bound-constrained setting $x\ge 0$, the same construction appears as

$\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$

with derivatives

$\nabla \phi_\mu(x)=\nabla f(x)-\mu X^{-1}e,\qquad \nabla^2\phi_\mu(x)=\nabla^2 f(x)+\mu X^{-2}$

(O'Neill et al., 2019).
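The derivative formulas above can be checked numerically. The following is a minimal sketch, assuming a hypothetical quadratic test objective $f(x)=\tfrac12\|x-a\|^2$ (not taken from the cited paper); the helper names are illustrative only.

```python
import numpy as np

def barrier_value(f, x, mu):
    """phi_mu(x) = f(x) - mu * sum(log x_i), defined only for x > 0."""
    if np.any(x <= 0):
        raise ValueError("x must be strictly positive")
    return f(x) - mu * np.sum(np.log(x))

def barrier_grad(grad_f, x, mu):
    """grad phi_mu(x) = grad f(x) - mu * X^{-1} e, with X = diag(x)."""
    return grad_f(x) - mu / x

def barrier_hess(hess_f, x, mu):
    """hess phi_mu(x) = hess f(x) + mu * X^{-2}."""
    return hess_f(x) + mu * np.diag(1.0 / x**2)

# Hypothetical test objective: f(x) = 0.5 * ||x - a||^2
a = np.array([1.0, -2.0, 0.5])
f = lambda x: 0.5 * np.sum((x - a) ** 2)
grad_f = lambda x: x - a
hess_f = lambda x: np.eye(x.size)

x0 = np.array([0.7, 0.2, 1.5])
mu = 0.1
g = barrier_grad(grad_f, x0, mu)
H = barrier_hess(hess_f, x0, mu)

# Central finite-difference approximation of the gradient, for comparison
eps = 1e-6
g_fd = np.array([
    (barrier_value(f, x0 + eps * e, mu) - barrier_value(f, x0 - eps * e, mu)) / (2 * eps)
    for e in np.eye(x0.size)
])
```

The barrier contributes only a diagonal term $\mu/x_i^2$ to the Hessian, which is why its effect is confined to coordinates that are close to their bound.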

The defining properties are consistent across these formulations. The barrier is smooth on the strict interior, tends to $+\infty$ as the boundary is approached in minimization form, and is undefined or effectively infinite outside the feasible interior (Zeng et al., 2018, O'Neill et al., 2019). In maximization form, the sign convention changes but the geometry does not: for PPO-B, the barrier term

$c_i(x)>0$ 1

is defined only when the margin is positive, and tends to $-\infty$ as the boundary is approached from within, which is an infinitely strong deterrent in a maximization objective (Zeng et al., 2018).

A scalar prototype given in the reinforcement-learning literature is

$c_i(x)>0$ 4

with barrier objective

$c_i(x)>0$ 5

This example illustrates the usual interior-point effect: for small $\mu$, the barrier objective resembles the original objective in the middle of the feasible set but rises steeply near the boundary, so the minimizer stays strictly interior and approaches the constrained optimum as $\mu\to 0$ (Zeng et al., 2018).
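The interior-point effect can be reproduced on a one-dimensional toy problem. The sketch below uses a hypothetical example chosen here for illustration, not the prototype from the cited paper: minimize $(x+1)^2$ subject to $x\ge 0$, whose constrained optimum is $x^\star=0$. The barrier objective $(x+1)^2-\mu\log x$ has the closed-form minimizer $x(\mu)=\tfrac12\left(-1+\sqrt{1+2\mu}\right)$, obtained by setting its derivative to zero.

```python
import math

def barrier_minimizer(mu):
    """Minimizer of (x + 1)^2 - mu * log(x) over x > 0 (positive root of 2x^2 + 2x - mu = 0)."""
    return (-1.0 + math.sqrt(1.0 + 2.0 * mu)) / 2.0

def barrier_objective_grad(x, mu):
    """Derivative of the barrier objective, used to confirm stationarity."""
    return 2.0 * (x + 1.0) - mu / x

mus = [1.0, 0.1, 0.01, 0.001]
minimizers = [barrier_minimizer(mu) for mu in mus]
for mu, x in zip(mus, minimizers):
    print(f"mu={mu:g}  x(mu)={x:.6f}  grad={barrier_objective_grad(x, mu):.2e}")
```

Each minimizer is strictly positive, and the sequence decreases toward the boundary point $x^\star=0$ as $\mu$ shrinks.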

2. Smoothness, smoothing, and full-domain extensions

The phrase “smooth log-barrier penalty” does not denote a single formula. Within the cited work it refers to several related constructions that preserve logarithmic barrier behavior while changing where the function is defined and how it behaves near or beyond the boundary.

Variant	Representative formula	Defining feature
Classical interior barrier	$-\mu\sum_i \log c_i(x)$	Smooth on strict interior only
PPO-B barrier surrogate	$c_i(x)>0$ 9	Interior penalty on policy-update margin
Linear smoothed log barrier	$\tilde\psi_t(z)$ with log branch and affine continuation	Continuous and differentiable everywhere
Marginalized penalty-barrier envelope	$\min_x f(x)\qquad \text{subject to } c_i(x)\ge 0,\quad i\in\mathcal I,$ 1	Full-domain smooth barrier-penalty hybrid
Log-sum-exp softmax barrier	$\min_x f(x)\qquad \text{subject to } c_i(x)\ge 0,\quad i\in\mathcal I,$ 2	Smooth approximation of a max barrier

The most explicit everywhere-defined smoothing appears in constrained reinforcement learning. There, the standard barrier

$\psi_t(z)=-\frac{1}{t}\log(-z),\qquad z<0,$

is replaced by the “linear smoothed log barrier function”

$\tilde\psi_t(z)=\begin{cases}-\frac{1}{t}\log(-z), & z\le -\frac{1}{t^2},\\ t\,z-\frac{1}{t}\log\!\left(\frac{1}{t^2}\right)+\frac{1}{t}, & \text{otherwise},\end{cases}$

with smoothing threshold $z=-1/t^2$. This removes the singularity at $z=0$, extends the function to infeasible points $z>0$, and preserves continuity and differentiability at the stitching point because the derivative of the log branch equals $t$ there (Zhang et al., 2024).
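A minimal sketch of such a stitched barrier follows. It assumes the commonly used parameterization for a constraint written as $z\le 0$: the log branch $-\tfrac1t\log(-z)$ for $z\le -1/t^2$, and the affine continuation $tz-\tfrac1t\log(1/t^2)+\tfrac1t$ otherwise. The function names are illustrative, and the details may differ from the cited implementation.

```python
import numpy as np

def smoothed_log_barrier(z, t):
    """Linear smoothed log barrier for a constraint written as z <= 0.

    Log branch  : -(1/t) * log(-z)                 for z <= -1/t**2
    Affine part : t*z - (1/t)*log(1/t**2) + 1/t    otherwise
    """
    z = np.asarray(z, dtype=float)
    threshold = -1.0 / t**2
    # Clamp the log argument so the unused branch never sees a non-positive value.
    safe = np.minimum(z, threshold)
    log_branch = -np.log(-safe) / t
    affine = t * z - np.log(1.0 / t**2) / t + 1.0 / t
    return np.where(z <= threshold, log_branch, affine)

def smoothed_log_barrier_grad(z, t):
    """Derivative: -1/(t*z) on the log branch, constant slope t on the affine part."""
    z = np.asarray(z, dtype=float)
    threshold = -1.0 / t**2
    safe = np.minimum(z, threshold)
    return np.where(z <= threshold, -1.0 / (t * safe), t)
```

Because the affine part has constant slope $t$, the gradient stays bounded for infeasible inputs, which is what allows plain stochastic gradient updates to recover from a violation.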

A different full-domain construction appears in nonconvex constrained optimization through slack-variable marginalization. Starting from a barrier $\min_x f(x)\qquad \text{subject to } c_i(x)\ge 0,\quad i\in\mathcal I,$ 9 defined on $\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$ 0, the scalar inequality envelope is

$\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$ 1

and admits the closed form

$\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$ 2

Near the boundary it coincides with the underlying barrier, but beyond that region it continues linearly with slope $\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$ 3, yielding a globally defined, Lipschitz differentiable penalty-barrier envelope. For equality constraints, the companion construction is

$\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$ 4

(Marchi et al., 2024).

In safe stabilization, smoothing serves yet another purpose. There the nonsmooth maximum barrier

$\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$ 5

is replaced by the log-sum-exp relaxation

$\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$ 6

with bounds

$\mathcal F^0=\{x\in\mathbb R^n\mid c_i(x)>0\ \text{for all } i\in\mathcal I\}.$ 7

This is not a classical interior-point barrier; it is a smooth approximation of a max-type safety certificate. A plausible implication is that “smooth log-barrier penalty” is best treated as a family resemblance term: some variants smooth the singularity of $-\log c_i(x)$, while others use logarithms to smooth max constraints rather than boundary singularities (Liu et al., 2 Oct 2025).
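The standard log-sum-exp bounds can be illustrated numerically. The sketch below assumes the generic form $\tfrac{1}{\kappa}\log\sum_{i=1}^N e^{\kappa h_i}$ with a sharpness parameter $\kappa>0$; the symbols $h_i$ and $\kappa$ and the sample values are assumptions made here, and the cited paper's notation and sign conventions may differ. For this form, the smooth value lies between $\max_i h_i$ and $\max_i h_i+\log(N)/\kappa$.

```python
import numpy as np

def smooth_max(h, kappa):
    """(1/kappa) * log(sum_i exp(kappa * h_i)), computed with the max subtracted for stability."""
    h = np.asarray(h, dtype=float)
    m = np.max(h)
    return m + np.log(np.sum(np.exp(kappa * (h - m)))) / kappa

h = np.array([-0.3, 0.1, 0.05])   # hypothetical barrier values
for kappa in (1.0, 10.0, 100.0):
    s = smooth_max(h, kappa)
    gap = s - np.max(h)
    print(f"kappa={kappa:g}  smooth_max={s:.6f}  gap={gap:.6f}  bound={np.log(h.size) / kappa:.6f}")
```

Increasing $\kappa$ tightens the approximation at the cost of sharper curvature near points where two of the $h_i$ are nearly equal.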

3. Barrier geometry in constrained optimization algorithms

In large-scale bound-constrained optimization, the classical barrier is often retained without modification, while smoothness is recovered through scaling and structured linear algebra rather than by altering the barrier itself. For

$\min_x f(x)\qquad \text{subject to } x\ge 0,$

the log-barrier Newton-CG method fixes

$-\sum_{i\in\mathcal I}\log c_i(x),$ 0

and uses the scaled identity

$X\,\nabla^2\phi_\mu(x)\,X = X\,\nabla^2 f(x)\,X+\mu I,\qquad X=\operatorname{diag}(x).$

This eliminates the blow-up of the barrier Hessian in $X$-scaled coordinates and enables Newton-CG iterations using only Hessian-vector products, not explicit Hessian formation (O'Neill et al., 2019). The method obtains approximate first- and second-order KKT conditions in

$-\sum_{i\in\mathcal I}\log c_i(x),$ 3

iterations, with total operation complexity summarized as

$-\sum_{i\in\mathcal I}\log c_i(x),$ 4

for large $-\sum_{i\in\mathcal I}\log c_i(x),$ 5, or

$-\sum_{i\in\mathcal I}\log c_i(x),$ 6

for smaller $-\sum_{i\in\mathcal I}\log c_i(x),$ 7 (O'Neill et al., 2019).
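The matrix-free use of the scaled Hessian can be sketched as follows. The example relies only on the derivative formula $\nabla^2\phi_\mu(x)=\nabla^2 f(x)+\mu X^{-2}$ given earlier, from which $X\nabla^2\phi_\mu(x)X\,v = X\,\nabla^2 f(x)(Xv)+\mu v$ follows. The quadratic objective and the function name `scaled_hess_vec` are assumptions for illustration, not the cited algorithm itself.

```python
import numpy as np

def scaled_hess_vec(hess_vec_f, x, mu, v):
    """Return (X hess_phi_mu(x) X) v using only a Hessian-vector product of f, X = diag(x)."""
    return x * hess_vec_f(x * v) + mu * v

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T                            # hypothetical symmetric Hessian of f
x = np.array([1.0, 0.5, 1e-8, 2.0])    # third coordinate sits very close to its bound
mu = 0.1
v = rng.standard_normal(n)

Hv_scaled = scaled_hess_vec(lambda w: A @ w, x, mu, v)

# Explicit reference: form the (badly scaled) barrier Hessian, then rescale.
X = np.diag(x)
Hv_explicit = X @ (A + mu * np.diag(1.0 / x**2)) @ X @ v
```

The unscaled barrier Hessian has a diagonal entry of order $10^{15}$ at the near-active coordinate, whereas the scaled operator only ever adds $\mu v$, so a conjugate-gradient solver applied to the scaled system never sees that entry.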

For box-constrained optimal control, the barrier is embedded directly into the stage and terminal costs of iLQR. The running barrier is

$-\sum_{i\in\mathcal I}\log c_i(x),$ 8

and the terminal barrier has the analogous state-only form (Abhijeet et al., 4 Feb 2026). The derivative structure is diagonal in constrained coordinates: $-\sum_{i\in\mathcal I}\log c_i(x),$ 9 with the corresponding state Hessian defined similarly (Abhijeet et al., 4 Feb 2026). Because these second derivatives are strictly positive in the interior, the barrier contributes positive curvature to $L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 0 and $L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 1, which the paper interprets as intrinsic regularization of the iLQR backward pass (Abhijeet et al., 4 Feb 2026).
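The diagonal curvature and its regularizing effect can be sketched on a small example. The code assumes a generic box barrier $-\mu\sum_i[\log(u_i-\underline u_i)+\log(\bar u_i-u_i)]$ and the textbook iLQR gain $K=-(Q_{uu}+B)^{-1}Q_{ux}$ with $B$ the diagonal barrier Hessian; the exact barrier used in the cited paper is not reproduced here, and all numerical values are hypothetical.

```python
import numpy as np

def box_barrier_derivs(u, lo, hi, mu):
    """Gradient and diagonal Hessian of -mu * sum(log(u - lo) + log(hi - u))."""
    if np.any(u <= lo) or np.any(u >= hi):
        raise ValueError("u must lie strictly inside the box")
    grad = -mu * (1.0 / (u - lo) - 1.0 / (hi - u))
    hess_diag = mu * (1.0 / (u - lo) ** 2 + 1.0 / (hi - u) ** 2)
    return grad, hess_diag

def feedback_gain(Q_uu, Q_ux, hess_diag):
    """Textbook iLQR gain with the barrier curvature added to Q_uu."""
    return -np.linalg.solve(Q_uu + np.diag(hess_diag), Q_ux)

lo, hi, mu = np.array([-1.0, -1.0]), np.array([1.0, 1.0]), 0.1
Q_uu = np.array([[2.0, 0.3], [0.3, 1.0]])              # hypothetical values
Q_ux = np.array([[1.0, 0.5, 0.0], [0.2, 1.0, 0.4]])    # hypothetical values

_, h_mid = box_barrier_derivs(np.array([0.0, 0.0]), lo, hi, mu)
_, h_sat = box_barrier_derivs(np.array([0.999999, 0.0]), lo, hi, mu)
K_mid = feedback_gain(Q_uu, Q_ux, h_mid)
K_sat = feedback_gain(Q_uu, Q_ux, h_sat)
```

With the first control channel close to its upper bound, the first row of `K_sat` is many orders of magnitude smaller than the first row of `K_mid`, while the second row remains of ordinary size.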

A hybrid penalty-barrier architecture appears in nonlinear programming with equality constraints and box bounds. There the merit function

$L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 2

combines a quadratic penalty for equalities with a logarithmic barrier for simple bounds (Neuenhofen, 2018). The equality penalty is handled through a modified augmented Lagrangian technique, while the barrier part is handled by a primal-dual interior-point path-following technique. This suggests a broader pattern: smooth log-barrier penalties are often most effective when paired with algorithmic machinery that explicitly addresses barrier-induced ill-conditioning rather than assuming the logarithmic term alone makes the problem numerically benign (Neuenhofen, 2018).

4. Reinforcement learning and policy optimization

In policy optimization, the barrier viewpoint is used to reinterpret trust-region methods. PPO-B starts from the TRPO-style constrained objective

$L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 3

subject to

$L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 4

and argues that PPO’s KL-penalized version is an exterior penalty relaxation whose minimizers may remain infeasible until the penalty parameter becomes very large (Zeng et al., 2018). The proposed interior analogue is

$L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 5

The practical surrogate replaces KL by

$L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 6

and optimizes

$L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 7

With fixed $L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 8 and $L(x;\mu)=f(x)-\mu\sum_{i\in\mathcal I}\log c_i(x),$ 9, optimized “in the framework of A2C” with SGD, PPO-B is reported to outperform PPO on 34 of 49 Atari games under $\mu$ 0, and on 5 of 7 MuJoCo/PyBullet tasks under the same metric (Zeng et al., 2018).
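The contrast between an exterior KL penalty and an interior barrier on a trust-region margin can be sketched generically. The code below is an illustration under assumptions made here (categorical policies, a margin $\delta-\mathrm{KL}$, and a barrier weight `mu`); it is not PPO-B's practical surrogate, whose exact form is given by the cited paper.

```python
import numpy as np

def kl_categorical(p, q):
    """KL(p || q) for strictly positive categorical distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def exterior_penalty_objective(surrogate, kl, beta):
    """KL-penalized objective: finite for any KL, including KL > delta."""
    return surrogate - beta * kl

def interior_barrier_objective(surrogate, kl, delta, mu):
    """Log barrier on the margin delta - KL; -inf once the margin is not positive."""
    margin = delta - kl
    if margin <= 0.0:
        return -np.inf
    return surrogate + mu * np.log(margin)
```

In a maximization problem, the penalized objective still assigns a finite score to an update that leaves the trust region, whereas the barrier objective rejects it outright.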

Constrained RL extends the barrier idea from trust regions to CMDPs. CSAC-LB defines the standard constrained problem over policies $\mu$ 1, but replaces dual updates with a smoothed log-barrier term applied to a learned safety critic (Zhang et al., 2024). The actor objective is

$\mu$ 2

where the barrier input is transformed by

$\mu$ 3

and the barrier factor is fixed at $\mu$ 4 in experiments (Zhang et al., 2024). The paper states that CSAC-LB achieves state-of-the-art performance on several constrained control tasks, remains stable where SAC-Lag can degrade later in training, and is the only compared method that transfers successfully zero-shot on a real quadruped platform (Zhang et al., 2024).

A third RL use of logarithmic barriers is explicit exploration control. Log-Barrier Stochastic Gradient Bandit regularizes the softmax policy with

$\mu$ 5

so that

$\mu$ 6

The added term is a deterministic anti-collapse force that keeps action probabilities away from zero and preserves Fisher non-degeneracy (Cesani et al., 16 Mar 2026). The paper proves that LB-SGB matches the $\mu$ 7 sample complexity of standard SGB under the same favorable exploration assumption, and also establishes a worst-case complexity

$\mu$ 8

without that assumption, albeit with a slower rate (Cesani et al., 16 Mar 2026).
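The anti-collapse effect can be seen from the gradient of a log-barrier regularizer on a softmax policy. The sketch assumes the generic regularizer $\eta\sum_a\log\pi_\theta(a)$ over $K$ actions, whose gradient with respect to the logits is $\eta\,(1-K\pi_\theta)$; the weight `eta` and this exact scaling are assumptions and may differ from the cited method.

```python
import numpy as np

def softmax(theta):
    z = theta - np.max(theta)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def log_barrier_reg(theta, eta):
    """eta * sum_a log pi_theta(a) for a softmax policy."""
    return eta * np.sum(np.log(softmax(theta)))

def log_barrier_reg_grad(theta, eta):
    """Gradient with respect to the logits: eta * (1 - K * pi)."""
    pi = softmax(theta)
    return eta * (1.0 - theta.size * pi)
```

Under gradient ascent, actions with probability below $1/K$ receive a positive push and actions above $1/K$ a negative one, so the regularizer pulls the policy away from the boundary of the simplex.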

5. Safe learning, black-box optimization, and control

For safe black-box optimization, the barrier is used not merely as a surrogate objective but as a mechanism for keeping every experiment feasible. One line of work uses the classical barrier

$\mu$ 9

for unknown smooth constraints observed through noisy zeroth-order or first-order oracles (Usmanova et al., 2019, Usmanova et al., 2022). In this setting, the difficulty is that the barrier is only locally smooth inside the feasible region and becomes increasingly ill-conditioned near the boundary. To address this, both papers derive adaptive step sizes from estimates of the current margin to the boundary and local smoothness constants. In the stochastic zeroth-order setting of s0-LBM, the step size must satisfy a condition implying

$x\ge 0$ 0

which yields feasible iterates with high probability (Usmanova et al., 2019). In LB-SGD, the adaptive step is chosen as

$x\ge 0$ 1

and the paper provides nonconvex, convex, and strongly convex convergence guarantees together with first-order and zeroth-order sample-complexity bounds (Usmanova et al., 2022).
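The idea of tying the step size to the current margin can be illustrated in a simplified deterministic setting. The sketch below assumes known linear constraints $b-Ax>0$ and caps the step so that no margin can shrink by more than half in one update. This is a stand-in chosen for illustration, not the step rules of s0-LBM or LB-SGD, which work with noisy oracles and estimated smoothness constants.

```python
import numpy as np

def safe_barrier_descent(grad_f, A, b, x0, mu, base_step, iters):
    """Gradient descent on f(x) - mu * sum(log(b - A x)) with a margin-based step cap."""
    x = np.asarray(x0, dtype=float)
    margins = []
    for _ in range(iters):
        slack = b - A @ x                       # margins c_i(x) = b_i - a_i^T x > 0
        g = grad_f(x) + mu * A.T @ (1.0 / slack)
        rate = np.abs(A @ g)                    # worst-case margin loss per unit step
        cap = np.min(slack / (2.0 * rate + 1e-12))
        x = x - min(base_step, cap) * g         # each margin keeps at least half its value
        margins.append(np.min(b - A @ x))
    return x, margins

# Hypothetical problem: pull x toward an infeasible target under x_i < 1.
A, b = np.eye(2), np.ones(2)
target = np.array([2.0, 2.0])
grad_f = lambda x: x - target
x_final, margins = safe_barrier_descent(grad_f, A, b, np.zeros(2), mu=0.01, base_step=1.0, iters=100)
```

With the cap removed, the very first step from the origin would land at roughly $x_i=1.99$, outside the feasible set; with the cap, every recorded margin stays positive. The sketch makes no claim about convergence speed.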

A different safe-learning construction uses a mixed penalty–logarithmic barrier merit function in derivative-free direct search. There the inequality constraints are split into two groups based on the initial point: $x\ge 0$ 2 and the merit function is

$x\ge 0$ 3

with $x\ge 0$ 4 and $x\ge 0$ 5 in practice (Brilli et al., 2024). This keeps a subset of inequalities strictly interior while allowing temporary violation of the remaining inequalities and all equalities. The paper proves convergence to stationary points under standard assumptions and reports strong performance on CUTEst problems relative to SID-PSM, LOG-DFL, NOMAD, and extreme-barrier direct search (Brilli et al., 2024).
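The split into a barrier group and a penalty group can be sketched as follows. The code assumes constraints written as $c_i(x)\ge 0$, a log barrier with weight `mu` on the group that is strictly feasible at the initial point, and a squared-violation penalty with weight `rho` on the remaining inequalities and on equalities. The penalty form and both parameter names are hypothetical and may differ from the cited merit function.

```python
import math

def split_constraints(constraints, x0):
    """Indices strictly satisfied at x0 (barrier group) versus the rest (penalty group)."""
    interior = [i for i, c in enumerate(constraints) if c(x0) > 0.0]
    exterior = [i for i, c in enumerate(constraints) if c(x0) <= 0.0]
    return interior, exterior

def mixed_merit(f, constraints, equalities, interior, exterior, x, mu, rho):
    """f - mu * sum(log c_i) on the barrier group + rho * squared violations elsewhere."""
    if any(constraints[i](x) <= 0.0 for i in interior):
        return math.inf                         # barrier group must stay strictly feasible
    barrier = -mu * sum(math.log(constraints[i](x)) for i in interior)
    penalty = rho * sum(min(0.0, constraints[i](x)) ** 2 for i in exterior)
    penalty += rho * sum(h(x) ** 2 for h in equalities)
    return f(x) + barrier + penalty
```

For instance, with $f(x)=x^2$, constraints $x+1\ge 0$ and $x-2\ge 0$, and initial point $x_0=0$, the first constraint falls in the barrier group and the second in the penalty group, so a direct-search method may evaluate points that violate the second constraint but never the first.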

In nonlinear safe control, logarithmic barriers on box constraints are embedded into trajectory optimization rather than pointwise stochastic search. The iLQR construction already noted produces control channels whose feedback gains diminish near saturation: if a control component approaches a bound, the corresponding barrier Hessian entry tends to $+\infty$, so the relevant row of

$x\ge 0$ 7

tends to zero (Abhijeet et al., 4 Feb 2026). In barrier-certified control synthesis, a log-sum-exp smoothing of max-type safety constraints is combined with an explicit bump-function patching

$x\ge 0$ 8

to obtain a single $x\ge 0$ 9 control Lyapunov-barrier function with

$\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 0

under strict compatibility assumptions (Liu et al., 2 Oct 2025).

6. Exactness, approximation, and recurring limitations

A recurring issue in the literature is the distinction between pure interior barriers and smooth full-domain penalty-barrier hybrids. Pure log barriers preserve strict feasibility but require a strictly feasible starting point and are undefined outside the interior (Zeng et al., 2018, O'Neill et al., 2019, Usmanova et al., 2022). This is why several later methods smooth or hybridize them.

One route is direct smoothing of the logarithm, as in the linear smoothed log barrier, which is “continuous and differentiable everywhere” and can therefore be optimized by SGD even when stochastic updates or initial policies are infeasible (Zhang et al., 2024). Another route is hybridization with exact penalties. In the penalty-barrier framework for nonconvex constrained optimization, the reduced smooth objective

$\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 2

approaches the exact $\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 3 penalty because

$\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 4

as $\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 5 (Marchi et al., 2024). The paper proves that, in the convex setting, if an optimal KKT triplet exists, accumulation points solve the original problem and $\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 6 is eventually constant when $\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 7 is bounded (Marchi et al., 2024).

A distinct but related construction in stochastic constrained machine learning uses a glued quadratic–logarithmic penalty-barrier function

$\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 8

inside a stochastic penalty-barrier Lagrangian and then smooths the whole objective by a Moreau-envelope/proximal term rather than smoothing the logarithm itself (Bosák et al., 18 May 2026). The method adds only linear runtime overhead compared with unconstrained Adam for up to $\phi_\mu(x)=f(x)-\mu\sum_{i=1}^n \log(x_i),\qquad x>0,$ 9 constraints and reports, for example, epoch times of $\nabla \phi_\mu(x)=\nabla f(x)-\mu X^{-1}e,\qquad \nabla^2\phi_\mu(x)=\nabla^2 f(x)+\mu X^{-2}$ 0 s for Adam versus $\nabla \phi_\mu(x)=\nabla f(x)-\mu X^{-1}e,\qquad \nabla^2\phi_\mu(x)=\nabla^2 f(x)+\mu X^{-2}$ 1 s for SPBM on CIFAR-10 with $\nabla \phi_\mu(x)=\nabla f(x)-\mu X^{-1}e,\qquad \nabla^2\phi_\mu(x)=\nabla^2 f(x)+\mu X^{-2}$ 2, and $\nabla \phi_\mu(x)=\nabla f(x)-\mu X^{-1}e,\qquad \nabla^2\phi_\mu(x)=\nabla^2 f(x)+\mu X^{-2}$ 3 s versus $\nabla \phi_\mu(x)=\nabla f(x)-\mu X^{-1}e,\qquad \nabla^2\phi_\mu(x)=\nabla^2 f(x)+\mu X^{-2}$ 4 s on CIFAR-100 with $\nabla \phi_\mu(x)=\nabla f(x)-\mu X^{-1}e,\qquad \nabla^2\phi_\mu(x)=\nabla^2 f(x)+\mu X^{-2}$ 5 (Bosák et al., 18 May 2026). The paper explicitly notes, however, that convergence guarantees remain open in the fully stochastic non-convex non-smooth setting (Bosák et al., 18 May 2026).

Several limitations recur across otherwise different formulations. Strict feasibility requirements remain central for classical interior barriers (Zeng et al., 2018, O'Neill et al., 2019, Abhijeet et al., 4 Feb 2026, Usmanova et al., 2022). Conditioning worsens near the boundary because first derivatives scale reciprocally with slack and second derivatives scale with inverse-square slack (O'Neill et al., 2019, Abhijeet et al., 4 Feb 2026, Usmanova et al., 2022). Barrier parameters are delicate: a smaller barrier parameter $\mu$ or a larger barrier factor improves approximation but sharpens curvature and can make optimization harder (Zeng et al., 2018, Zhang et al., 2024, Liu et al., 2 Oct 2025). Theoretical guarantees are often local or asymptotic rather than global in nonconvex deep-learning settings; several papers use classical convex barrier theory mainly as motivation rather than as a direct theorem for neural-network optimization (Zeng et al., 2018, Zhang et al., 2024).

A common misconception is that all smooth log-barrier penalties are globally smooth finite functions. The surveyed literature does not support that claim. Classical logarithmic barriers are smooth only on the strict interior and singular at the boundary (Zeng et al., 2018, O'Neill et al., 2019). Everywhere-defined variants exist, but they arise only after explicit smoothing, affine continuation, slack-variable marginalization, or hybrid penalty-barrier reformulation (Zhang et al., 2024, Marchi et al., 2024, Bosák et al., 18 May 2026). A second misconception is that barrier methods merely “penalize violations.” Interior log barriers do not do that; they make the objective unusable at the boundary and beyond, which is structurally different from exterior penalties (Zeng et al., 2018).

Taken together, these works show that the smooth log-barrier penalty is best understood not as a single formula but as a design principle: encode constraint margin through a logarithmic geometry, preserve differentiability where optimization operates, and, when necessary, modify the raw barrier so that stochastic, nonconvex, or safety-critical algorithms can still exploit interior-point behavior (Zeng et al., 2018, Zhang et al., 2024, Marchi et al., 2024, Usmanova et al., 2022).