FLAC: Field Least-Energy Actor–Critic

Updated 4 July 2026

FLAC is a field-theoretic actor–critic framework that leverages path-space formulations and kinetic energy minimization to guide continuous control policies.
It recovers the maximum-entropy principle by using a Generalized Schrödinger Bridge objective, bypassing the need for explicit action log-density evaluation.
Empirical results on DMControl and HumanoidBench tasks demonstrate that FLAC’s energy regularization and automatic dual tuning can outperform traditional methods.

Field Least-Energy Actor-Critic (FLAC) is an off-policy actor–critic framework for continuous control in which the actor is an iterative generative policy parameterized by a velocity field and policy stochasticity is regulated by penalizing the kinetic energy of that field rather than by explicitly computing action log-densities. In the named formulation, FLAC casts policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process, so that the maximum-entropy principle is realized as remaining close to a high-entropy reference while optimizing return, without requiring explicit density estimation (Lv et al., 13 Feb 2026). The surrounding literature uses related field-based constructions in broader senses, including potential-field-guided actor–critic updates, continuous-time control-field gradient flows, mean-field actor–critic on Wasserstein space, and critic-as-field regularization; taken together, these works situate FLAC within a wider family of actor–critic methods that treat guidance, smoothness, or exploration as properties of a field or energy landscape (Ren, 2020, Lee et al., 30 Jan 2026, Zhou et al., 2024, Frikha et al., 2023).

1. Conceptual scope and uses of “field” and “least energy”

Within the literature considered here, “field” refers to several mathematically distinct objects: a velocity field over latent action trajectories, an artificial potential field over state space, a control field in continuous time, a value function defined on state and population distributions, or the action-gradient field induced by a critic. Correspondingly, “least energy” may refer to minimizing kinetic energy along a generative path, descending a potential landscape, penalizing control effort such as $\frac12 \|u\|^2$ or $a^\top N a$, or suppressing volatility in the critic’s mixed derivatives while preserving action-space curvature (Lv et al., 13 Feb 2026, Ren, 2020, Zhou et al., 2024, Frikha et al., 2023, Lee et al., 30 Jan 2026).

Source	Field object	Least-energy mechanism
FLAC (Lv et al., 13 Feb 2026)	Velocity field $u_\theta(s,\tau,x)$	Kinetic energy $\mathbb{E}\!\left[\int_0^1 \tfrac12\|u_\theta\|^2\,d\tau\right]$
Potential-field-guided actor–critic (Ren, 2020)	Potential field $U(s)$ and force $f=-\nabla U(s)$	Alignment term $q_{PF}(s,a)=-U(s)[1-\cos(\chi)]$
Actor-critic flow (Zhou et al., 2024)	Control field $u^\tau(t,x)$	Running cost can include $\frac12 \|u\|^2$
Mean-field actor–critic (Frikha et al., 2023)	Policy and value on $(t,x,\mu)$	Running cost can include $a^\top N a$
PAVE (Lee et al., 30 Jan 2026)	Critic scalar field $Q(s,a)$	Regularizes Q-gradient volatility while preserving curvature

This plurality is important because FLAC, in the narrow sense, denotes the 2026 likelihood-free maximum-entropy algorithm. At the same time, the earlier and adjacent works provide a technical vocabulary for understanding why a field-centered actor–critic can be interpreted as least-energy control: the policy may be driven toward low-energy trajectories in path space, toward low-potential states in configuration space, or toward smooth low-volatility critic geometries in state–action space. A plausible implication is that FLAC is best viewed not as an isolated algorithmic label but as one member of a broader field-theoretic actor–critic lineage.

2. Path-space formulation of FLAC

The central FLAC construction starts from the observation that iterative generative policies such as diffusion models and flow matching are expressive for continuous control but make standard Maximum Entropy RL difficult because the action log-density $\log \pi(a\mid s)$ is not directly accessible. FLAC addresses this by lifting policy optimization from action space to path space. The actor is a velocity field $u_\theta(s,\tau,x)$ that transports latent noise $X_0$ to a terminal action $a = X_1$ through the SDE or ODE

$$dX_\tau = u_\theta(s,\tau,X_\tau)\,d\tau + \sigma\,dW_\tau, \qquad \tau\in[0,1],$$

where $\sigma>0$ gives the SDE and $\sigma=0$ the ODE.

The maximum-entropy principle is then recovered through a GSB objective defined relative to a high-entropy reference path measure $\mathbb{P}^{\mathrm{ref}}$, typically Brownian motion from a high-entropy prior (Lv et al., 13 Feb 2026).
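As a rough illustration of the path-space construction, the sketch below integrates a velocity field over $\tau\in[0,1]$ and accumulates a discretized kinetic energy along the path. The function names, the constant toy field, and the use of plain Euler–Maruyama stepping (rather than the solver used in the paper) are assumptions made for illustration only.

```python
import math
import random

def rollout(velocity, state, x0, sigma=0.0, n_steps=8, rng=None):
    """Euler(-Maruyama) integration of dX = u(s, tau, X) dtau + sigma dW on [0, 1].

    Returns the terminal action X_1 and the discretized kinetic energy
    sum_k 0.5 * ||u_k||^2 * dt along this single path.
    """
    rng = rng or random.Random(0)
    dt = 1.0 / n_steps
    x = list(x0)
    energy = 0.0
    for k in range(n_steps):
        tau = k * dt
        u = velocity(state, tau, x)
        energy += 0.5 * sum(ui * ui for ui in u) * dt
        # sigma = 0 recovers the deterministic (ODE) regime.
        noise = [sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0) for _ in x]
        x = [xi + ui * dt + ni for xi, ui, ni in zip(x, u, noise)]
    return x, energy

def toy_velocity(state, tau, x):
    # Hypothetical field: a constant drift equal to the state vector, so the
    # ODE moves x0 = 0 exactly onto `state` at tau = 1.
    return list(state)

action, energy = rollout(toy_velocity, state=[1.0, -2.0], x0=[0.0, 0.0])
noisy_action, noisy_energy = rollout(
    toy_velocity, state=[1.0, -2.0], x0=[0.0, 0.0], sigma=0.5
)
```

In this toy case the energy depends only on the drift, so the deterministic and stochastic rollouts report the same kinetic energy while their terminal actions differ.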

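The GSB objective has the form

$$\min_{\mathbb{P}} \; \mathbb{E}_{\mathbb{P}}\big[g(s,X_1)\big] + D_{\mathrm{KL}}\big(\mathbb{P}\,\|\,\mathbb{P}^{\mathrm{ref}}\big)$$

subject to the initial-marginal constraint $X_0\sim\mu_0$, where $\mathbb{P}^{\mathrm{ref}}$ is the reference path measure and $\mu_0$ the latent prior. In the one-ended GSB used for RL, the terminal marginal is not fixed; instead, a soft terminal potential $g$ encodes task preference. The optimal terminal marginal satisfies

$$\pi^*(a\mid s) \;\propto\; p^{\mathrm{ref}}_1(a)\,\exp\big(-g(s,a)\big),$$

where $p^{\mathrm{ref}}_1$ is the terminal marginal of the reference process. When $p^{\mathrm{ref}}_1$ is approximately uniform over the bounded action domain and $g(s,a) = -Q(s,a)/\alpha$, this becomes a Boltzmann-type policy

$$\pi^*(a\mid s) \;\propto\; \exp\big(Q(s,a)/\alpha\big),$$

which matches the conceptual role of maximum-entropy policies without requiring explicit evaluation of $\log \pi(a\mid s)$ (Lv et al., 13 Feb 2026).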
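The idea of tilting a reference distribution by an exponentiated critic can be illustrated on a finite action grid. The sketch below is only a discrete analogy: the grid, the function name `tilted_policy`, and the temperature symbol `alpha` are assumptions introduced here, and FLAC itself never evaluates such densities explicitly.

```python
import math

def tilted_policy(ref_probs, q_values, alpha):
    """Return probabilities proportional to ref_probs[i] * exp(q_values[i] / alpha)."""
    logits = [math.log(p) + q / alpha for p, q in zip(ref_probs, q_values)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# With a uniform reference, the tilt reduces to a softmax over Q / alpha.
uniform = [0.25, 0.25, 0.25, 0.25]
policy = tilted_policy(uniform, q_values=[0.0, 1.0, 2.0, 3.0], alpha=1.0)
```

With a uniform reference the result is a Boltzmann distribution over the grid. A larger `alpha` pulls the result back toward the reference.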

This formulation changes the regularization target. Instead of directly maximizing action entropy, FLAC penalizes deviation of the controlled path measure from a high-entropy reference process. The resulting least-energy principle is therefore pathwise rather than distributional at the action level. A plausible implication is that FLAC replaces “entropy of the terminal policy” by “energetic proximity of the full generative dynamics to a diffuse reference,” which is the precise sense in which it is both maximum-entropy and likelihood-free.

3. Kinetic energy, Bellman structure, and policy iteration

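The kinetic energy of the policy’s generative dynamics at state $s$ is defined as

$$\mathcal{E}_\theta(s) = \mathbb{E}\!\left[\int_0^1 \tfrac12\|u_\theta(s,\tau,X_\tau)\|^2\,d\tau\right].$$

This is the key least-energy functional in FLAC. In the stochastic regime $\sigma>0$, the path-space KL divergence between the controlled process $\mathbb{P}^\theta$ and the driftless reference $\mathbb{P}^{\mathrm{ref}}$ (same initial law and noise scale $\sigma$) satisfies

$$D_{\mathrm{KL}}\big(\mathbb{P}^\theta\,\|\,\mathbb{P}^{\mathrm{ref}}\big) = \frac{1}{\sigma^2}\,\mathcal{E}_\theta(s),$$

and, by the data processing inequality, the terminal action distribution $\pi_\theta(\cdot\mid s)$ and the reference terminal marginal $p^{\mathrm{ref}}_1$ satisfy

$$D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid s)\,\|\,p^{\mathrm{ref}}_1\big) \;\le\; \frac{1}{\sigma^2}\,\mathcal{E}_\theta(s).$$

In the deterministic regime $\sigma=0$, the Benamou–Brenier formulation yields

$$\tfrac12\,W_2^2\big(\mu_0,\,\pi_\theta(\cdot\mid s)\big) \;\le\; \mathcal{E}_\theta(s),$$

where $\mu_0$ is the law of the latent noise $X_0$ and $W_2$ is the 2-Wasserstein distance.

These identities justify kinetic energy as a computable proxy for divergence from a high-entropy reference in both SDE and ODE settings (Lv et al., 13 Feb 2026).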
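A small numerical check can make the relation between kinetic energy, path-space KL, and terminal KL concrete. The sketch below assumes a scalar, open-loop, piecewise-constant drift started from a fixed point, so that the terminal laws of the controlled and reference processes are Gaussians with equal variance. The function name and this restricted setting are illustrative assumptions, not the paper's setup.

```python
def path_and_terminal_kl(drifts, sigma):
    """Kinetic energy, path-space KL, and terminal KL for dX = u(tau) dtau + sigma dW.

    `drifts` holds the constant drift value on each equal sub-interval of [0, 1].
    The reference process is the same SDE with zero drift.
    """
    dt = 1.0 / len(drifts)
    energy = sum(0.5 * u * u * dt for u in drifts)
    # Girsanov: KL(P^u || P^ref) = E[ int |u|^2 / (2 sigma^2) dtau ] = energy / sigma^2.
    path_kl = energy / sigma**2
    # Terminal laws are N(x0 + shift, sigma^2) and N(x0, sigma^2).
    shift = sum(u * dt for u in drifts)
    terminal_kl = shift**2 / (2.0 * sigma**2)
    return energy, path_kl, terminal_kl

# Constant drift: the data processing bound is tight.
e1, p1, t1 = path_and_terminal_kl([2.0, 2.0], sigma=1.0)
# Drift that reverses halfway: same energy, but the terminal laws coincide.
e2, p2, t2 = path_and_terminal_kl([2.0, -2.0], sigma=1.0)
```

The second case shows why the terminal KL is only bounded by the energy term: a path can spend energy without moving the terminal distribution away from the reference.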

The critic is defined through an energy-regularized Bellman operator

$$(\mathcal{T}^{\pi_\theta} Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\Big[\mathbb{E}_{a'\sim\pi_\theta(\cdot\mid s')}\big[Q(s',a')\big] - \alpha\,\mathcal{E}_\theta(s')\Big],$$

where $\mathcal{E}_\theta(s')$ is the kinetic energy of the generative dynamics at $s'$ and $\alpha\ge 0$ weights the energy penalty. The operator is a $\gamma$-contraction in the sup-norm and therefore admits a unique fixed point $Q^{\pi_\theta}$. The fixed point is the energy-regularized value function

$$Q^{\pi_\theta}(s,a) = \mathbb{E}\!\left[\sum_{t\ge 0}\gamma^t\,r(s_t,a_t) \;-\; \alpha\sum_{t\ge 1}\gamma^t\,\mathcal{E}_\theta(s_t) \;\middle|\; s_0=s,\ a_0=a\right].$$

In practice, the target used for off-policy critic learning is

$$y = r + \gamma\,\big[\,Q_{\bar\phi}(s',a') - \alpha\,\hat{\mathcal{E}}_\theta(s')\,\big],$$

with target critics $Q_{\bar\phi}$, sampled next action $a'\sim\pi_\theta(\cdot\mid s')$, and a discretized kinetic-energy estimate $\hat{\mathcal{E}}_\theta(s')$ (Lv et al., 13 Feb 2026).
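The energy-regularized target can be sketched as a one-line computation. In the sketch below, the function name, the argument names, the terminal-state handling, and especially the minimum over several target critics (a clipped double-Q convention) are assumptions for illustration; the source only states that target critics, a sampled next action, and a discretized energy estimate are used.

```python
def critic_target(reward, next_q_values, next_energy, alpha, gamma=0.99, done=False):
    """Energy-regularized TD target: r + gamma * (Q_target(s', a') - alpha * E_hat(s')).

    next_q_values: target-critic values at (s', a'); their minimum is used here,
                   which is an assumed convention rather than a documented one.
    next_energy:   discretized kinetic-energy estimate at s'.
    """
    if done:
        return reward  # no bootstrap beyond a terminal transition (assumption)
    return reward + gamma * (min(next_q_values) - alpha * next_energy)

y = critic_target(1.0, next_q_values=[10.0, 12.0], next_energy=2.0, alpha=0.5, gamma=0.9)
```

The energy term lowers the bootstrap value of next states whose generative paths are energetically expensive, playing a role analogous to the entropy bonus in soft Q-learning but with the opposite sign convention.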

The actor minimizes

$$J_\pi(\theta) = \mathbb{E}_{s\sim\mathcal{D}}\Big[\alpha\,\mathcal{E}_\theta(s) - Q_\phi\big(s, a_\theta(s)\big)\Big],$$

where $a_\theta(s) = X_1$ is obtained by integrating the generative dynamics and $\mathcal{E}_\theta(s)$ is the kinetic energy of that path. This objective is optimized with pathwise differentiation through the numerical solver. FLAC also introduces a Lagrangian dual mechanism to tune $\alpha$ automatically under an average energy constraint

$$\mathbb{E}_{s\sim\mathcal{D}}\big[\mathcal{E}_\theta(s)\big] \;\le\; \mathcal{E}_{\mathrm{target}}.$$

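Writing $\bar{\mathcal{E}}$ for the batch estimate of the average kinetic energy, $\mathcal{E}_{\mathrm{target}}$ for the energy budget, and $\eta$ for the dual step size, the dual update is a projected gradient step of the form

$$\alpha \;\leftarrow\; \max\big\{0,\; \alpha + \eta\,(\bar{\mathcal{E}} - \mathcal{E}_{\mathrm{target}})\big\}.$$

If the average energy exceeds the target, $\alpha$ increases; if it is below target, $\alpha$ decreases. Empirically, $\alpha$ often exhibits a “decrease then increase” pattern, which the paper interprets as looser early exploration followed by later re-tightening of the energy budget (Lv et al., 13 Feb 2026).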
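The direction of the multiplier update described above can be sketched as projected dual ascent. The exact parameterization and step rule used in the paper are not reproduced here; the function name, the learning rate, and the projection onto non-negative values are assumptions.

```python
def dual_step(alpha, batch_energy, target_energy, lr=0.01):
    """One projected dual-ascent step for the constraint mean energy <= target.

    The multiplier grows when the measured energy exceeds the budget and
    shrinks (but never below zero) when the policy is under budget.
    """
    return max(0.0, alpha + lr * (batch_energy - target_energy))

alpha_over = dual_step(1.0, batch_energy=3.0, target_energy=2.0, lr=0.1)
alpha_under = dual_step(1.0, batch_energy=1.0, target_energy=2.0, lr=0.1)
alpha_floor = dual_step(0.01, batch_energy=0.0, target_energy=2.0, lr=0.1)
```

A log-parameterized multiplier is a common alternative that keeps the weight positive without an explicit projection; either choice preserves the increase/decrease behaviour described in the text.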

On the empirical side, FLAC is evaluated on DMControl “hard” tasks, including Humanoid Stand/Run/Walk and Dog Run/Trot/Stand/Walk, and on 12 HumanoidBench H1 tasks. Relative to model-free baselines such as TD7, SAC, DIME, SAC-Flow, and FlowRL, FLAC matches or outperforms them on many high-dimensional tasks while using Midpoint Euler with a small number of function evaluations (NFE) for both training and evaluation. The paper further reports that automatic Lagrangian tuning consistently outperforms fixed regularization, that performance on H1-Walk is robust across a broad hyperparameter range, and that increasing NFE beyond a small value offers little gain in final performance (Lv et al., 13 Feb 2026).

4. Potential-field guidance as a precursor to least-energy actor–critic

A plausible precursor to FLAC is the actor–critic–2 construction in “Potential Field Guided Actor-Critic Reinforcement Learning,” which extends standard actor–critic to actor–critic–$N$ by combining multiple critics under a shared policy. In the instantiated actor–critic–2 case, critic 1 is reward-based and model-free, while critic 2 is potential-field-based and effectively model-based. The combined deterministic objective is a weighted combination of the two critics, and the actor is updated along the action-gradient of that combined critic, propagated through the deterministic policy.

The field term is built from an artificial potential

$$U(s) = U_{\mathrm{att}}(s) + U_{\mathrm{rep}}(s),$$

where the attractive part $U_{\mathrm{att}}$ is quadratic in distance to the goal and the repulsive part $U_{\mathrm{rep}}$ is active within an obstacle influence distance. Action evaluation uses the one-step potential-based quantity

$$q_{PF}(s,a) = -U(s)\,[1-\cos(\chi)],$$

with $\chi$ the angle between the action $a$ and the field force $f=-\nabla U(s)$ (Ren, 2020).
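To make the alignment term concrete, the sketch below evaluates a quantity of the form $q_{PF}(s,a) = -U(s)[1-\cos(\chi)]$ for a purely attractive quadratic potential. The gain `k_att`, the omission of the repulsive term, and the handling of zero-length vectors are assumptions for illustration, not details taken from Ren (2020).

```python
import math

def attractive_potential(pos, goal, k_att=1.0):
    # Quadratic in the distance to the goal.
    return 0.5 * k_att * sum((p - g) ** 2 for p, g in zip(pos, goal))

def attractive_force(pos, goal, k_att=1.0):
    # f = -grad U for the quadratic potential above.
    return [-k_att * (p - g) for p, g in zip(pos, goal)]

def q_pf(potential, force, action):
    """Alignment score -U * (1 - cos(chi)), chi = angle between action and force."""
    dot = sum(a * f for a, f in zip(action, force))
    norm = math.hypot(*action) * math.hypot(*force)
    if norm == 0.0:
        return 0.0  # angle undefined; treated as no field guidance (assumption)
    return -potential * (1.0 - dot / norm)

pos, goal = [2.0, 0.0], [0.0, 0.0]
U = attractive_potential(pos, goal)   # 2.0
f = attractive_force(pos, goal)       # points from pos toward the goal
aligned = q_pf(U, f, [-1.0, 0.0])     # action along the force
orthogonal = q_pf(U, f, [0.0, 1.0])   # action perpendicular to the force
opposed = q_pf(U, f, [1.0, 0.0])      # action against the force
```

The score is zero for an action aligned with the force and most negative for an action that opposes it, with the penalty scaled by the local potential.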

The structural property of this critic is that its gradient magnitude scales with $U(s)$ and its direction depends on the alignment angle $\chi$. High-potential states therefore induce strong gradients that drive actions to align with the field direction, whereas low-potential states contribute little field guidance. The paper argues that large potential values correspond to states far from the goal or near obstacles, where potential fields contain strong prior information and the potential-field-based critic should be trusted more; small potential values correspond to local minima or moving-target situations, where the reward critic should dominate. In the experiments, the weight on the potential-field critic is fixed at a constant value, so the state dependence arises implicitly from the scaling by $U(s)$ rather than from an explicit state-dependent weighting schedule (Ren, 2020).

This motivates a local least-energy reading. The potential $U(s)$ is an energy landscape, classical potential field control follows the force $f=-\nabla U(s)$, and $q_{PF}(s,a)$ is an alignment penalty between the policy action and the energy descent direction. The paper describes the potential-field critic as encouraging “least-energy behavior locally.” The associated predator–prey experiments, based on OpenAI MPE with continuous control and a sparse reward on capture, show that PGDDPG converges faster and to a higher success rate than DDPG in one predator–prey configuration, that DDPG predators fail in a second configuration while PGDDPG predators learn to cooperate with high success, and that potential field evaluation can partly compensate for lack of communication in multi-agent cooperation (Ren, 2020).

5. Continuous-time, mean-field, and control-theoretic generalizations

The least-energy viewpoint also appears in continuous-time stochastic control. “Solving Time-Continuous Stochastic Optimal Control Problems: Algorithm Design and Convergence Analysis of Actor-Critic Flow” studies a finite-horizon stochastic control problem

$$dX_t = b(X_t,u_t)\,dt + \sigma(X_t)\,dW_t, \qquad t\in[0,T],$$

with cost

$$J[u] = \mathbb{E}\!\left[\int_0^T r(X_t,u_t)\,dt + g(X_T)\right].$$

In many examples, the running cost includes a quadratic control penalty such as $\frac12 \|u\|^2$, which the paper identifies as an energy or control-effort term. The actor–critic flow is defined in algorithm time $\tau$ by a system of coupled ODEs that evolve the critic estimates together with the actor’s control field $u^\tau(t,x)$.

The critic is derived from a low-variance Itô temporal difference whose squared error vanishes pathwise when the value and its gradient are exact, and the main theorem establishes global linear convergence of a Lyapunov functional under a suitable actor–critic speed ratio (Zhou et al., 2024).

A complementary extension appears in “Actor-Critic learning for mean-field control in continuous time,” where the state dynamics depend on both the representative state and the population distribution $\mu$, and policies are randomized kernels $\pi(\cdot\mid t,x,\mu)$ on Wasserstein space. The objective is entropy-regularized: the expected cost is augmented with an entropy regularizer on the randomized policy.

The paper derives a score-function policy-gradient representation involving the score $\nabla_\theta \log \pi_\theta$ of the parameterized policy and a mean-field correction operator, and develops both offline and online actor–critic algorithms based on empirical estimation of the population distribution across episodes. In the linear-quadratic mean-field setting, the running cost includes the quadratic term

$$a^\top N a,$$

which is explicitly a control-energy or effort term, and the optimal entropy-regularized policy is linear-Gaussian (Frikha et al., 2023).

Taken together, these works show that field least-energy actor–critic ideas extend naturally beyond path-space generative policies. In continuous-time control, least energy appears as quadratic effort in the Hamiltonian; in mean-field control, it appears jointly with entropy regularization and population-dependent value functions; and in both settings the actor update can be interpreted as gradient flow over a control field. This suggests that FLAC-style reasoning is compatible with HJB structure, Fokker–Planck dynamics, and Wasserstein-space parameterizations, even though the named FLAC algorithm itself is formulated in off-policy continuous-control RL with iterative generative actors.

6. Critic-field geometry, smoothness, and open issues

A different but closely related field-centered perspective is developed in “Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic.” That paper treats the critic $Q(s,a)$ as a scalar field over the state–action manifold and the actor’s learning signal as the induced action-gradient field $\nabla_a Q(s,a)$. Using implicit differentiation around the greedy action $a^*(s) = \arg\max_a Q(s,a)$, it proves

$$\nabla_s a^*(s) = -\big[\nabla^2_{aa} Q(s,a^*(s))\big]^{-1}\,\nabla^2_{as} Q(s,a^*(s)),$$

and, under bounded mixed partials $\|\nabla^2_{as} Q\| \le L$ and uniformly negative action curvature $\nabla^2_{aa} Q \preceq -m I$ with $m>0$,

$$\|\nabla_s a^*(s)\| \;\le\; \frac{L}{m}.$$

This identifies policy non-smoothness with the ratio of mixed state–action Hessian magnitude to action-space curvature, rather than with the policy network alone (Lee et al., 30 Jan 2026).
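The implicit-differentiation relation can be checked on a toy critic. The sketch below assumes a scalar state and action and a hypothetical quadratic critic $Q(s,a) = -\tfrac12 m a^2 + c\,s\,a$, whose greedy action is available in closed form; the constants `m` and `c` are illustrative only.

```python
def greedy_action(s, m=2.0, c=3.0):
    # Q(s, a) = -0.5 * m * a**2 + c * s * a is maximized at a* = c * s / m.
    return c * s / m

def sensitivity_from_hessians(m=2.0, c=3.0):
    # Implicit function theorem: da*/ds = -(Q_aa)^{-1} * Q_as.
    q_aa = -m  # action curvature (negative for a maximum)
    q_as = c   # mixed state-action partial
    return -q_as / q_aa

def finite_difference_sensitivity(s, eps=1e-6, m=2.0, c=3.0):
    # Central difference of the closed-form greedy action.
    return (greedy_action(s + eps, m, c) - greedy_action(s - eps, m, c)) / (2.0 * eps)

analytic = sensitivity_from_hessians()
numeric = finite_difference_sensitivity(0.7)
bound = abs(3.0) / 2.0  # mixed-partial magnitude divided by curvature
```

Here the bound is attained exactly: the policy slope equals the mixed-partial magnitude divided by the action curvature, so flattening the mixed partial or sharpening the curvature both make the greedy policy smoother.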

The proposed PAVE regularizer modifies only the critic loss:

$$\mathcal{L}_{\mathrm{critic}} = \mathcal{L}_{\mathrm{TD}} + \lambda_{\mathrm{MPR}}\,\mathcal{L}_{\mathrm{MPR}} + \lambda_{\mathrm{VFC}}\,\mathcal{L}_{\mathrm{VFC}} + \lambda_{\mathrm{Curv}}\,\mathcal{L}_{\mathrm{Curv}}.$$

Here, Mixed-Partial Regularization ($\mathcal{L}_{\mathrm{MPR}}$) penalizes finite differences of $\nabla_a Q$ under Gaussian perturbations of $s$, Vector Field Consistency ($\mathcal{L}_{\mathrm{VFC}}$) penalizes temporal changes of $\nabla_a Q$ across transitions, and Curvature Preservation ($\mathcal{L}_{\mathrm{Curv}}$) penalizes action Hessians that are insufficiently negative. The paper reports smoothness and robustness comparable to policy-side smoothness regularization methods while maintaining competitive task performance, without modifying the actor, and interprets the method as stabilizing the geometry of the Q-gradient field (Lee et al., 30 Jan 2026).
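A finite-difference penalty on the action-gradient under state perturbations might look roughly like the sketch below. It uses scalar states and actions, numerical gradients, and hypothetical critics; the function names, noise scale, and sample count are assumptions, and this is not the PAVE loss as implemented by its authors.

```python
import random

def grad_a(q, s, a, eps=1e-4):
    # Central finite difference for dQ/da at a scalar (s, a).
    return (q(s, a + eps) - q(s, a - eps)) / (2.0 * eps)

def mixed_partial_penalty(q, s, a, noise_std=0.1, n_samples=32, seed=0):
    """Mean squared change of dQ/da when s is perturbed by Gaussian noise."""
    rng = random.Random(seed)
    base = grad_a(q, s, a)
    total = 0.0
    for _ in range(n_samples):
        delta = rng.gauss(0.0, noise_std)
        total += (grad_a(q, s + delta, a) - base) ** 2
    return total / n_samples

def q_coupled(s, a):
    # Hypothetical critic with a state-action cross term (nonzero mixed partial).
    return 3.0 * s * a - a * a

def q_separable(s, a):
    # Hypothetical critic with no cross term (zero mixed partial).
    return s * s - a * a

penalty_coupled = mixed_partial_penalty(q_coupled, s=0.5, a=0.2)
penalty_separable = mixed_partial_penalty(q_separable, s=0.5, a=0.2)
```

The separable critic incurs essentially no penalty because its action-gradient does not depend on the state, while the coupled critic is penalized in proportion to the squared mixed partial.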

For FLAC proper, the main limitations and open directions are of a different kind. The 2026 FLAC paper notes that the kinetic-energy penalty is isotropic across action dimensions, that the cleanest theory assumes bounded rewards, bounded action spaces, regular vector fields, and high-entropy priors, and that the method requires differentiable integration and backpropagation through trajectories even though NFE is small. It also identifies anisotropic or state-dependent energy constraints as future work. The PAVE paper, from a critic-centered angle, raises complementary open questions about stochastic policies, multi-modal Q landscapes, very high-dimensional action spaces, partial observability, multi-agent settings, and more explicit global energy functionals on critic fields (Lv et al., 13 Feb 2026, Lee et al., 30 Jan 2026).

The combined picture is that FLAC is presently a specific path-space, kinetic-energy-regularized actor–critic algorithm, but it sits inside a broader research program in which actor–critic learning is organized around fields and energies rather than around reward signals alone. In one strand, the actor follows a low-energy generative bridge relative to a reference process; in another, it is guided by analytic potential fields; in another, it evolves as a continuous-time control field; and in yet another, it inherits smoothness from a regularized critic-gradient field. The common principle is that the geometry of the field—velocity, potential, control, or critic—becomes a first-class object in policy optimization.