
Entropy-Regularized LAPO Objective Overview

Updated 4 October 2025
  • The entropy-regularized LAPO objective is a framework that integrates an entropy penalty with latent action learning to enforce deterministic, state-agnostic mappings.
  • It ensures representations are disentangled and injective, preventing trivial or collapsed encodings and aligning latent actions with true actions.
  • The method enhances statistical efficiency by enabling unsupervised learning from unlabeled transitions and facilitating robust transfer with minimal labeled data.

The entropy-regularized LAPO objective refers to a class of loss or optimization formulations in which a latent action policy objective (typically arising in unsupervised action representation learning from state-transition data) is augmented with an explicit entropy penalty on the inverse dynamics model. This regularization is designed both to induce identifiability of the learned action representations and to ensure desirable statistical and practical properties for downstream policy learning. The approach has recently been analyzed in detail, with focus on conditions for identifiability, characteristic degeneracies in the absence of regularization, and implications for the statistical efficiency of policy learning with discovered discrete latent actions (Lachapelle, 1 Oct 2025).

1. Mathematical Formulation

The canonical entropy-regularized LAPO objective is defined in the setting where one seeks to extract a low-cardinality action representation $\hat{a}$ from transition data $(x, x')$, with $x$ the current state (observation), $x'$ the next state, and no explicit access to ground-truth action labels. The framework consists of:

  • a forward dynamics model $\hat{f} \in \mathcal{G}$, mapping $(x, \hat{a})$ to a predicted next observation;
  • an inverse dynamics model $\hat{q} \in \mathcal{Q}$, mapping $(x, x')$ to a distribution over latent actions.

The core objective is

$$\min_{\hat{f} \in \mathcal{G},\; \hat{q} \in \mathcal{Q}} \; \mathbb{E}_{x, x'}\left[\, \mathbb{E}_{\hat{a} \sim \hat{q}(\cdot \mid x, x')} \big\|x' - \hat{f}(x, \hat{a})\big\|_2^2 \;+\; \beta\, H\big(\hat{q}(\cdot \mid x, x')\big) \right],$$

where $H(\hat{q}(\cdot \mid x, x')) = -\sum_{\hat{a}} \hat{q}(\hat{a} \mid x, x') \log \hat{q}(\hat{a} \mid x, x')$ and $\beta > 0$ is a tunable trade-off parameter.

The entropy term penalizes stochasticity in the inverse mapping, driving $\hat{q}(\cdot \mid x, x')$ toward being nearly deterministic at optimality.
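As a concrete illustration, the following is a minimal PyTorch-style sketch of this objective for a small discrete latent-action alphabet, where the inner expectation over $\hat{a}$ can be computed exactly. The interfaces `inverse_model(x, x_next)` (returning logits over latent actions) and `forward_model(x, a_hat)` are illustrative assumptions, not an implementation from the paper.

```python
import torch
import torch.nn.functional as F

def lapo_loss(inverse_model, forward_model, x, x_next, beta):
    """Entropy-regularized LAPO loss with an exact expectation over K discrete latent actions."""
    logits = inverse_model(x, x_next)                 # (B, K) logits over latent actions
    q = logits.softmax(dim=-1)                        # q_hat(a_hat | x, x')
    B, K = q.shape
    # Exact inner expectation: weight each latent action's reconstruction error by q_hat.
    recon = torch.zeros(B, device=x.device)
    for k in range(K):
        a_hat = F.one_hot(torch.full((B,), k, dtype=torch.long, device=x.device), K).float()
        err = ((x_next - forward_model(x, a_hat)) ** 2).sum(dim=-1)   # ||x' - f_hat(x, a_hat)||_2^2
        recon = recon + q[:, k] * err
    entropy = -(q * q.clamp_min(1e-8).log()).sum(dim=-1)              # H(q_hat(. | x, x'))
    return (recon + beta * entropy).mean()
```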

2. Properties: Determinism, Disentanglement, Informativeness

The regularized objective, under sufficient conditions, provably enforces three key properties on the learned latent actions:

  • Determinism: For every $(x, x')$ in the support of the data, the minimizer $\hat{q}$ must satisfy

$$\hat{q}(\hat{a} \mid x, x') = \mathbb{1}[\hat{a} = \psi(x, x')]$$

for some deterministic map $\psi$. This outcome is a direct effect of entropy regularization with $\beta > 0$; only a deterministic $\hat{q}$ achieves $H(\hat{q}) = 0$.

  • Disentanglement: Under additional assumptions (chiefly, continuity and injectivity of the underlying forward process $(x, a) \mapsto x'$ and topological nondegeneracy of the state support for each action), there exists a function $\sigma: A \to \hat{A}$, independent of $x$, such that

$$\psi(x, x') = \sigma(a)$$

for observed transitions $(x, a, x')$. The latent action is thus determined solely by the (unknown) true action, rather than by state-specific information.

  • Informativeness (Injectivity): The mapping $\sigma$ is injective, so that distinct ground-truth actions correspond to distinct latent actions. This ensures the latent action "alphabet" does not collapse and preserves the full structure of the original action space.

These properties together guarantee that the representation discovered via the entropy-regularized LAPO objective recovers the true action labels up to relabeling, free of spurious dependencies on the state.
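Given transitions with held-out ground-truth actions, the disentanglement and informativeness properties can be checked directly. The sketch below is hypothetical evaluation code (the names `transitions` and `psi` are assumptions), not part of the analyzed framework.

```python
from collections import defaultdict

def check_latent_action_properties(transitions, psi):
    """transitions: iterable of (x, a, x_next) with ground-truth actions, used for evaluation only.
    psi: learned deterministic map taking (x, x_next) to a latent action index."""
    latents_per_action = defaultdict(set)
    for x, a, x_next in transitions:
        latents_per_action[a].add(psi(x, x_next))
    # Disentanglement: each true action maps to exactly one latent action,
    # independently of the state in which it was taken.
    disentangled = all(len(codes) == 1 for codes in latents_per_action.values())
    # Informativeness (injectivity): distinct true actions receive distinct latent actions.
    assigned = [next(iter(codes)) for codes in latents_per_action.values()]
    informative = len(assigned) == len(set(assigned))
    return disentangled, informative
```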

3. Sufficient Conditions for Identifiability

The correctness and identifiability guarantees depend on several assumptions:

  • Continuity: For each action $a$, the forward map $x \mapsto x'$ (the next state produced by taking $a$ in state $x$) is continuous.
  • Injectivity: For any state $x$ and actions $a_1 \neq a_2$, taking $a_1$ and $a_2$ in $x$ yields different next states.
  • Support Overlap and Connectedness: For any two actions $a_1, a_2$, the supports $\mathrm{supp}(p(x \mid a_1))$ and $\mathrm{supp}(p(x \mid a_2))$ are connected and intersect.

With these assumptions, the entropy-regularized LAPO minimizer is deterministic, disentangled, and informative (Theorem 1 in (Lachapelle, 1 Oct 2025)). The entropy regularization here operates as an inductive bias ruling out trivial or degenerate solutions that can arise in the absence of additional constraints.
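For intuition, these conditions hold in very simple systems. The toy check below (an illustrative sketch, not from the paper) uses one-dimensional deterministic dynamics $x' = x + a$, where every action is available in every state, so the per-action state supports are connected and overlap, and distinct actions always lead to distinct next states.

```python
import numpy as np

# Toy deterministic environment: x' = x + a on [-1, 1], with actions {-0.1, 0.0, 0.1}.
actions = np.array([-0.1, 0.0, 0.1])
states = np.linspace(-1.0, 1.0, 201)

# Injectivity: from every state, distinct actions reach distinct next states.
next_states = states[:, None] + actions[None, :]          # shape (201, 3)
injective = all(len(set(row)) == len(actions) for row in next_states)
print(injective)  # True

# Continuity holds since x + a is continuous in x; support overlap holds since every
# action can be taken anywhere in [-1, 1], a connected interval shared by all actions.
```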

4. Avoidance of Degenerate Solutions

In the absence of proper regularization and model constraints, degeneracies can arise:

  • Trivial Encodings: If the latent action space is left unconstrained and as rich as the observation space, the model could set $\hat{q}(\hat{a} \mid x, x') = \delta(x' - \hat{a})$ with $\hat{f}(x, \hat{a}) = \hat{a}$, copying the next observation into the latent action and circumventing the semantics of a meaningful action.
  • Collapsed Representations: For deterministic experts $\pi(a \mid x)$, it is possible to always assign a constant (unvarying) latent action in $\hat{q}$, leading to identifiability failure.

The entropy term, together with the restriction to a small latent action alphabet, prevents these pathologies: at the optimum, $\hat{q}(\cdot \mid x, x')$ must concentrate its mass on a single latent action given by a state-agnostic function of the observed transition, rather than letting the state itself "encode" the ground-truth action arbitrarily.
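The collapsed-representation pathology is easy to reproduce numerically. The following is an assumption-laden toy sketch (names and dynamics are hypothetical) showing that, with a deterministic expert, a forward model that reproduces the expert's effect from the state alone attains zero reconstruction error while the latent action carries no information.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=1000)
expert = lambda x: np.sign(x)                     # deterministic expert: a in {-1, +1}
next_states = states + 0.1 * expert(states)       # x' is then a function of x alone

# Degenerate solution: a constant latent action plus a forward model that bakes in the expert.
f_hat = lambda x, a_hat: x + 0.1 * np.sign(x)     # ignores a_hat entirely
recon_error = np.mean((next_states - f_hat(states, a_hat=0.0)) ** 2)
print(recon_error)  # 0.0: perfect reconstruction, yet the latent action is uninformative
```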

5. Statistical Benefits and Sample Efficiency

Once identifiability is established (i.e., the minimizer yields a deterministic, injective, disentangled mapping), learning a latent policy from vast amounts of unlabeled video data becomes not only possible but statistically efficient. The latent policy can then be mapped to the true action space via a simple classifier, trained on a much smaller, action-labeled dataset. This two-stage approach (unsupervised LAPO with entropy regularization, followed by a sparse supervised mapping) has proven highly sample-efficient and robust in practice, as exploited in the Genie and LAPA frameworks.

Discrete latent action representations in particular enable efficient downstream transfer with minimal labeled data and yield superior statistical efficiency compared to methods producing entangled or stochastic representations.
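A minimal sketch of the second, supervised stage follows; the decoder form (a majority-vote mapping) and all names are illustrative assumptions rather than the specific mechanism used in any cited framework.

```python
import numpy as np

def fit_latent_to_action_map(latent_ids, true_actions, n_latent, n_actions):
    """Fit a decoder from discrete latent actions to ground-truth actions on a small labeled set."""
    counts = np.zeros((n_latent, n_actions), dtype=int)
    for z, a in zip(latent_ids, true_actions):
        counts[z, a] += 1
    return counts.argmax(axis=1)   # decoder[z] = most frequent true action for latent z

# Hypothetical usage: decoder = fit_latent_to_action_map(psi_outputs, labels, n_latent=8, n_actions=8)
# A latent policy's output z is then grounded in the true action space as decoder[z].
```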

6. Algorithmic Implementation

The standard optimization employs stochastic minibatch updates, with gradient-based learning applied jointly to both $\hat{f}$ and $\hat{q}$. The entropy term $H(\hat{q}(\cdot \mid x, x'))$ is differentiable with respect to $\hat{q}$'s parameters and can be computed (or estimated) efficiently in both discrete and reparameterizable continuous settings. To ensure the regularizer is effective, $\beta$ should be set large enough to counteract the tendency of $\hat{q}$ to become diffuse; in practice it is tuned as a hyperparameter.

For discrete latent actions, categorical distributions are used; for continuous actions, reparametrization techniques are employed for backpropagation through samples from $\hat{q}$.
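A hedged PyTorch-style sketch of one minibatch update in the discrete case is shown below, using a straight-through Gumbel-softmax relaxation to backpropagate through latent-action samples. The model interfaces and the choice of relaxation are assumptions for illustration, not prescribed by the source.

```python
import torch
import torch.nn.functional as F

def lapo_train_step(inverse_model, forward_model, optimizer, x, x_next, beta=0.1, tau=1.0):
    """One gradient step on the entropy-regularized LAPO objective with discrete latent actions."""
    logits = inverse_model(x, x_next)                      # (B, K) logits over latent actions
    a_hat = F.gumbel_softmax(logits, tau=tau, hard=True)   # one-hot sample, straight-through gradient
    recon = F.mse_loss(forward_model(x, a_hat), x_next)    # mean squared reconstruction error
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    loss = recon + beta * entropy                          # entropy is penalized, not rewarded
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```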

7. Practical Implications and Empirical Observations

Discrete entropy-regularized LAPO representations have demonstrated superior empirical performance in real-world tasks that require discovery of semantically meaningful and transferable action spaces from high-dimensional, unannotated data. The process:

  • Reduces the reliance on large sets of manually labeled trajectories,
  • Enhances sample and label efficiency in policy learning,
  • Avoids degenerate representations that impair transfer or causality, and
  • Achieves transferability across environments due to disentanglement.

Such representations are particularly advantageous in robotic imitation learning, video-based reinforcement learning, and scenarios where collecting action labels is prohibitively costly.

Summary Table: Key Ingredients and Guarantees

| Component | Role in Objective | Effect at Optimum |
|---|---|---|
| $\lVert x' - \hat{f}(x, \hat{a}) \rVert_2^2$ | Enforces reconstructibility via actions | Maps $(x, \hat{a}) \mapsto x'$ |
| $H(\hat{q}(\cdot \mid x, x'))$ | Penalizes posterior stochasticity | Forces deterministic $\hat{q}$ |
| $\beta > 0$ | Controls strength of regularization | Tuned for identifiability |
| Support/intersection conditions | Rule out trivial state-dependent encodings | Enable disentanglement |

In conclusion, the entropy-regularized LAPO objective provides a principled, theoretically grounded approach to discovering robust, disentangled, and identifiable action representations from observation-only data under broad and realistic structural assumptions (Lachapelle, 1 Oct 2025). This framework underlies the high practical efficacy of discrete latent action learning in recent work.
