VMOC introduces a hierarchical reinforcement learning framework that uses variational inference within the HiT-MDP to construct diverse latent option embeddings with binary optimality variables.
It optimizes an evidence lower bound (ELBO) that balances task reward, option diversity, and policy entropy, ensuring convergence to optimal soft option policies.
Empirical validations in language reasoning and control environments demonstrate that VMOC effectively replaces explicit step-by-step reasoning with robust, implicit latent actions.
The Variational Markovian Option Critic (VMOC) is an off-policy hierarchical reinforcement learning (HRL) algorithm founded on variational inference within the HiT-MDP (Hidden Temporal Markov Decision Process) framework. It is designed to construct a diverse library of latent, temporally extended actions (“options”) as abstract skill embeddings, supporting high-level reasoning in both language and control tasks. By learning and acting in this abstract latent option space, VMOC reasons implicitly and avoids the computational cost of explicit, step-by-step reasoning traces. Continuous MDP homomorphisms provide the guarantee that policy optimality is preserved under this abstraction (Li et al., 22 Jul 2025).
1. Probabilistic Graphical Model and Latent Option Embeddings
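VMOC’s framework introduces two binary “optimality” variables at each time step: $A_t \in \{0,1\}$ for actions and $O_t \in \{0,1\}$ for options. A full option trajectory takes the form $\tau = \{s_0, o_{-1}, a_0, o_0, s_1, a_1, o_1, \ldots\}$, and the joint distribution over trajectories and optimality variables factorizes as

$$P(\tau, A_{1:T}, O_{1:T}) \propto P(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, P(A_t \mid s_t, a_t)\, P(O_t \mid s_t, a_t, o_{t-1}),$$

with the optimality likelihoods

$P(A_t = 1 \mid s_t, a_t) = \exp[r(s_t, a_t)]$, where $r$ is the environmental reward,
$P(O_t = 1 \mid s_t, a_t, o_{t-1}) = \exp[f(s_t, a_t, o_{t-1})]$, where $f$ is a non-positive diversity regularizer, instantiated as the mutual information $f = I[O; (s, a)]$.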
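The true trajectory posterior is approximated by the HiT-MDP variational distribution $q$, which keeps the environment dynamics and replaces the optimality likelihoods with two learned policies:

$$q(\tau) = P(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t, o_t)\, q(o_t \mid s_t, o_{t-1}),$$

where $q(a_t \mid s_t, o_t)$ is the intra-option policy,
and $q(o_t \mid s_t, o_{t-1})$ is the option policy, implemented via a learned option embedding matrix.
Option variables $o_t$ function as discrete latent “thought” embeddings.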
This graphical model enables abstraction of decision-making into a space of diverse latent skills.
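As a minimal sketch of how an option policy could be driven by a learned embedding matrix, the snippet below scores each option embedding against a context vector and applies a softmax. The scoring rule, the names `W`, `option_probs`, and `sample_option`, and the number of options are illustrative assumptions, not the authors' parameterization; only the embedding dimension of 40 comes from the reported settings.

```python
import math
import random

EMBED_DIM = 40     # option embedding dimension reported in the settings
NUM_OPTIONS = 4    # hypothetical number of options

rng = random.Random(0)
# Hypothetical embedding matrix W: one row (embedding vector) per option.
W = [[rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)] for _ in range(NUM_OPTIONS)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def option_probs(state_feat, prev_option):
    """Assumed form of q(o_t | s_t, o_{t-1}).

    The context is the state feature plus the previous option's embedding;
    each option is scored by its dot product with that context.
    """
    context = [x + w for x, w in zip(state_feat, W[prev_option])]
    logits = [sum(wi * ci for wi, ci in zip(W[o], context))
              for o in range(NUM_OPTIONS)]
    return softmax(logits)


def sample_option(state_feat, prev_option):
    """Draw the next option index from the assumed option policy."""
    probs = option_probs(state_feat, prev_option)
    return rng.choices(range(NUM_OPTIONS), weights=probs, k=1)[0]


state_feat = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
probs = option_probs(state_feat, prev_option=0)
o_t = sample_option(state_feat, prev_option=0)
```

Conditioning on the previous option through its embedding, rather than through a one-hot index, is one way to let related options share statistical strength; other conditioning schemes would fit the same interface.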
2. Variational Objective and ELBO Derivation
The central optimization target is the evidence lower bound (ELBO) on the likelihood of optimal trajectories:
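$$\log P(A_{1:T} = 1, O_{1:T} = 1) \ge \mathbb{E}_{q(\tau)}\big[\log P(\tau, A_{1:T} = 1, O_{1:T} = 1) - \log q(\tau)\big] =: \mathcal{L}(q).$$

Upon substitution and cancellation of model dynamics terms:

$$\mathcal{L}(q) = \mathbb{E}_{q(\tau)}\Big[\sum_{t=0}^{T-1} \big( r(s_t, a_t) + f(s_t, a_t, o_{t-1}) - \log q(a_t \mid s_t, o_t) - \log q(o_t \mid s_t, o_{t-1}) \big)\Big],$$

where $q(a_t \mid s_t, o_t)$ and $q(o_t \mid s_t, o_{t-1})$ are the intra-option and option policy factors of $q$.
Grouping yields expected reward plus diversity, together with entropy regularization of both policies:

$$\mathcal{L}(q) = \mathbb{E}_{q(\tau)}\Big[\sum_{t=0}^{T-1} \big( r(s_t, a_t) + f(s_t, a_t, o_{t-1}) + \mathcal{H}[q(\cdot \mid s_t, o_t)] + \mathcal{H}[q(\cdot \mid s_t, o_{t-1})] \big)\Big].$$

Maximizing $\mathcal{L}(q)$ thus trades off task reward, option diversity, and policy entropy, directly paralleling the soft option-critic principle.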
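The trade-off can be made concrete with a single-trajectory Monte Carlo estimate: sum the per-step reward and diversity terms and subtract the log-probabilities of the sampled action and option, which become the two entropy terms in expectation. The function name `elbo_estimate` and the numbers in the example are hypothetical, and the sketch assumes the per-step quantities have already been computed for a trajectory sampled from $q$.

```python
import math


def elbo_estimate(rewards, diversity, logq_actions, logq_options):
    """Single-trajectory estimate of the ELBO.

    Per-step inputs along one trajectory sampled from q:
      rewards[t]      = r(s_t, a_t)
      diversity[t]    = f(s_t, a_t, o_{t-1})
      logq_actions[t] = log-probability of a_t under the intra-option policy
      logq_options[t] = log-probability of o_t under the option policy
    """
    lengths = {len(rewards), len(diversity), len(logq_actions), len(logq_options)}
    if len(lengths) != 1:
        raise ValueError("all per-step lists must have the same length")
    return sum(r + f - la - lo
               for r, f, la, lo in zip(rewards, diversity, logq_actions, logq_options))


# Hypothetical two-step trajectory: uniform over 2 actions and over 4 options.
value = elbo_estimate(
    rewards=[1.0, 0.5],
    diversity=[-0.1, -0.2],
    logq_actions=[math.log(0.5)] * 2,
    logq_options=[math.log(0.25)] * 2,
)
# value = 1.2 + 6 * ln(2), roughly 5.359
```

Averaging this quantity over many sampled trajectories gives an unbiased estimate of the objective; the more uniform the two policies, the larger the entropy contribution.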
3. Continuous HiT-MDP Homomorphisms and Optimality Guarantees
Let state-option pairs be $x = (s, o)$, forming the associated vector bundle $E$. An abstract MDP $\bar{M}$ arises via a bundle map $\phi$ on state-option pairs and a state-dependent action map $g_x$, maintaining reward invariance and transition equivariance:

$$\bar{R}(\phi(x), g_x(a)) = R(x, a), \qquad \bar{P}(\cdot \mid \phi(x), g_x(a)) = \phi_{\#} P(\cdot \mid x, a).$$

Optimal Value Equivalence Theorem: If $h = (\phi, g)$ is a continuous HiT-MDP homomorphism, then

$$V^*(x) = \bar{V}^*(\phi(x)), \qquad Q^*(x, a) = \bar{Q}^*(\phi(x), g_x(a)).$$

Policy Lifting asserts that, for an abstract policy $\bar{\pi}$, a lifted policy $\pi^{\uparrow}$ exists such that its pushforward under $g_x$ equals $\bar{\pi}(\cdot \mid \phi(x))$, ensuring

$$Q^{\pi^{\uparrow}}(x, a) = \bar{Q}^{\bar{\pi}}(\phi(x), g_x(a)).$$
Hence, solving the abstract MDP via VMOC recovers optimality in the original, unabstracted MDP, with guarantees stemming from locally compact Hausdorff action–option space representation theory.
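A finite, deterministic toy problem can illustrate both claims, value equivalence and policy lifting, even though the theory itself concerns continuous spaces. In the sketch below a four-state MDP is built so that it maps onto a two-state abstract MDP; value iteration is run on both, and the greedy abstract policy is lifted back. All names (`PHI`, `g`, `step`, `reward`) and the MDP itself are hypothetical, and no options are modeled.

```python
GAMMA = 0.9

# Abstract MDP (hypothetical): 2 states, actions 0 = stay, 1 = switch.
R_ABS = [[0.0, 0.0], [1.0, 0.0]]   # R_ABS[x][u]


def step_abs(x, u):
    return x if u == 0 else 1 - x


# Original MDP: states {0, 1} map to abstract state 0, {2, 3} to abstract 1.
PHI = [0, 0, 1, 1]


def g(s, a):
    """State-dependent action map: odd states use flipped action labels."""
    return a if s % 2 == 0 else 1 - a


def step(s, a):
    nxt_abs = step_abs(PHI[s], g(s, a))
    return 2 * nxt_abs + (1 - s % 2)   # lands in the block PHI maps to nxt_abs


def reward(s, a):
    return R_ABS[PHI[s]][g(s, a)]      # reward invariance by construction


def value_iteration(n_states, step_fn, reward_fn, iters=500):
    V = [0.0] * n_states
    for _ in range(iters):
        V = [max(reward_fn(s, a) + GAMMA * V[step_fn(s, a)] for a in (0, 1))
             for s in range(n_states)]
    return V


def evaluate(policy, iters=500):
    """Iterative policy evaluation in the original MDP."""
    V = [0.0] * 4
    for _ in range(iters):
        V = [reward(s, policy[s]) + GAMMA * V[step(s, policy[s])] for s in range(4)]
    return V


V_abs = value_iteration(2, step_abs, lambda x, u: R_ABS[x][u])
V = value_iteration(4, step, reward)

# Greedy abstract policy, then lift: pick an action whose image under g matches.
pi_abs = [max((0, 1), key=lambda u, x=x: R_ABS[x][u] + GAMMA * V_abs[step_abs(x, u)])
          for x in range(2)]
pi_lift = [next(a for a in (0, 1) if g(s, a) == pi_abs[PHI[s]]) for s in range(4)]
V_lift = evaluate(pi_lift)
```

Here `V[s]` matches `V_abs[PHI[s]]` for every state, and the lifted policy attains the same values, which is the finite analogue of solving the problem once in the smaller space and transferring the solution back.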
4. The Off-Policy VMOC Algorithm and Training Procedure
VMOC operationalizes the ELBO maximization with an off-policy actor–critic architecture.
Option embedding dimension: 40 (the dimension of each option embedding vector).
These settings support robust convergence and diversity in the learned option set.
6. Theoretical Guarantees and Optimality Preservation
Two main theoretical results underpin VMOC’s guarantee of global optimality:
Soft Option Policy Iteration (Theorem 3.2): Under exact tabular inference, iterating policy evaluation (E-step) and policy improvement (M-step) over the intra-option and option policies converges to the unique ELBO maximizer, which coincides with the optimal soft option policies.
Continuous HiT-MDP Homomorphism (Theorems 5.1 & 5.4): If the abstraction map is a continuous HiT-MDP homomorphism, then the optimal value functions of the original and abstract MDPs coincide under that map, and any lifted policy of the abstract optimum achieves identical value in the original MDP.
Together, these results imply that VMOC’s variational inference maximizes a principled lower bound of the control objective, while the abstract optimal policy discovered in latent option space lifts to a true optimum in the original process, with no loss of optimality.
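The fixed-point behavior behind the first result can be illustrated with a much simpler stand-in: tabular soft (maximum-entropy) value iteration on a flat two-state MDP with no options. This is a deliberate simplification and not the VMOC update itself; the MDP, the names `NEXT`, `REWARD`, and `soft_value_iteration`, and the iteration count are all hypothetical. Rewards are kept non-positive so that $\exp[r]$ stays a valid probability, matching the optimality likelihood used in the graphical model.

```python
import math

GAMMA = 0.9
# Hypothetical deterministic MDP: NEXT[s][a] is the next state, REWARD[s][a] <= 0.
NEXT = [[0, 1], [0, 1]]
REWARD = [[-1.0, -0.5], [-2.0, 0.0]]


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_q(V):
    """Soft Q-values given state values V."""
    return [[REWARD[s][a] + GAMMA * V[NEXT[s][a]] for a in range(2)]
            for s in range(2)]


def soft_value_iteration(iters=1000):
    V = [0.0, 0.0]
    for _ in range(iters):
        # Soft Bellman backup: V(s) = log sum_a exp(Q(s, a)).
        V = [logsumexp(q) for q in soft_q(V)]
    Q = soft_q(V)
    # Boltzmann policy induced by the converged soft Q-values.
    policy = [[math.exp(q - logsumexp(row)) for q in row] for row in Q]
    return V, policy


V_soft, policy = soft_value_iteration()
residual = max(abs(V_soft[s] - logsumexp(soft_q(V_soft)[s])) for s in range(2))
```

Because the soft backup is a contraction for discount factors below one, the residual shrinks geometrically and the resulting policy is the unique fixed point for this toy problem.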
7. Applications and Empirical Validation
VMOC’s approach has been validated on both complex logical reasoning benchmarks and challenging locomotion domains. The learned options serve as latent “reasoning steps” in language settings and as abstract skills in control environments, offering an efficient, implicit alternative to explicit step-by-step reasoning such as Chain-of-Thought prompting. The framework’s combination of temporal abstraction, diversity regularization, and strong theoretical grounding supports its deployment for scalable, robust skill acquisition across modalities (Li et al., 22 Jul 2025).