Variational Markovian Option Critic (VMOC)

Updated 11 May 2026
  • VMOC introduces a hierarchical reinforcement learning framework that uses variational inference within the HiT-MDP to construct diverse latent option embeddings, with binary optimality variables for actions and options.
  • It optimizes an evidence lower bound (ELBO) that balances task reward, option diversity, and policy entropy, ensuring convergence to optimal soft option policies.
  • Empirical validations in language reasoning and control environments demonstrate that VMOC effectively replaces explicit step-by-step reasoning with robust, implicit latent actions.

The Variational Markovian Option Critic (VMOC) is an off-policy hierarchical reinforcement learning (HRL) algorithm built on variational inference within the HiT-MDP (Hidden Temporal Markov Decision Process) framework. It constructs a diverse library of latent, temporally extended actions (“options”) as abstract skill embeddings that support high-level reasoning in both language and control tasks. By learning and acting in this abstract latent option space, VMOC reasons implicitly and avoids the computational cost of explicit, step-by-step reasoning traces, while continuous MDP homomorphisms provide rigorous guarantees on policy optimality (Li et al., 22 Jul 2025).

1. Probabilistic Graphical Model and Latent Option Embeddings

VMOC’s framework introduces two binary “optimality” variables at each time step: $\mathcal{A}^{(A)}_t \in \{0,1\}$ for actions and $\mathcal{O}^{(O)}_t \in \{0,1\}$ for options (written $\mathcal{A}_t$ and $\mathcal{O}_t$ below). A full-option trajectory takes the form $\tau = \{s_0, o_{-1}, a_0, o_0, s_1, a_1, o_1, \ldots\}$.

The joint optimal-trajectory distribution is:

$$P(\tau, \mathcal{A}_{1:T}, \mathcal{O}_{1:T}) \propto P(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, P(\mathcal{A}_t \mid s_t, a_t)\, P(\mathcal{O}_t \mid s_t, a_t, o_{t-1})$$

with the local likelihoods:

  • $P(\mathcal{A}_t = 1 \mid s_t, a_t) = \exp[r(s_t, a_t)]$, where $r$ is the environmental reward,
  • $P(\mathcal{O}_t = 1 \mid s_t, a_t, o_{t-1}) = \exp[f(s_t, a_t, o_{t-1})]$, where $f$ is a non-positive diversity regularizer, instantiated as the mutual information $f = I[O; (s,a)]$.
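As a minimal sketch of these likelihoods: because $\exp[r]$ and $\exp[f]$ are read as probabilities, both terms have to be non-positive, and the log-probability that every step of a trajectory is optimal is simply the sum of the per-step terms. The function name and the toy numbers are illustrative assumptions, not values from the paper.

```python
import math

def log_optimality(rewards, diversity):
    """Log-probability that every A_t = 1 and O_t = 1 along one trajectory.

    rewards[t]   plays the role of r(s_t, a_t)
    diversity[t] plays the role of f(s_t, a_t, o_{t-1})
    Both must be <= 0 so that exp(.) is a valid probability.
    """
    if any(x > 0 for x in list(rewards) + list(diversity)):
        raise ValueError("r and f must be non-positive")
    return sum(rewards) + sum(diversity)

# Toy three-step trajectory (made-up numbers).
r = [-0.5, -0.1, 0.0]
f = [-0.2, -0.3, -0.1]
log_p = log_optimality(r, f)
prob = math.exp(log_p)  # probability that the whole trajectory is "optimal"
```

Trajectories with higher reward and higher diversity score receive exponentially more weight, which is what the posterior over optimal trajectories exploits.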

The true trajectory posterior is approximated by the HiT-MDP variational distribution $q$, which shares the environment dynamics and replaces the optimality likelihoods with two learned policies:

$$q(\tau) = P(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t, o_t)\, q(o_t \mid s_t, o_{t-1})$$

  • $q(a_t \mid s_t, o_t)$ is the intra-option policy,
  • $q(o_t \mid s_t, o_{t-1})$ is the option policy, implemented via a learned embedding matrix with one embedding vector per option,
  • Option variables $o_t$ function as discrete latent “thought” embeddings.

This graphical model enables abstraction of decision-making into a space of diverse latent skills.
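The two-level sampling described above can be sketched as follows. This is only an illustration under stated assumptions: the linear heads, the concatenation of state features with an option embedding, the state dimension, and the number of actions are all hypothetical choices, and the weights are random rather than learned. The 4 options and 40-dimensional embeddings mirror the defaults listed in Section 5.

```python
import math
import random

NUM_OPTIONS, EMBED_DIM = 4, 40      # defaults listed in Section 5
STATE_DIM, NUM_ACTIONS = 5, 3       # arbitrary sizes for this sketch
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

# Option-embedding matrix: one row per option (random here, learned in practice).
W = random_matrix(NUM_OPTIONS, EMBED_DIM)
# Hypothetical linear heads acting on [state features ; option embedding].
option_head = random_matrix(NUM_OPTIONS, STATE_DIM + EMBED_DIM)
action_head = random_matrix(NUM_ACTIONS, STATE_DIM + EMBED_DIM)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def head(weights, features):
    return softmax([sum(w * x for w, x in zip(row, features)) for row in weights])

def step(state, prev_option):
    # Option policy: conditions on the state and the previous option's embedding.
    p_opt = head(option_head, state + W[prev_option])
    option = rng.choices(range(NUM_OPTIONS), weights=p_opt)[0]
    # Intra-option policy: conditions on the state and the new option's embedding.
    p_act = head(action_head, state + W[option])
    action = rng.choices(range(NUM_ACTIONS), weights=p_act)[0]
    return option, action, p_opt, p_act

state = [0.1, -0.2, 0.3, 0.0, 0.5]
option, action, p_opt, p_act = step(state, prev_option=0)
```

Feeding the embedding row rather than a one-hot index into both heads is what lets options act as points in a shared latent space.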

2. Variational Objective and ELBO Derivation

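The central optimization target is the evidence lower bound (ELBO) on the likelihood of optimal trajectories, written here as $\mathcal{L}(q)$:

$$\log P(\mathcal{A}_{1:T} = 1, \mathcal{O}_{1:T} = 1) \ge \mathbb{E}_{q(\tau)}\big[\log P(\tau, \mathcal{A}_{1:T} = 1, \mathcal{O}_{1:T} = 1) - \log q(\tau)\big] = \mathcal{L}(q)$$

Upon substitution and cancellation of model dynamics terms, which appear in both $P$ and $q$, only the local likelihoods, the intra-option policy $q(a_t \mid s_t, o_t)$, and the option policy $q(o_t \mid s_t, o_{t-1})$ remain:

$$\mathcal{L}(q) = \mathbb{E}_{q(\tau)}\Big[\sum_{t=0}^{T-1} \big( r(s_t, a_t) + f(s_t, a_t, o_{t-1}) - \log q(a_t \mid s_t, o_t) - \log q(o_t \mid s_t, o_{t-1}) \big)\Big]$$

Grouping yields expected reward plus diversity, regularized by the entropies of both policies:

$$\mathcal{L}(q) = \mathbb{E}_{q(\tau)}\Big[\sum_{t=0}^{T-1} \big( r(s_t, a_t) + f(s_t, a_t, o_{t-1}) + \mathcal{H}[q(\cdot \mid s_t, o_t)] + \mathcal{H}[q(\cdot \mid s_t, o_{t-1})] \big)\Big]$$

Maximizing $\mathcal{L}(q)$ thus trades off task reward, option diversity, and policy entropy, directly paralleling the soft option-critic principle.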
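To make the trade-off concrete, the following sketch evaluates one time step of the grouped objective (reward plus diversity plus the two policy entropies) for discrete policies. The function names and numbers are illustrative assumptions. It shows that, at equal reward and diversity, a spread-out intra-option policy scores higher than a deterministic one by exactly its entropy.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def elbo_step(reward, diversity, action_probs, option_probs):
    """One time step of the grouped objective: r + f + H[intra-option] + H[option]."""
    return reward + diversity + entropy(action_probs) + entropy(option_probs)

uniform_actions = [0.25] * 4            # maximally uncertain intra-option policy
greedy_actions = [1.0, 0.0, 0.0, 0.0]   # deterministic intra-option policy
option_probs = [0.5, 0.5]

spread = elbo_step(-1.0, -0.1, uniform_actions, option_probs)
peaked = elbo_step(-1.0, -0.1, greedy_actions, option_probs)
entropy_bonus = spread - peaked         # equals log(4), the entropy of the uniform policy
```

In practice the entropy terms are scaled by temperatures, so a deterministic policy is preferred once its reward advantage outweighs the scaled entropy bonus.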

3. Continuous HiT-MDP Homomorphisms and Optimality Guarantees

Let state-option pairs be $\tilde{s} = (s, o)$, forming the associated vector bundle. In standard MDP-homomorphism notation, an abstract MDP $\bar{M}$ (with reward $\bar{R}$ and transition kernel $\bar{P}$) arises via a bundle map $f$ on state-option pairs and a state-dependent action map $g_{\tilde{s}}$, maintaining:

  • Reward invariance: $\bar{R}(f(\tilde{s}), g_{\tilde{s}}(a)) = R(\tilde{s}, a)$,
  • Transition equivariance: $\bar{P}(B \mid f(\tilde{s}), g_{\tilde{s}}(a)) = P(f^{-1}(B) \mid \tilde{s}, a)$ for every measurable set $B$ of abstract state-option pairs.

Optimal Value Equivalence Theorem: If $h = (f, g_{\tilde{s}})$ is a continuous HiT-MDP homomorphism,

$$Q^{*}(\tilde{s}, a) = \bar{Q}^{*}(f(\tilde{s}), g_{\tilde{s}}(a))$$

Policy Lifting asserts that, for an abstract policy $\bar{\pi}$, a lifted policy $\pi^{\uparrow}$ exists such that its pushforward under $g_{\tilde{s}}$ equals $\bar{\pi}$, ensuring

$$Q^{\pi^{\uparrow}}(\tilde{s}, a) = \bar{Q}^{\bar{\pi}}(f(\tilde{s}), g_{\tilde{s}}(a))$$

Hence, solving the abstract MDP via VMOC recovers optimality in the original, unabstracted MDP, with guarantees that rest on representing the action-option space as a locally compact Hausdorff space.
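The value-equivalence claim can be checked numerically on a toy problem. The sketch below is a finite, deterministic stand-in for the continuous setting, so it only illustrates the idea: each integer state stands for a flattened state-option pair, and the maps `f` and `g`, the two MDPs, and the discount are all hypothetical. The original MDP is built so that reward invariance and transition equivariance hold, and Q-value iteration then confirms that optimal values agree under the maps.

```python
GAMMA = 0.9

# Abstract MDP: 2 states, 2 actions, deterministic transitions and rewards.
ABS_NEXT = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 1}
ABS_REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

def f(s):
    """State map: original states {0, 1, 2, 3} -> abstract states {0, 1}."""
    return s // 2

def g(s, a):
    """State-dependent action map: odd states have their action labels swapped."""
    return a if s % 2 == 0 else 1 - a

def next_state(s, a):
    # Lands somewhere in the preimage of the abstract successor (equivariance).
    return 2 * ABS_NEXT[(f(s), g(s, a))] + (s + a) % 2

def reward(s, a):
    # Copies the abstract reward (invariance).
    return ABS_REWARD[(f(s), g(s, a))]

def q_iteration(states, actions, step, rew, sweeps=500):
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        q = {(s, a): rew(s, a) + GAMMA * max(q[(step(s, a), b)] for b in actions)
             for s in states for a in actions}
    return q

q_orig = q_iteration(range(4), range(2), next_state, reward)
q_abs = q_iteration(range(2), range(2),
                    lambda s, a: ABS_NEXT[(s, a)],
                    lambda s, a: ABS_REWARD[(s, a)])

# Largest disagreement between Q*(s, a) and the abstract Q*(f(s), g_s(a)).
max_gap = max(abs(q_orig[(s, a)] - q_abs[(f(s), g(s, a))])
              for s in range(4) for a in range(2))
```

Because the abstract problem has half as many states, solving it and lifting the result is cheaper than solving the original, and that is the practical payoff of the homomorphism.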

4. The Off-Policy VMOC Algorithm and Training Procedure

VMOC operationalizes the ELBO maximization with an off-policy actor-critic architecture. Core components:

| Component type | Components | Role |
|---|---|---|
| Critic | Option critic, action critic | Value estimates for options and for actions |
| Policy | Option policy, intra-option policy | Selection of options and of actions |
| Temperature | Option temperature, action temperature | Regulation of policy entropy |

Training proceeds via environment rollouts and experience replay. Key steps:

  1. Collect transitions into a replay buffer.
  2. Sample batches for updates.
  3. Optimize the two critics toward their targets.
  4. Update the two policies via entropy-regularized gradients.
  5. Tune the two entropy temperatures.

The Adam optimizer and soft updates for target networks are used throughout.
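Two of these ingredients, soft target updates and temperature tuning, can be sketched in a few lines. The temperature rule below is the one used by Soft Actor-Critic and is an assumption here: the exact VMOC losses are not reproduced in this section, and the learning rate and target entropy in the example are arbitrary. The Polyak coefficient 0.005 and the initial temperature 0.1 are the defaults listed in Section 5.

```python
import math

def soft_update(target, online, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * w for t, w in zip(target, online)]

def temperature_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient step on the SAC-style loss J(alpha) = E[-alpha * (log pi + target_entropy)].

    The step is taken on log(alpha) so that alpha stays positive:
    dJ / d log_alpha = -alpha * mean(log pi + target_entropy).
    """
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_term
    return log_alpha - lr * grad

# Target weights move 0.5% of the way toward the online weights.
target = soft_update([0.0, 0.0], [1.0, -2.0])

# Sampled log-probabilities imply an entropy of about 0.105, below the target of 1.0,
# so the temperature should increase to push the policy toward more exploration.
log_alpha = math.log(0.1)
new_log_alpha = temperature_step(log_alpha, log_probs=[-0.105, -0.105], target_entropy=1.0)
```

With two policies, the same rule would be applied twice, once per temperature, each with its own target entropy.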

5. Core Update Equations and Hyper-Parameter Specifications

VMOC employs critic losses and policy gradients derived from the ELBO. The default hyper-parameters are:

| Parameter | Value | Description |
|---|---|---|
| Discount ($\gamma$) | 0.99 | Future reward discount |
| Critic-target | 0.005 | Polyak averaging parameter |
| Batch size | 256 | For replay buffer sampling |
| Replay buffer size | […] | Experience replay capacity |
| Learning rate (actor/critic/temp) | […] | Adam learning rates |
| Adam epsilon | […] | Adam optimizer epsilon |
| Initial temperatures | 0.1 | Entropy regularization |
| Target entropies | […], […] | For actions, options |
| Latent options | 4 | Number of option embeddings |
| Option embedding dim | 40 | Embedding vector dimension |

These settings support robust convergence and diversity in the learned option set.
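One way to carry these defaults in code is a small configuration object. This is a hypothetical container, not part of any released VMOC implementation: the class and field names are assumptions, and settings whose values are not given in the table are left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VMOCConfig:
    # Values stated in the table above.
    discount: float = 0.99
    target_update_rate: float = 0.005      # Polyak averaging parameter
    batch_size: int = 256
    initial_temperature: float = 0.1
    num_options: int = 4
    option_embedding_dim: int = 40
    # Values not given in the table: the caller has to supply them.
    replay_buffer_size: Optional[int] = None
    learning_rate: Optional[float] = None
    adam_epsilon: Optional[float] = None
    target_entropy_action: Optional[float] = None
    target_entropy_option: Optional[float] = None

    def missing(self):
        """Names of settings that still need a value before training can start."""
        return [name for name, value in vars(self).items() if value is None]

config = VMOCConfig()
unset = config.missing()
```

Checking `config.missing()` before training makes an unspecified setting fail loudly instead of silently falling back to a library default.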

6. Theoretical Guarantees and Optimality Preservation

Two main theoretical results underpin VMOC’s guarantee of global optimality:

  • Soft Option Policy Iteration (Theorem 3.2): Under exact tabular inference, alternating policy evaluation (E-step) and policy improvement (M-step) over the option and intra-option policies converges to the unique ELBO maximizer, which coincides with the true optimal soft option policies.
  • Continuous HiT-MDP Homomorphism (Theorems 5.1 & 5.4): If the abstraction map is a continuous HiT-MDP homomorphism, then the optimal values of the original and abstract MDPs coincide, and any lifted policy of the abstract optimum achieves identical value in the original MDP.

Together, these results imply that VMOC’s variational inference maximizes a principled lower bound of the control objective, while the abstract optimal policy discovered in latent option space lifts to a true optimum in the original process, with no loss of optimality.
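The policy-improvement step has a closed form in the simplest case, which the sketch below checks numerically. It is a single-state, single-level illustration rather than VMOC's two-level update, and the value estimates and temperature are made-up. For fixed values $Q$ and temperature $\alpha$, the policy proportional to $\exp(Q/\alpha)$ maximizes expected value plus $\alpha$ times entropy, and the maximum equals $\alpha \log \sum_a \exp(Q(a)/\alpha)$.

```python
import math

def soft_objective(policy, q_values, alpha):
    """Expected value plus alpha-weighted entropy of a discrete policy."""
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return sum(p * q for p, q in zip(policy, q_values)) + alpha * entropy

def soft_improvement(q_values, alpha):
    """Closed-form maximizer: pi(a) proportional to exp(Q(a) / alpha)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

q_values = [1.0, 0.5, -0.2]
alpha = 0.1

soft_pi = soft_improvement(q_values, alpha)
best = soft_objective(soft_pi, q_values, alpha)
greedy = soft_objective([1.0, 0.0, 0.0], q_values, alpha)
uniform = soft_objective([1 / 3, 1 / 3, 1 / 3], q_values, alpha)
log_sum_exp = alpha * math.log(sum(math.exp(q / alpha) for q in q_values))
```

Here `best` is at least as large as both `greedy` and `uniform` and matches `log_sum_exp`, which is the per-state fact that makes each improvement step monotone.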

7. Applications and Empirical Validation

VMOC’s approach has been validated on both complex logical reasoning benchmarks and challenging locomotion domains. The learned options serve as latent “reasoning steps” in language settings and as abstract skills in control environments, offering an efficient, implicit alternative to explicit step-by-step reasoning such as Chain-of-Thought prompting. The framework’s combination of temporal abstraction, diversity regularization, and strong theoretical grounding supports its deployment for scalable, robust skill acquisition across modalities (Li et al., 22 Jul 2025).
