Variational Markovian Option Critic (VMOC)

Updated 11 May 2026

VMOC is a hierarchical reinforcement learning framework that applies variational inference within the HiT-MDP to build diverse latent option embeddings, using binary optimality variables.
It optimizes an evidence lower bound (ELBO) that balances task reward, option diversity, and policy entropy; under exact tabular inference, the resulting iteration converges to the optimal soft option policies.
Empirical validation on language reasoning and control environments is reported to show that VMOC can replace explicit step-by-step reasoning with implicit latent actions.

The Variational Markovian Option Critic (VMOC) is an off-policy hierarchical reinforcement learning (HRL) algorithm founded on variational inference within the HiT-MDP (Hidden Temporal Markov Decision Process) framework. It is designed to construct a diverse library of latent, temporally-extended actions (“options”) as abstract skill embeddings, supporting high-level reasoning in both language and control tasks. VMOC performs implicit reasoning by learning and operating in an abstract latent option space, which avoids the computational cost of explicit, step-by-step reasoning traces. Its guarantees on policy optimality are established via continuous MDP homomorphisms (Li et al., 22 Jul 2025).

1. Probabilistic Graphical Model and Latent Option Embeddings

VMOC’s framework introduces two binary “optimality” variables at each time step: $\mathcal{A}_t \in \{0,1\}$ for actions and $\mathcal{O}_t \in \{0,1\}$ for options. A full-option trajectory takes the form $\tau = \{s_0, o_{-1}, a_0, o_0, s_1, a_1, o_1, \ldots \}$.

The joint optimal-trajectory distribution is:

$P(\tau, \mathcal{A}_{1:T}, \mathcal{O}_{1:T}) \propto P(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t,a_t)\; P(\mathcal{A}_t|s_t,a_t)\; P(\mathcal{O}_t|s_t,a_t,o_{t-1})$

with the local likelihoods:

$P(\mathcal{A}_t = 1 | s_t, a_t) = \exp[r(s_t, a_t)]$, where $r$ is the environmental reward,
$P(\mathcal{O}_t = 1 | s_t, a_t, o_{t-1}) = \exp[f(s_t, a_t, o_{t-1})]$, where $f$ is a non-positive diversity regularizer, instantiated as mutual information $f = I[O; (s,a)]$.

The true trajectory posterior is approximated by the HiT-MDP variational distribution $q$:

$q(\tau) = P(s_0) \prod_{t=0}^{T-1} P(s_{t+1}|s_t,a_t)\; \pi^A(a_t|s_t,o_t)\; \pi^O(o_t|s_t,o_{t-1})$

$\pi^A(a_t|s_t,o_t)$ is the intra-option policy,
$\pi^O(o_t|s_t,o_{t-1})$ is the option policy, implemented via a learned embedding matrix $\mathbf{W}$ in which each option is represented by its own embedding vector.
Option variables $o_t$ function as discrete latent “thought” embeddings.

This graphical model enables abstraction of decision-making into a space of diverse latent skills.
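As a rough illustration of an option policy backed by an embedding matrix, the sketch below looks up the previous option's embedding, concatenates it with the state, and maps the result to a categorical distribution over next options. The class name `EmbeddingOptionPolicy`, the linear head `U`, and the toy state are hypothetical choices made for this example; only the option count (4) and embedding dimension (40) come from the hyper-parameter table, and the actual network used by VMOC may differ.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

class EmbeddingOptionPolicy:
    """Hypothetical option policy over o_t given (s_t, o_{t-1})."""

    def __init__(self, num_options, embed_dim, state_dim, rng):
        # One embedding row per option (the learned embedding matrix).
        self.W = rng.normal(scale=0.1, size=(num_options, embed_dim))
        # Linear head mapping [state, embedding] to option logits (an assumption).
        self.U = rng.normal(scale=0.1, size=(state_dim + embed_dim, num_options))

    def probs(self, state, prev_option):
        features = np.concatenate([state, self.W[prev_option]])
        return softmax(features @ self.U)

    def sample(self, state, prev_option, rng):
        p = self.probs(state, prev_option)
        return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
policy = EmbeddingOptionPolicy(num_options=4, embed_dim=40, state_dim=3, rng=rng)
state = np.array([0.1, -0.2, 0.3])
p = policy.probs(state, prev_option=2)      # distribution over the 4 options
o_t = policy.sample(state, prev_option=2, rng=rng)
```

Conditioning on the previous option through its embedding, rather than a one-hot index, is what lets related options share structure in this kind of design.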

2. Variational Objective and ELBO Derivation

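The central optimization target is the evidence lower bound (ELBO) on the likelihood of optimal trajectories, written here with $\pi^A$ for the intra-option policy and $\pi^O$ for the option policy:

$\mathcal{L}(q) = \mathbb{E}_{q(\tau)}\left[\log P(\tau, \mathcal{A}_{1:T} = 1, \mathcal{O}_{1:T} = 1) - \log q(\tau)\right]$

Upon substitution and cancellation of model dynamics terms:

$\mathcal{L}(q) = \mathbb{E}_{q(\tau)}\left[\sum_{t=0}^{T-1} r(s_t,a_t) + f(s_t,a_t,o_{t-1}) - \log \pi^A(a_t|s_t,o_t) - \log \pi^O(o_t|s_t,o_{t-1})\right]$

Grouping the log-policy terms into entropies yields expected reward plus diversity plus the entropies of both policies:

$\mathcal{L}(q) = \mathbb{E}_{q(\tau)}\left[\sum_{t=0}^{T-1} r(s_t,a_t) + f(s_t,a_t,o_{t-1}) + \mathcal{H}[\pi^A(\cdot|s_t,o_t)] + \mathcal{H}[\pi^O(\cdot|s_t,o_{t-1})]\right]$

Maximizing $\mathcal{L}(q)$ thus trades off task reward, option diversity, and policy entropy, directly paralleling the soft option-critic principle.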
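A single-trajectory Monte Carlo estimate of this objective can be sketched as follows: per-step reward and diversity terms are added, and the log-probabilities of the sampled action and option are subtracted (their negated expectations are the policy entropies). The function name `elbo_estimate` and the toy numbers are assumptions for illustration only, not taken from the paper.

```python
import numpy as np

def elbo_estimate(rewards, diversity, logp_action, logp_option):
    """One-sample ELBO estimate for a trajectory sampled from q.

    rewards[t]      : r(s_t, a_t)
    diversity[t]    : f(s_t, a_t, o_{t-1})
    logp_action[t]  : log-probability of a_t under the intra-option policy
    logp_option[t]  : log-probability of o_t under the option policy
    """
    rewards = np.asarray(rewards, dtype=float)
    diversity = np.asarray(diversity, dtype=float)
    logp_action = np.asarray(logp_action, dtype=float)
    logp_option = np.asarray(logp_option, dtype=float)
    # Reward + diversity, minus log-probabilities (single-sample entropy terms).
    return float(np.sum(rewards + diversity - logp_action - logp_option))

# Toy two-step trajectory (arbitrary illustrative values).
value = elbo_estimate(
    rewards=[-2.0, -1.5],
    diversity=[-0.1, -0.2],
    logp_action=np.log([0.5, 0.25]),
    logp_option=np.log([0.5, 0.5]),
)
```

Averaging this quantity over many sampled trajectories gives an unbiased estimate of the grouped form, since the expected negative log-probability of a sample is the entropy of the distribution it was drawn from.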

3. Continuous HiT-MDP Homomorphisms and Optimality Guarantees

Let state-option pairs be $x = (s, o)$, forming the associated vector bundle $E$. An abstract MDP $\overline{\mathcal{M}}$ arises via bundle map $\phi$ and action map $g_x$, maintaining:

Reward invariance: $\bar{r}(\phi(x), g_x(a)) = r(x, a)$,
Transition equivariance: $\bar{P}(B \mid \phi(x), g_x(a)) = P(\phi^{-1}(B) \mid x, a)$ for every measurable set $B$ of abstract states.

Optimal Value Equivalence Theorem: If $h = (\phi, g_x)$ is a continuous HiT-MDP homomorphism,

$Q^*(x, a) = \bar{Q}^*(\phi(x), g_x(a))$

Policy Lifting asserts that, for an abstract policy $\bar{\pi}$, a lifted policy $\pi^{\uparrow}$ exists such that its pushforward under $g_x$ equals $\bar{\pi}$, ensuring

$Q^{\pi^{\uparrow}}(x, a) = \bar{Q}^{\bar{\pi}}(\phi(x), g_x(a))$

Hence, solving the abstract MDP via VMOC recovers optimality in the original, unabstracted MDP, with guarantees stemming from locally compact Hausdorff action–option space representation theory.
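The value-equivalence claim can be checked numerically on a small finite analogue. The sketch below is not the continuous construction from the paper: it builds a hypothetical 4-state MDP whose rewards and aggregated transitions match a 2-state abstract MDP under a state map and a state-dependent action map, then confirms that optimal Q-values agree after mapping. All names (`state_map`, `action_map`, `split`) and numbers are illustrative assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Tabular Q-value iteration. P has shape (S, A, S), R has shape (S, A)."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)          # greedy state values
        Q = R + gamma * (P @ V)    # Bellman optimality backup
    return Q

# Abstract MDP: 2 states, 2 actions (toy numbers, chosen arbitrarily).
P_bar = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])
R_bar = np.array([[0.0, 1.0],
                  [2.0, 0.5]])

# Original MDP: 4 states, 2 actions.
state_map = np.array([0, 0, 1, 1])                       # original -> abstract state
action_map = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])  # state-dependent relabeling
# How each original state spreads abstract mass inside a block (each block sums to 1).
split = np.array([[0.3, 0.7, 0.5, 0.5],
                  [0.6, 0.4, 0.1, 0.9],
                  [1.0, 0.0, 0.2, 0.8],
                  [0.5, 0.5, 0.5, 0.5]])

S, A = action_map.shape
R = np.zeros((S, A))
P = np.zeros((S, A, S))
for s in range(S):
    for a in range(A):
        a_bar = action_map[s, a]
        R[s, a] = R_bar[state_map[s], a_bar]   # reward invariance by construction
        P[s, a] = P_bar[state_map[s], a_bar, state_map] * split[s]

# Transition equivariance: mass sent into each block equals the abstract transition.
aggregated = np.stack(
    [P[:, :, state_map == b].sum(axis=2) for b in range(2)], axis=2)
expected = P_bar[state_map[:, None], action_map]
equivariant = np.allclose(aggregated, expected)

# Optimal value equivalence: Q*(s, a) equals abstract Q* at the mapped pair.
Q = value_iteration(P, R)
Q_bar = value_iteration(P_bar, R_bar)
values_match = np.allclose(Q, Q_bar[state_map[:, None], action_map])
```

Because the two invariance conditions hold at every backup, the agreement is exact at each iteration, not only in the limit, which is why planning in the smaller abstract model loses nothing in this toy case.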

4. The Off-Policy VMOC Algorithm and Training Procedure

VMOC operationalizes the ELBO maximization with an off-policy actor–critic architecture. Core components:

Component Type	Name/Notation	Role
Critic	$Q_A$, $Q_O$	$Q_A(s_t, o_t, a_t)$ and $Q_O(s_t, o_t)$
Policy	$\pi^A$, $\pi^O$	$\pi^A(a_t \mid s_t, o_t)$ and $\pi^O(o_t \mid s_t, o_{t-1})$
Temperature	$\alpha^A$, $\alpha^O$	Regulation of policy entropy

Training proceeds via environment rollouts and experience replay. Key steps:

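Collect transitions $(s_t, o_{t-1}, a_t, o_t, r_t, s_{t+1})$ into buffer $\mathcal{D}$.
Sample batches for updates.
Optimize critic targets, where $\bar{Q}_A$ and $\bar{Q}_O$ are target networks, $\alpha^A$ and $\alpha^O$ are the temperatures, $r_t = r(s_t, a_t)$, and $f_t = f(s_t, a_t, o_{t-1})$:

$y_A = r_t + f_t + \gamma\, \mathbb{E}_{o_{t+1} \sim \pi^O}\left[\bar{Q}_O(s_{t+1}, o_{t+1}) - \alpha^O \log \pi^O(o_{t+1}|s_{t+1},o_t)\right]$

$y_O = \mathbb{E}_{a_t \sim \pi^A}\left[\bar{Q}_A(s_t, o_t, a_t) - \alpha^A \log \pi^A(a_t|s_t,o_t)\right]$

Policy update via entropy-regularized gradients, minimizing:

$J(\pi^A) = \mathbb{E}_{(s_t, o_t) \sim \mathcal{D},\, a_t \sim \pi^A}\left[\alpha^A \log \pi^A(a_t|s_t,o_t) - Q_A(s_t, o_t, a_t)\right]$

$J(\pi^O) = \mathbb{E}_{(s_t, o_{t-1}) \sim \mathcal{D},\, o_t \sim \pi^O}\left[\alpha^O \log \pi^O(o_t|s_t,o_{t-1}) - Q_O(s_t, o_t)\right]$

Entropy temperature tuning, with target entropies $\bar{\mathcal{H}}^A$ and $\bar{\mathcal{H}}^O$:

$J(\alpha^A) = \mathbb{E}_{a_t \sim \pi^A}\left[-\alpha^A \left(\log \pi^A(a_t|s_t,o_t) + \bar{\mathcal{H}}^A\right)\right]$

$J(\alpha^O) = \mathbb{E}_{o_t \sim \pi^O}\left[-\alpha^O \left(\log \pi^O(o_t|s_t,o_{t-1}) + \bar{\mathcal{H}}^O\right)\right]$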

Adam optimizer and soft updates for target networks are used throughout.
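The three numeric ingredients of such a training loop (an entropy-regularized critic target, a soft target-network update, and a temperature loss) can be sketched in NumPy as below. This assumes soft-actor-critic-style forms with a discrete option set; the function names, argument layout, and the `done` mask are hypothetical, and the exact targets used by VMOC may differ.

```python
import numpy as np

def action_critic_target(reward, gamma, q_option_next, option_probs_next,
                         alpha_option, done):
    """Soft target for the action critic with an exact expectation over options.

    q_option_next     : (batch, num_options) target option values at s_{t+1}
    option_probs_next : (batch, num_options) option-policy probabilities at s_{t+1}
    """
    log_p = np.log(option_probs_next + 1e-12)
    # Expected value plus temperature-weighted entropy of the option policy.
    soft_value = np.sum(option_probs_next * (q_option_next - alpha_option * log_p),
                        axis=-1)
    return reward + gamma * (1.0 - done) * soft_value

def polyak_update(target_params, online_params, tau):
    """Soft update: move target parameters a fraction tau toward the online ones."""
    return (1.0 - tau) * target_params + tau * online_params

def temperature_loss(log_alpha, log_probs, target_entropy):
    """SAC-style temperature objective, averaged over a batch of log-probabilities."""
    return float(np.mean(-np.exp(log_alpha) * (log_probs + target_entropy)))

# Toy batch of size 1 with 4 options and a uniform option policy.
y = action_critic_target(
    reward=np.array([1.0]), gamma=0.99,
    q_option_next=np.array([[1.0, 2.0, 3.0, 4.0]]),
    option_probs_next=np.full((1, 4), 0.25),
    alpha_option=0.1, done=np.array([0.0]),
)
new_target = polyak_update(np.zeros(3), np.ones(3), tau=0.005)
loss = temperature_loss(np.log(0.1), np.array([-1.0, -2.0]), target_entropy=1.0)
```

Taking the expectation over options in closed form is possible only because the option set is small and discrete; for continuous actions a sampled estimate would be used instead.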

5. Core Update Equations and Hyper-Parameter Specifications

VMOC employs squared-error critic losses and policy gradients derived from the ELBO. Default hyper-parameters are as follows:

Parameter	Value	Description
Discount ($\gamma$)	0.99	Future reward discount
Critic-target smoothing coefficient	0.005	Polyak averaging parameter
Batch size	256	For replay buffer sampling
Replay buffer size	[missing]	Experience replay capacity
Learning rate (actor/critic/temp)	[missing]	Adam learning rates
Adam $\epsilon$	[missing]	Adam optimizer epsilon
Initial $\alpha$’s	0.1	Entropy regularization
Target entropies	[missing]	For actions, options
Latent options	4	Number of option embeddings
Option embedding dim	40	Embedding vector dimension

These settings support robust convergence and diversity in the learned option set.
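For experimentation, the legible defaults can be gathered in a small configuration object. The sketch below uses hypothetical field names; entries whose values are not legible in the table above (replay buffer size, learning rates, Adam epsilon) are deliberately left as `None` instead of being guessed.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass(frozen=True)
class VMOCConfig:
    """Hypothetical container for the defaults listed in the table."""
    discount: float = 0.99               # future reward discount
    target_smoothing: float = 0.005      # Polyak averaging parameter
    batch_size: int = 256                # replay sampling batch size
    initial_temperature: float = 0.1     # initial entropy temperatures
    num_options: int = 4                 # number of option embeddings
    option_embedding_dim: int = 40       # embedding vector dimension
    # Not legible in the table: must be supplied by the user.
    replay_buffer_size: Optional[int] = None
    learning_rate: Optional[float] = None
    adam_epsilon: Optional[float] = None

    def missing(self) -> List[str]:
        """Names of fields that still need a value before training."""
        return [name for name, value in asdict(self).items() if value is None]

config = VMOCConfig()
unset = config.missing()
```

Calling `config.missing()` before training makes it explicit which settings still have to be looked up in the original paper.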

6. Theoretical Guarantees and Optimality Preservation

Two main theoretical results underpin VMOC’s guarantee of global optimality:

Soft Option Policy Iteration (Theorem 3.2): Under exact tabular inference, iterated policy evaluation (E-step) and policy improvement (M-step) over the intra-option and option policies $(\pi^A, \pi^O)$ converge to the unique ELBO maximizer, aligned with the true optimal soft option policies $(\pi^{A*}, \pi^{O*})$.
Continuous HiT-MDP Homomorphism (Theorems 5.1 & 5.4): If $h = (\phi, g_x)$, with state map $\phi$ on state-option pairs $x = (s, o)$ and action map $g_x$, defines a homomorphic abstraction, then $Q^*(x, a) = \bar{Q}^*(\phi(x), g_x(a))$, and any lifted policy $\pi^{\uparrow}$ of the abstract optimum achieves identical value in the original MDP.

Together, these results imply that VMOC’s variational inference maximizes a principled lower bound of the control objective, while the abstract optimal policy discovered in latent option space lifts to a true optimum in the original process, with no loss of optimality.
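The policy-improvement half of soft policy iteration has a simple closed form in the tabular case: for fixed soft Q-values, the softmax policy at temperature alpha maximizes expected value plus alpha times entropy. The sketch below checks this numerically for a single state with four choices; it is a generic illustration of the entropy-regularized improvement step, not a reproduction of the proof of Theorem 3.2, and the Q-values are arbitrary.

```python
import numpy as np

def soft_improvement(q_values, alpha):
    """Softmax policy proportional to exp(Q / alpha), computed stably."""
    z = (q_values - q_values.max()) / alpha
    p = np.exp(z)
    return p / p.sum()

def soft_objective(policy, q_values, alpha):
    """Expected Q-value plus alpha times the policy entropy."""
    entropy = -np.sum(policy * np.log(policy + 1e-12))
    return float(policy @ q_values + alpha * entropy)

q = np.array([1.0, 0.5, -0.2, 0.0])   # arbitrary soft Q-values for one state
alpha = 0.1
improved = soft_improvement(q, alpha)
best = soft_objective(improved, q, alpha)

# No randomly drawn policy should beat the softmax policy.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(4), size=1000)
never_beaten = all(soft_objective(p, q, alpha) <= best + 1e-9 for p in candidates)

# The optimum equals the soft value alpha * log sum exp(Q / alpha).
closed_form = alpha * np.log(np.sum(np.exp(q / alpha)))
```

The closed-form optimum is the same log-sum-exp soft value that appears in the evaluation step, which is what ties the two halves of the iteration together.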

7. Applications and Empirical Validation

VMOC’s approach has been validated on both complex logical reasoning benchmarks and challenging locomotion domains. The learned options serve as latent “reasoning steps” in language settings and as abstract skills in control environments, offering an efficient, implicit alternative to explicit step-by-step reasoning such as Chain-of-Thought prompting. The framework’s combination of temporal abstraction, diversity regularization, and strong theoretical grounding supports its deployment for scalable, robust skill acquisition across modalities (Li et al., 22 Jul 2025).

References

Li et al. (2025). Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs.
