
Markov Decision Process & Unawareness

Updated 9 July 2025
  • Markov Decision Processes (MDPs) are models for sequential decision-making under uncertainty, now extended to include unawareness of actions and states.
  • The framework introduces a dedicated explore action that helps agents discover hidden actions while balancing exploitation and exploration.
  • Algorithmic adaptations, such as a modified R-MAX approach, enable learning near-optimal policies when the discovery process is suitably efficient.

Markov decision processes (MDPs) are foundational models for sequential decision-making under uncertainty, widely employed in fields as diverse as robotics, automated control, economics, and artificial intelligence. Classical MDP formulations typically presume that the decision maker (DM) possesses complete knowledge of all possible states and actions. In many realistic circumstances, however, the DM may be unaware of some of the available actions and states. The framework of MDPs with Unawareness (MDPUs) introduces a precise mathematical model for such settings, characterizes the conditions under which near-optimal policies can be learned efficiently, and develops algorithms that learn near-optimal policies whenever this is possible (1006.2204).

1. Unawareness and Limitations of the Traditional MDP Framework

Classical MDPs are defined by the tuple $(S, A, P, R)$, with:

  • $S$: set of all possible states,
  • $A$: set of all actions,
  • $P$: transition probabilities,
  • $R$: reward function.

A fundamental assumption is that the DM knows $S$ and $A$ entirely. This strong assumption fails in practical scenarios:

  • Robotics: The set of motor primitives or control actions may not be fully specified at design time.
  • Games and Exploration: A player or agent may gradually discover more effective actions as experience accumulates.
  • Finance and Automated Planning: Complex decisions may involve "hidden" actions that are only revealed through deliberate exploration.

MDPUs address these limitations by explicitly modeling the DM's incomplete awareness of the available actions.

2. Mathematical Definition of MDPs with Unawareness

An MDPU is represented as the tuple:

$$M = (S,\ A,\ S_0,\ a_0,\ g_A,\ g_0,\ P,\ D,\ R,\ R^+,\ R^-)$$

with:

  • $S$: The full set of underlying (objective) states.
  • $A$: The complete set of actions in the system.
  • $S_0 \subseteq S$: The set of states initially known to the DM.
  • $a_0$: A special "explore" action, distinct from the elements of $A$, always available.
  • $g_A: S \rightarrow 2^A$: Assignment giving the true available action set at each $s \in S$.
  • $g_0: S_0 \rightarrow 2^A$: The set of actions (besides $a_0$) known to the DM at each $s \in S_0$; by convention, the DM is always aware of $a_0$.
  • $P(s, s', a)$: Transition probabilities for $s \in S$, $a \in g_A(s)$; for $a_0$, $P(s, s, a_0) = 1$.
  • $D(j, t, s)$: Discovery probability, the chance that, when $j$ actions remain to be found at state $s$, the $t$-th play of $a_0$ reveals a new action. In much of the analysis, $D$ is state-independent: $D(j, t)$.
  • $R, R^+, R^-$: Reward functions. $R$ is standard, assigned for all $a \in g_A(s)$; $R^+(s)$ is awarded when $a_0$ discovers a new action, and $R^-(s)$ when $a_0$ reveals nothing new.

This formalism embeds both the DM's limited initial awareness and a precise stochastic process governing action discovery.
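To make the formalism concrete, the tuple can be written down directly as a data structure. The following is a minimal Python sketch, with states and actions encoded as integers for simplicity; the class and field names (`MDPU`, `g_A`, `EXPLORE`, and so on) are assumptions chosen for this illustration rather than notation taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

EXPLORE = -1  # illustrative encoding of the special explore action a0


@dataclass
class MDPU:
    """Minimal container for an MDP with Unawareness (illustrative field names)."""
    states: Set[int]                             # S: all underlying states
    actions: Set[int]                            # A: all actions in the system
    known_states: Set[int]                       # S0 ⊆ S: states the DM is initially aware of
    g_A: Dict[int, Set[int]]                     # true available actions at each state
    g_0: Dict[int, Set[int]]                     # actions (besides a0) initially known at each s in S0
    P: Dict[Tuple[int, int], Dict[int, float]]   # P[(s, a)][s'] = transition probability
    D: Callable[[int, int], float]               # D(j, t): state-independent discovery probability
    R: Dict[Tuple[int, int], float]              # reward for playing a in g_A(s)
    R_plus: Dict[int, float]                     # reward when a0 discovers a new action at s
    R_minus: Dict[int, float]                    # reward when a0 discovers nothing new at s

    def transition(self, s: int, a: int) -> Dict[int, float]:
        """The explore action a0 keeps the DM in place with probability 1."""
        if a == EXPLORE:
            return {s: 1.0}
        return self.P[(s, a)]
```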

3. The Role of Exploration and the Discovery Process

A key innovation of MDPUs is the introduction of a system-level explore action ($a_0$), enabling the DM to search for previously unknown actions at each state. When $a_0$ is played:

  • The DM remains at the same state ($P(s, s, a_0) = 1$).
  • With probability $D(j, t)$ (where $j$ is the number of undiscovered actions at $s$), a new action is discovered.
  • Upon discovery, the new action is added to the DM's set of available actions at the current state.

This mechanism allows the agent to actively manage the tradeoff between:

  • Exploitation: Utilizing known actions for immediate reward.
  • Exploration: Investing time into discovering new, possibly superior, actions.

Crucially, the probability function $D(j, t)$ formalizes the difficulty or ease of discovering actions and directly governs the sample complexity and feasibility of learning in MDPUs.
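To illustrate these discovery dynamics, the short simulation below repeatedly plays the explore action at a single state and records the plays on which hidden actions surface. The discovery model $D(j, t) = j/(j + t)$ used here is an arbitrary assumption made for the example, not a form prescribed by the paper.

```python
import random


def simulate_exploration(num_hidden: int, max_plays: int, seed: int = 0):
    """Play a0 repeatedly at one state; return the play indices at which new actions appear.

    Assumed discovery model for illustration: D(j, t) = j / (j + t), where j is the
    number of still-undiscovered actions and t counts plays of a0 at this state.
    """
    rng = random.Random(seed)
    remaining = num_hidden
    discoveries = []
    for t in range(1, max_plays + 1):
        if remaining == 0:
            break
        if rng.random() < remaining / (remaining + t):  # D(j, t): a new action is revealed (reward R+)
            remaining -= 1
            discoveries.append(t)
        # otherwise nothing new is found on this play (reward R-)
    return discoveries


print(simulate_exploration(num_hidden=3, max_plays=200))
```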

4. Learning Near-Optimal Policies: Algorithmic Foundations

The primary algorithmic framework for MDPUs adapts the classical R-MAX algorithm, integrating it with mechanisms for learning both the transition model and the action set.

Key parameters:

  • $K_1(T)$: The number of samples required for each known $(s, a)$ pair ($a \neq a_0$) so that transition probabilities and rewards are estimated with high confidence.

$$K_1(T) = \max\left( \left\lceil \left( \frac{4NTR_{\max}}{\epsilon} \right)^3 \right\rceil,\ \left\lceil 8 \ln^3 \left( \frac{8Nk}{\delta}\right) \right\rceil \right) + 1$$

where $N$, $k$, $R_{\max}$, $T$, $\epsilon$, and $\delta$ are, respectively, estimates of the state cardinality, action cardinality, maximum reward, mixing time, target accuracy, and failure probability.

  • $K_0$: The number of plays of $a_0$ required in a state to guarantee (with high probability) discovery of an undiscovered action, if one exists:

$$K_0 = \min\left\{ M :\ \sum_{t=1}^{M} D(1, t) \ge \ln\left( \frac{4N}{\delta} \right) \right\}$$
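Both quantities translate directly into small computations. The sketch below evaluates $K_1(T)$ from the stated formula and finds $K_0$ by accumulating $D(1, t)$ until the threshold $\ln(4N/\delta)$ is reached; the parameter values and the discovery function $D(1, t) = 1/t$ at the end are illustrative assumptions, not values from the paper.

```python
import math


def K1(N: int, k: int, R_max: float, T: int, eps: float, delta: float) -> int:
    """Samples required per known (s, a) pair, following the formula in the text."""
    first = math.ceil((4 * N * T * R_max / eps) ** 3)
    second = math.ceil(8 * math.log(8 * N * k / delta) ** 3)
    return max(first, second) + 1


def K0(D, N: int, delta: float, cap: int = 10**7):
    """Smallest M with sum_{t=1}^M D(1, t) >= ln(4N / delta); None if not reached within `cap` plays."""
    threshold = math.log(4 * N / delta)
    total = 0.0
    for M in range(1, cap + 1):
        total += D(1, M)
        if total >= threshold:
            return M
    return None


# Illustrative parameter values and an assumed discovery function D(1, t) = 1 / t.
print(K1(N=10, k=5, R_max=1.0, T=20, eps=0.1, delta=0.05))  # samples per known (s, a) pair
print(K0(lambda j, t: 1.0 / t, N=10, delta=0.05))           # plays of a0 needed per state
```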

Algorithm outline:

  1. Initialize: Set the known states/actions to $S_0$, $g_0$.
  2. Exploration: For each $s$, play $a_0$ up to $K_0$ times. Each time a new action is discovered, add it to $g_0(s)$.
  3. Model estimation: For each known $(s, a)$, gather $K_1(T)$ samples to estimate $P$ and $R$.
  4. Planning: Construct an empirical MDP with the currently known states/actions. Compute a (near-)optimal policy with respect to this sub-MDP.
  5. Recovery and restart: If observed states, actions, or rewards exceed current guesses, increment parameters and restart.
  6. Termination and guarantees: If the sum $\sum_t D(1, t)$ diverges and does so "sufficiently fast" (e.g., growing at least logarithmically in $T$), near-optimal policies can be learned in time polynomial in $N$, $k$, $T$, $1/\epsilon$, and $1/\delta$. If not, no general guarantee is possible.
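The outline above can be condensed into a control-flow skeleton. This is a simplified sketch under assumed, illustrative interfaces: `env.explore(s)` plays $a_0$ once and returns a newly discovered action or `None`, `env.step(s, a)` returns a successor state and reward, and `plan_near_optimal` stands in for the planning routine; `K0` and `K1` are the precomputed counts from the formulas above. The restart machinery of the full algorithm is only hinted at.

```python
def learn_mdpu(env, S0, g0, K0, K1, plan_near_optimal):
    """Skeleton of the modified R-MAX-style procedure outlined above (illustrative interfaces)."""
    known_states = set(S0)
    known_actions = {s: set(g0[s]) for s in S0}
    counts, transitions, rewards = {}, {}, {}

    # Steps 1-2: exploration -- play a0 up to K0 times in each known state.
    for s in list(known_states):
        for _ in range(K0):
            new_action = env.explore(s)          # may return None (nothing discovered)
            if new_action is not None:
                known_actions[s].add(new_action)

    # Step 3: model estimation -- sample each known (s, a) pair K1 times.
    for s in known_states:
        for a in known_actions[s]:
            for _ in range(K1):
                s_next, r = env.step(s, a)
                counts[(s, a)] = counts.get((s, a), 0) + 1
                transitions.setdefault((s, a), {}).setdefault(s_next, 0)
                transitions[(s, a)][s_next] += 1
                rewards[(s, a)] = rewards.get((s, a), 0.0) + r
                if s_next not in known_states:
                    # Step 5: recovery -- a state outside the current guess was observed;
                    # in the full algorithm, parameters are incremented and the whole
                    # procedure restarts.
                    return None

    # Step 4: planning on the empirical sub-MDP built from the gathered samples.
    return plan_near_optimal(known_states, known_actions, counts, transitions, rewards)
```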

5. Behavioral and Computational Implications

The MDPU framework yields a rich set of behavioral consequences and computational guarantees:

  • Exploration-Exploitation-Discovery Tradeoff: DMs must not only balance immediate reward (exploitation) and information-gathering (exploration), but also decide how to allocate effort towards uncovering new capabilities (discovery).
  • Consequences of Slow Discovery: If $D(1, t)$ decays too quickly (for example, $D(1, t) = 1/(t+1)^2$), the probability of discovering a missing action remains bounded away from $1$ even after infinite exploration, making optimal or near-optimal play unattainable.
  • Sample Complexity: If $\sum_t D(1, t) = \infty$ and, for some $m_1, m_2 > 0$, $\sum_{t=1}^T D(1, t) \geq m_1 \ln(T) + m_2$, then $K_0$ scales only logarithmically (hence, polynomially overall), making efficient learning feasible.
  • Restart Strategy: Because the DM does not know $|S|$, $|A|$, or $R_{\max}$ a priori, the algorithm incrementally increases these parameters and restarts if violations are observed (e.g., more actions discovered than allowed by the current guess).
  • Stopping Rule: Once all actions are likely discovered (with probability at least $1 - \delta$), exploration with $a_0$ can be curtailed, and further resources shifted towards refining estimates for optimal control with the known action set.
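The contrast between slow and fast discovery drawn in the bullets above can be checked numerically: for $D(1, t) = 1/(t+1)^2$ the partial sums converge (to $\pi^2/6 - 1 \approx 0.64$), so by a union bound the probability of ever discovering the missing action stays below that value, whereas for the illustrative choice $D(1, t) = 1/t$ the partial sums grow like $\ln T$ and eventually exceed any threshold. The snippet below is a small demonstration of that difference.

```python
def partial_sum(D, T: int) -> float:
    """Partial sum of discovery probabilities, sum_{t=1}^{T} D(1, t)."""
    return sum(D(1, t) for t in range(1, T + 1))


fast_decay = lambda j, t: 1.0 / (t + 1) ** 2   # converges: discovery never becomes certain
slow_decay = lambda j, t: 1.0 / t              # diverges like ln T: efficient learning possible

for T in (10**2, 10**4, 10**6):
    print(T, round(partial_sum(fast_decay, T), 4), round(partial_sum(slow_decay, T), 4))
```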

6. Applications and Broader Relevance

The MDPU model and algorithmic approach provide a rigorous basis for systems where full a priori specification of the action set is not possible:

  • Robotics: Where action discovery equates to learning new movement primitives or control strategies beyond what is hard-coded.
  • Game Playing and Video Games: Modeling how human or AI agents uncover and exploit "hidden moves" or strategies during play.
  • Mathematical Reasoning: Formalizing the exploration of new proof techniques or problem-solving heuristics.
  • Large or Abstract Action Spaces: Providing a viable learning protocol when the full enumeration of actions is computationally intractable or conceptually undefined.

This framework generalizes classical MDPs by reflecting situations where actively exploring the unknown is essential to optimal decision making, fundamentally altering the landscape of learning and planning algorithms.

7. Summary Table: Key Innovations of the MDPU Framework

| Feature | Standard MDP | MDPU |
|---|---|---|
| Known actions | Complete | Partial, subject to discovery |
| Explore action ($a_0$) | Absent | Explicitly modeled |
| Discovery mechanism | N/A | Stochastic, via $D(j, t)$ |
| Sample complexity (learning) | Polynomial | Polynomial or superpolynomial, depending on $D$ |
| Algorithmic adaptation | Standard RL | Modified R-MAX (explore, adjust, restart) |
| Can guarantee optimality? | Yes, in principle | Only if discovery is not "too hard" |

In conclusion, MDPUs represent a significant extension of the classical Markov decision process paradigm, supplying a formal treatment of unawareness, new algorithmic constructions, and precise conditions under which learning is feasible and efficient. The framework is particularly impactful for real-world applications where the set of relevant actions cannot be assumed to be known at the outset, and where strategic exploration is essential for competent behavior (1006.2204).

References

1. MDPs with Unawareness (arXiv:1006.2204).