
Markov Decision Process & Unawareness

Updated 9 July 2025
  • Markov Decision Processes (MDPs) are models for sequential decision-making under uncertainty, now extended to include unawareness of actions and states.
  • The framework introduces a dedicated explore action that helps agents discover hidden actions while balancing exploitation and exploration.
  • Algorithmic adaptations, such as a modified R-MAX approach, enable learning near-optimal policies when the discovery process is suitably efficient.

Markov decision processes (MDPs) are foundational models for sequential decision-making under uncertainty, widely employed in fields as diverse as robotics, automated control, economics, and artificial intelligence. Classical MDP formulations typically presume that the decision maker (DM) possesses complete knowledge of all possible states and actions. In many realistic circumstances, however, the DM may be unaware of some of the available actions and states. The framework of MDPs with Unawareness (MDPUs) introduces a precise mathematical model for such settings, characterizes the conditions under which near-optimal policies can be learned efficiently, and develops algorithms that learn near-optimal policies whenever this is possible (1006.2204).

1. Unawareness and Limitations of the Traditional MDP Framework

Classical MDPs are defined by the tuple $(S, A, P, R)$, with:

  • $S$: set of all possible states,
  • $A$: set of all actions,
  • $P$: transition probabilities,
  • $R$: reward function.

A fundamental assumption is that the DM knows $S$ and $A$ entirely. This strong assumption fails in practical scenarios:

  • Robotics: The set of motor primitives or control actions may not be fully specified at design time.
  • Games and Exploration: A player or agent may gradually discover more effective actions as experience accumulates.
  • Finance and Automated Planning: Complex decisions may involve "hidden" actions that are only revealed through deliberate exploration.

MDPUs address these limitations by explicitly modeling the DM's incomplete awareness of the available actions.

2. Mathematical Definition of MDPs with Unawareness

An MDPU is represented as the tuple:

$$M = (S,\ A,\ S_0,\ a_0,\ g_A,\ g_0,\ P,\ D,\ R,\ R^+,\ R^-)$$

with:

  • $S$: The full set of underlying (objective) states.
  • $A$: The complete set of actions in the system.
  • $S_0 \subseteq S$: The set of states initially known to the DM.
  • $a_0$: A special "explore" action, distinct from the elements of $A$, always available.
  • $g_A: S \rightarrow 2^A$: Assignment giving the true available action set at each $s \in S$.
  • $g_0: S_0 \rightarrow 2^A$: The set of actions (besides $a_0$) known to the DM at each $s \in S_0$; by convention, the DM is always aware of $a_0$.
  • $P(s, s', a)$: Transition probabilities for $s \in S$, $a \in g_A(s)$; for $a_0$, $P(s, s, a_0) = 1$.
  • $D(j, t, s)$: Discovery probability, the chance that, when $j$ actions remain to be found at state $s$, the $t$-th play of $a_0$ reveals a new action. In much of the analysis, $D$ is state-independent: $D(j, t)$.
  • $R, R^+, R^-$: Reward functions. $R$ is standard, assigned for all $a \in g_A(s)$; $R^+(s)$ is awarded when $a_0$ discovers a new action, and $R^-(s)$ when $a_0$ reveals nothing new.

This formalism embeds both the DM's limited initial awareness and a precise stochastic process governing action discovery.
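To make the formalism concrete, the tuple can be written down directly as a data structure. The following is a minimal Python sketch, with states and actions encoded as integers for simplicity; the class and field names (`MDPU`, `g_A`, `EXPLORE`, and so on) are assumptions chosen for this illustration rather than notation taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

EXPLORE = -1  # illustrative encoding of the special explore action a0


@dataclass
class MDPU:
    """Minimal container for an MDP with Unawareness (illustrative field names)."""
    states: Set[int]                             # S: all underlying states
    actions: Set[int]                            # A: all actions in the system
    known_states: Set[int]                       # S0 ⊆ S: states the DM is initially aware of
    g_A: Dict[int, Set[int]]                     # true available actions at each state
    g_0: Dict[int, Set[int]]                     # actions (besides a0) initially known at each s in S0
    P: Dict[Tuple[int, int], Dict[int, float]]   # P[(s, a)][s'] = transition probability
    D: Callable[[int, int], float]               # D(j, t): state-independent discovery probability
    R: Dict[Tuple[int, int], float]              # reward for playing a in g_A(s)
    R_plus: Dict[int, float]                     # reward when a0 discovers a new action at s
    R_minus: Dict[int, float]                    # reward when a0 discovers nothing new at s

    def transition(self, s: int, a: int) -> Dict[int, float]:
        """The explore action a0 keeps the DM in place with probability 1."""
        if a == EXPLORE:
            return {s: 1.0}
        return self.P[(s, a)]
```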

3. The Role of Exploration and the Discovery Process

A key innovation of MDPUs is the introduction of a system-level explore action ($a_0$), enabling the DM to search for previously unknown actions at each state. When $a_0$ is played:

  • The DM remains at the same state ($P(s, s, a_0) = 1$).
  • With probability $D(j, t)$ (where $j$ is the number of undiscovered actions at $s$), a new action is discovered.
  • Upon discovery, the new action is added to the DM's set of available actions at the current state.

This mechanism allows the agent to actively manage the tradeoff between:

  • Exploitation: Utilizing known actions for immediate reward.
  • Exploration: Investing time into discovering new, possibly superior, actions.

Crucially, the probability function $D(j, t)$ formalizes the difficulty or ease of discovering actions and directly governs the sample complexity and feasibility of learning in MDPUs.
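To illustrate these discovery dynamics, the short simulation below repeatedly plays the explore action at a single state and records the plays on which hidden actions surface. The discovery model $D(j, t) = j/(j + t)$ used here is an arbitrary assumption made for the example, not a form prescribed by the paper.

```python
import random


def simulate_exploration(num_hidden: int, max_plays: int, seed: int = 0):
    """Play a0 repeatedly at one state; return the play indices at which new actions appear.

    Assumed discovery model for illustration: D(j, t) = j / (j + t), where j is the
    number of still-undiscovered actions and t counts plays of a0 at this state.
    """
    rng = random.Random(seed)
    remaining = num_hidden
    discoveries = []
    for t in range(1, max_plays + 1):
        if remaining == 0:
            break
        if rng.random() < remaining / (remaining + t):  # D(j, t): a new action is revealed (reward R+)
            remaining -= 1
            discoveries.append(t)
        # otherwise nothing new is found on this play (reward R-)
    return discoveries


print(simulate_exploration(num_hidden=3, max_plays=200))
```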

4. Learning Near-Optimal Policies: Algorithmic Foundations

The primary algorithmic framework for MDPUs adapts the classical R-MAX algorithm, integrating it with mechanisms for learning both the transition model and the action set.

Key parameters:

  • $K_1(T)$: The number of samples required for each known $(s, a)$ pair ($a \neq a_0$) so that transition probabilities and rewards are estimated with high confidence.

$$K_1(T) = \max\left( \left\lceil \left( \frac{4NTR_{\max}}{\epsilon} \right)^3 \right\rceil,\ \left\lceil 8 \ln^3 \left( \frac{8Nk}{\delta}\right) \right\rceil \right) + 1$$

where $N$, $k$, $R_{\max}$, $T$, $\epsilon$, and $\delta$ are, respectively, estimates of the state cardinality, action cardinality, maximum reward, mixing time, target accuracy, and failure probability.

  • $K_0$: The number of plays of $a_0$ required in a state to guarantee (with high probability) discovery of an undiscovered action, if one exists:

$$K_0 = \min\left\{ M :\ \sum_{t=1}^{M} D(1, t) \ge \ln\left( \frac{4N}{\delta} \right) \right\}$$
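Both quantities translate directly into small computations. The sketch below evaluates $K_1(T)$ from the stated formula and finds $K_0$ by accumulating $D(1, t)$ until the threshold $\ln(4N/\delta)$ is reached; the parameter values and the discovery function $D(1, t) = 1/t$ at the end are illustrative assumptions, not values from the paper.

```python
import math


def K1(N: int, k: int, R_max: float, T: int, eps: float, delta: float) -> int:
    """Samples required per known (s, a) pair, following the formula in the text."""
    first = math.ceil((4 * N * T * R_max / eps) ** 3)
    second = math.ceil(8 * math.log(8 * N * k / delta) ** 3)
    return max(first, second) + 1


def K0(D, N: int, delta: float, cap: int = 10**7):
    """Smallest M with sum_{t=1}^M D(1, t) >= ln(4N / delta); None if not reached within `cap` plays."""
    threshold = math.log(4 * N / delta)
    total = 0.0
    for M in range(1, cap + 1):
        total += D(1, M)
        if total >= threshold:
            return M
    return None


# Illustrative parameter values and an assumed discovery function D(1, t) = 1 / t.
print(K1(N=10, k=5, R_max=1.0, T=20, eps=0.1, delta=0.05))  # samples per known (s, a) pair
print(K0(lambda j, t: 1.0 / t, N=10, delta=0.05))           # plays of a0 needed per state
```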

Algorithm outline:

  1. Initialize: Set the known states/actions to $S_0$, $g_0$.
  2. Exploration: For each $s$, play $a_0$ up to $K_0$ times. Each time a new action is discovered, add it to $g_0(s)$.
  3. Model estimation: For each known $(s, a)$, gather $K_1(T)$ samples to estimate $P$ and $R$.
  4. Planning: Construct an empirical MDP with the currently known states/actions. Compute a (near-)optimal policy with respect to this sub-MDP.
  5. Recovery and restart: If observed states, actions, or rewards exceed current guesses, increment parameters and restart.
  6. Termination and guarantees: If the sum $\sum_t D(1, t)$ diverges and does so "sufficiently fast" (e.g., growing at least logarithmically in $T$), near-optimal policies can be learned in time polynomial in $N$, $k$, $T$, $1/\epsilon$, and $1/\delta$. If not, no general guarantee is possible.
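The outline above can be condensed into a control-flow skeleton. This is a simplified sketch under assumed, illustrative interfaces: `env.explore(s)` plays $a_0$ once and returns a newly discovered action or `None`, `env.step(s, a)` returns a successor state and reward, and `plan_near_optimal` stands in for the planning routine; `K0` and `K1` are the precomputed counts from the formulas above. The restart machinery of the full algorithm is only hinted at.

```python
def learn_mdpu(env, S0, g0, K0, K1, plan_near_optimal):
    """Skeleton of the modified R-MAX-style procedure outlined above (illustrative interfaces)."""
    known_states = set(S0)
    known_actions = {s: set(g0[s]) for s in S0}
    counts, transitions, rewards = {}, {}, {}

    # Steps 1-2: exploration -- play a0 up to K0 times in each known state.
    for s in list(known_states):
        for _ in range(K0):
            new_action = env.explore(s)          # may return None (nothing discovered)
            if new_action is not None:
                known_actions[s].add(new_action)

    # Step 3: model estimation -- sample each known (s, a) pair K1 times.
    for s in known_states:
        for a in known_actions[s]:
            for _ in range(K1):
                s_next, r = env.step(s, a)
                counts[(s, a)] = counts.get((s, a), 0) + 1
                transitions.setdefault((s, a), {}).setdefault(s_next, 0)
                transitions[(s, a)][s_next] += 1
                rewards[(s, a)] = rewards.get((s, a), 0.0) + r
                if s_next not in known_states:
                    # Step 5: recovery -- a state outside the current guess was observed;
                    # in the full algorithm, parameters are incremented and the whole
                    # procedure restarts.
                    return None

    # Step 4: planning on the empirical sub-MDP built from the gathered samples.
    return plan_near_optimal(known_states, known_actions, counts, transitions, rewards)
```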

5. Behavioral and Computational Implications

The MDPU framework yields a rich set of behavioral consequences and computational guarantees:

  • Exploration-Exploitation-Discovery Tradeoff: DMs must not only balance immediate reward (exploitation) and information-gathering (exploration), but also decide how to allocate effort towards uncovering new capabilities (discovery).
  • Consequences of Slow Discovery: If $D(1, t)$ decays too quickly (for example, $D(1, t) = 1/(t+1)^2$), the probability of discovering a missing action remains bounded away from $1$ even after infinite exploration, making optimal or near-optimal play unattainable.
  • Sample Complexity: If $\sum_t D(1, t) = \infty$ and, for some $m_1, m_2 > 0$, $\sum_{t=1}^T D(1, t) \geq m_1 \ln(T) + m_2$, then $K_0$ scales only logarithmically (hence, polynomially overall), making efficient learning feasible.
  • Restart Strategy: Because the DM does not know $|S|$, $|A|$, or $R_{\max}$ a priori, the algorithm incrementally increases these parameters and restarts if violations are observed (e.g., more actions discovered than allowed by the current guess).
  • Stopping Rule: Once all actions are likely discovered (with probability at least $1 - \delta$), exploration with $a_0$ can be curtailed, and further resources shifted towards refining estimates for optimal control with the known action set.
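The contrast between slow and fast discovery drawn in the bullets above can be checked numerically: for $D(1, t) = 1/(t+1)^2$ the partial sums converge (to $\pi^2/6 - 1 \approx 0.64$), so by a union bound the probability of ever discovering the missing action stays below that value, whereas for the illustrative choice $D(1, t) = 1/t$ the partial sums grow like $\ln T$ and eventually exceed any threshold. The snippet below is a small demonstration of that difference.

```python
def partial_sum(D, T: int) -> float:
    """Partial sum of discovery probabilities, sum_{t=1}^{T} D(1, t)."""
    return sum(D(1, t) for t in range(1, T + 1))


fast_decay = lambda j, t: 1.0 / (t + 1) ** 2   # converges: discovery never becomes certain
slow_decay = lambda j, t: 1.0 / t              # diverges like ln T: efficient learning possible

for T in (10**2, 10**4, 10**6):
    print(T, round(partial_sum(fast_decay, T), 4), round(partial_sum(slow_decay, T), 4))
```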

6. Applications and Broader Relevance

The MDPU model and algorithmic approach provide a rigorous basis for systems where full a priori specification of the action set is not possible:

  • Robotics: Where action discovery equates to learning new movement primitives or control strategies beyond what is hard-coded.
  • Game Playing and Video Games: Modeling how human or AI agents uncover and exploit "hidden moves" or strategies during play.
  • Mathematical Reasoning: Formalizing the exploration of new proof techniques or problem-solving heuristics.
  • Large or Abstract Action Spaces: Providing a viable learning protocol when the full enumeration of actions is computationally intractable or conceptually undefined.

This framework generalizes classical MDPs by reflecting situations where actively exploring the unknown is essential to optimal decision making, fundamentally altering the landscape of learning and planning algorithms.

7. Summary Table: Key Innovations of the MDPU Framework

| Feature | Standard MDP | MDPU |
|---|---|---|
| Known actions | Complete | Partial, subject to discovery |
| Explore action ($a_0$) | Absent | Explicitly modeled |
| Discovery mechanism | N/A | Stochastic, via $D(j, t)$ |
| Sample complexity (learning) | Polynomial | Polynomial or superpolynomial, depending on $D$ |
| Algorithmic adaptation | Standard RL | Modified R-MAX (explore, adjust, restart) |
| Can guarantee optimality? | Yes, in principle | Only if discovery is not "too hard" |

In conclusion, MDPUs represent a significant extension of the classical Markov decision process paradigm, supplying a formal treatment of unawareness, new algorithmic constructions, and precise conditions under which learning is feasible and efficient. The framework is particularly impactful for real-world applications where the set of relevant actions cannot be assumed to be known at the outset, and where strategic exploration is essential for competent behavior (1006.2204).

References

1. MDPs with Unawareness (arXiv:1006.2204).