
Optimistic Initial Model in Factored MDPs

Updated 29 April 2026
  • OIM is a reinforcement learning technique that initializes models with inflated reward estimates to encourage thorough exploration in structured state spaces.
  • FOIM leverages factored value iteration and per-factor count updates to achieve near-optimal policies with provable polynomial-time and sample efficiency guarantees.
  • The method ensures that, except for a polynomial number of steps, the agent's actions remain $\epsilon$-close in value to the policy of the approximate value iteration fixed point, as established by rigorous theoretical bounds.

An Optimistic Initial Model (OIM) is a technique for reinforcement learning (RL) in which the agent’s empirical model is initialized with highly optimistic estimates, systematically biasing initial policy selection toward actions that appear maximally rewarding. In the context of factored Markov Decision Processes (FMDPs)—MDPs with structured state representations—FOIM (Factored Optimistic Initial Model) provides the first algorithm that is both “purely greedy” and guarantees polynomial time learning with respect to the fixed point of approximate value iteration (AVI), as established in (0904.3352).

1. Optimistic Initialization in Factored MDPs

FOIM initializes transition and reward factors to embody maximal optimism. For each state variable $X_i$, a fictitious “Garden-of-Eden” (GOE) substate $x_E$ is introduced, expanding each domain to $X_i \cup \{x_E\}$. Empirically, for each local context $z = x[\Gamma_i]$, action $a$, and possible next variable value $y_i$, two counts are maintained: $N_i^{\mathrm{vis}}(z,a)$ (context-action visit count) and $N_i(z,a,y_i)$ (transitions to $y_i$).

Initialization is as follows:

$$N_i^{\mathrm{vis}}(z, a) = 1, \qquad N_i(z, a, x_E) = 1, \qquad N_i(z, a, y_i) = 0 \;\;\text{for all } y_i \neq x_E.$$

This yields empirical transition factors with $\hat{P}_i(x_E \mid z, a) = 1$ initially, so every action-context pair appears to deterministically yield the GOE state. Optimism is similarly infused into rewards: an extra factor $r_E$ with large constant reward $R_E$ for $x_i = x_E$ is added,

$$r_E(x_i) = \begin{cases} R_E & \text{if } x_i = x_E, \\ 0 & \text{otherwise.} \end{cases}$$

Hence, all unexplored regions promise maximal possible future payoff, incentivizing thorough exploration.
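
A minimal sketch of this initialization in Python/NumPy follows; the function names and array layout are illustrative conventions, not the paper's notation:

```python
import numpy as np

def init_optimistic_factor(n_contexts, n_actions, n_values):
    """Optimistic counts for one transition factor P_i(y_i | z, a).

    The last value index plays the role of the fictitious Garden-of-Eden
    substate x_E, so the extended domain has n_values + 1 entries.
    """
    n_vis = np.ones((n_contexts, n_actions))              # N_i^vis(z, a) = 1
    n_trans = np.zeros((n_contexts, n_actions, n_values + 1))
    n_trans[:, :, -1] = 1.0                               # N_i(z, a, x_E) = 1
    return n_vis, n_trans

def empirical_factor(n_vis, n_trans):
    """P_hat_i(y_i | z, a) = N_i(z, a, y_i) / N_i^vis(z, a)."""
    return n_trans / n_vis[:, :, None]

# Before any real experience, all probability mass sits on the GOE substate:
n_vis, n_trans = init_optimistic_factor(n_contexts=4, n_actions=2, n_values=3)
assert np.allclose(empirical_factor(n_vis, n_trans)[:, :, -1], 1.0)
```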

2. Algorithmic Structure: FOIM Core Loop

FOIM operates by iteratively updating local empirical models and recomputing the greedy policy. At each step (a runnable sketch follows this list):

  1. Empirical Model Update: Recalculate $\hat{P}_i(y_i \mid z, a) = N_i(z, a, y_i)/N_i^{\mathrm{vis}}(z, a)$ using the current counts.
  2. Factored Value Iteration (FVI): Solve the empirical factored MDP (using FVI) to a prescribed tolerance, producing new basis weights $w$.
  3. Policy Extraction: Greedily select the action,

$$a_t = \arg\max_{a} \Big( \hat{r}(x_t, a) + \gamma \sum_{y} \hat{P}(y \mid x_t, a)\,(Hw)(y) \Big).$$

  4. Interaction: Execute $a_t$, observe the next state $x_{t+1}$.
  5. Count Update: For each factor $i$, increment $N_i^{\mathrm{vis}}(x_t[\Gamma_i], a_t)$ and $N_i(x_t[\Gamma_i], a_t, x_{t+1}[i])$.
  6. Iterate: Repeat from Step 1.
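
The following self-contained toy run illustrates the loop on a tiny two-variable factored MDP. All sizes and dynamics are invented for illustration, and the planner substitutes exact value iteration on the (deliberately tiny) joint empirical model where FOIM proper would invoke factored value iteration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy factored MDP: m = 2 variables with values {0, 1}, scope Gamma_i = {i},
# 2 actions, true reward r(x) = x[0] + x[1]. Index GOE = 2 is the fictitious x_E.
m, n_vals, n_actions, GOE = 2, 2, 2, 2
gamma, R_E = 0.9, 100.0
P_true = rng.dirichlet(np.ones(n_vals), size=(m, n_vals, n_actions))

# Optimistic initialization: N_vis = 1 with all mass on x_E.
n_vis = np.ones((m, n_vals, n_actions))
n_trans = np.zeros((m, n_vals, n_actions, n_vals + 1))
n_trans[..., GOE] = 1.0

states = [(i, j) for i in range(n_vals + 1) for j in range(n_vals + 1)]

def p_hat(x, a, y):
    """Empirical joint transition probability (product over factors).
    A GOE component is absorbing under the model."""
    p = 1.0
    for i in range(m):
        if x[i] == GOE:
            p *= 1.0 if y[i] == GOE else 0.0
        else:
            p *= n_trans[i, x[i], a, y[i]] / n_vis[i, x[i], a]
    return p

def reward(x):
    return sum(R_E if v == GOE else v for v in x)

def plan():
    """Step 2: solve the empirical model. The tiny joint space lets us use
    exact VI here; FOIM itself runs *factored* (approximate) VI instead."""
    V = {s: 0.0 for s in states}
    for _ in range(100):
        V = {s: reward(s) + gamma * max(
                sum(p_hat(s, a, t) * V[t] for t in states)
                for a in range(n_actions))
             for s in V}
    return V

x = (0, 0)
for step in range(20):
    V = plan()                                               # Steps 1-2
    a = max(range(n_actions),                                # Step 3: greedy
            key=lambda a: sum(p_hat(x, a, t) * V[t] for t in states))
    y = tuple(int(rng.choice(n_vals, p=P_true[i, x[i], a]))  # Step 4: interact
              for i in range(m))
    for i in range(m):                                       # Step 5: counts
        n_vis[i, x[i], a] += 1
        n_trans[i, x[i], a, y[i]] += 1
    x = y                                                    # Step 6: iterate

print("visit counts per (factor, value, action):")
print(n_vis)
```

Because the model initially routes all mass to the high-reward GOE substate, the greedy policy is drawn to unexplored local contexts until their counts accumulate, at which point the empirical factors dominate.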

Each step's dominant computational operations (the empirical update and the planner invocation) incur polynomial cost in all relevant parameters: number of factors $m$, local table size $N$, action space size $|A|$, number of basis functions $K$, and accuracy/confidence parameters.

3. Approximate Value Iteration with Factored Structure

FOIM employs Factored Value Iteration (FVI) as its planning subroutine: the agent's value function is projected onto a basis $\{h_k\}_{k=1}^{K}$ with

$$V(x) = \sum_{k=1}^{K} w_k\, h_k(x), \qquad \text{i.e., } V = Hw,$$

and managed through a normalized projection matrix $G$. FVI applies repeated AVI steps,

$$w_{t+1} = G \max_{a} \big( r^a + \gamma P^a H w_t \big),$$

where $r^a$ is the extended reward vector, $P^a$ the (Kronecker-factorized) transition operator, $H$ is the matrix of basis functions, and the maximization is taken componentwise. The Bellman backup integrates the optimistic GOE-reward bonus at each step.
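
A compact sketch of this projected iteration, with dense matrices standing in for the Kronecker-factorized operators and a globally scaled pseudoinverse standing in for the paper's normalized projection (FVI's own normalization is per-row and may differ):

```python
import numpy as np

def factored_value_iteration(H, r, P, gamma=0.9, tol=1e-6, max_iter=1000):
    """Projected AVI: w <- G max_a (r^a + gamma P^a H w).

    H : (S, K) matrix of basis functions.
    r : (A, S) reward vectors r^a.
    P : (A, S, S) row-stochastic transition matrices P^a.
    """
    G = np.linalg.pinv(H)
    # Scale so that ||G||_inf * ||H||_inf <= 1, which makes the iteration
    # a gamma-contraction in max norm (a simplification of FVI's scheme).
    scale = np.abs(G).sum(axis=1).max() * np.abs(H).sum(axis=1).max()
    G = G / max(scale, 1.0)
    w = np.zeros(H.shape[1])
    for _ in range(max_iter):
        q = r + gamma * P @ (H @ w)        # (A, S): one backup per action
        w_next = G @ q.max(axis=0)         # componentwise max, then project
        if np.max(np.abs(w_next - w)) < tol:
            break
        w = w_next
    return w

# Tiny demo with made-up sizes: 6 states, 2 actions, 3 basis functions.
S, A, K = 6, 2, 3
rng = np.random.default_rng(1)
H = rng.standard_normal((S, K))
r = rng.uniform(size=(A, S))
P = rng.dirichlet(np.ones(S), size=(A, S))   # rows sum to 1
print("fixed-point weights:", factored_value_iteration(H, r, P))
```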

4. Polynomial-Time Learning Guarantees

FOIM's main theoretical result is that, for a suitably large GOE reward $R_E$, with high probability the number of time steps at which the selected action is not $\epsilon$-close in value to the AVI fixed point is polynomial in all problem parameters. Specifically, let $N$ denote the maximum local table size for local-scope size $m_0 = \max_i |\Gamma_i|$, let $\gamma$ be the discount, and let $\epsilon$, $\delta$ be the accuracy/confidence parameters. Choosing

$$R_E \;=\; c \cdot \operatorname{poly}\!\left(m,\, N,\, |A|,\, \frac{1}{\epsilon},\, \frac{1}{1-\gamma},\, \log\frac{1}{\delta}\right)$$

for a suitable constant $c$ ensures, with probability at least $1-\delta$,

$$\Big|\Big\{\, t \;:\; Q^{\mathrm{AVI}}(x_t, a_t) < \max_{a} Q^{\mathrm{AVI}}(x_t, a) - \epsilon \,\Big\}\Big| \;\le\; \operatorname{poly}\!\left(m,\, N,\, |A|,\, \frac{1}{\epsilon},\, \frac{1}{1-\gamma},\, \log\frac{1}{\delta}\right),$$

where $Q^{\mathrm{AVI}}$ denotes the state-action value of the AVI fixed point on the true model.

Thus, except for polynomially many steps, FOIM's policy is nearly as good as that obtained by running approximate value iteration on the true model (0904.3352).

5. Outline of the Proof Approach

The convergence argument rests on two main pillars:

(a) Persistent Optimism through Initialization and Greediness: When $R_E$ is sufficiently large, induction shows that at every iteration the model's value function remains optimistic,

$$\hat{Q}_t(x, a) \;\ge\; Q^{\mathrm{AVI}}(x, a) - \epsilon \quad \text{for all } x, a.$$

The optimism established through initialization persists due to the monotonicity of the approximate Bellman operator and the per-factor maximally optimistic starting conditions.

(b) Bounded Sample Complexity via Factor-Component Counting: Each local component $(i, z, a)$ is declared known once visited $m_{\mathrm{known}}$ times, for a polynomially sized threshold $m_{\mathrm{known}}$. Using uniform Azuma/Hoeffding arguments, all factor-components (at most $m N |A|$ of them) eventually become known, after at most polynomially many visits. A specialized simulation lemma then ensures that once all components are known, the empirical model's value is within $\epsilon$ of the true (AVI) value; further mistakes are rare and coincident only with component discovery.
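
A sketch of the knownness bookkeeping in pillar (b); the threshold `m_known` is a plain parameter here, whereas the paper derives its polynomial value from the concentration bounds:

```python
from collections import defaultdict

class KnownnessTracker:
    """Marks each factor-component (i, z, a) known after m_known visits."""

    def __init__(self, m_known):
        self.m_known = m_known
        self.visits = defaultdict(int)

    def record(self, i, z, a):
        """Count one visit; True exactly when the component becomes known."""
        key = (i, z, a)
        self.visits[key] += 1
        return self.visits[key] == self.m_known

    def all_known(self, components):
        return all(self.visits[c] >= self.m_known for c in components)

# The simulation lemma applies once all_known(...) holds; until then, each
# newly known component accounts for at most a bounded burst of mistakes.
tracker = KnownnessTracker(m_known=3)
for _ in range(3):
    became_known = tracker.record(i=0, z=5, a=1)
print(became_known)  # True on the third visit
```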

6. Computational Complexity Analysis

Each FOIM iteration involves:

  • Empirical model update: $O(m)$ count increments (one local context per factor is touched), plus recomputation of the affected empirical entries.
  • Factored Value Iteration: each Bellman backup costs time polynomial in $m$, $N$, $|A|$, and $K$, and the backup is repeated $O\!\left(\frac{1}{1-\gamma}\log\frac{1}{\epsilon}\right)$ times for the contraction to reach tolerance $\epsilon$.

Consequently, total planner time per step is polynomial in $m$, $N$, $|A|$, $K$, $1/\epsilon$, and $1/(1-\gamma)$, i.e., polynomial in all input and accuracy parameters. This renders FOIM, unlike prior approaches, tractable for high-dimensional structured domains.

7. Comparison to Tabular OIM and Algorithmic Implications

Tabular OIM applies the same exploration principle by injecting a GOE pseudo-transition for each global state-action pair $(x, a)$ and employs standard, exact value iteration. FOIM diverges in two respects:

  • Factorization of Counts: FOIM maintains per-factor transition counts, reducing storage and computation versus flat state-action tables of size $|X| \cdot |A|$, which grow exponentially with the number of variables.
  • Use of Approximate Factored Planning: FVI is used in place of exact value iteration, so the performance benchmark is not the true optimum $V^*$, but the AVI fixed point $Hw^*$.

Despite these differences, the exploratory behavior, driven entirely by initial optimism with no explicit exploration bonus, remains the same. As real experience accrues, optimism fades in locally explored regions, focusing planning on the true model. FOIM is distinguished as the first factored-MDP learning algorithm that is greedy, polynomial-time per iteration, and sample-efficient, staying $\epsilon$-close to the best AVI policy except for a polynomial number of steps (0904.3352).
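
As a back-of-the-envelope illustration of the storage gap (the sizes below are made up, not from the paper), compare a flat joint count table with FOIM's per-factor tables:

```python
# Flat vs. factored count-table sizes for m variables with n values each,
# |A| actions, and per-factor scope size at most m0 (toy numbers).
m, n, A, m0 = 20, 4, 5, 2

flat_entries = (n ** m) * A                 # one row per global (x, a)
factored_entries = m * (n ** (m0 + 1)) * A  # per factor: contexts x values

print(f"flat table:     {flat_entries:.3e} entries")     # ~5.5e12
print(f"factored table: {factored_entries:.3e} entries")  # 6.400e+03
```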

References

 1. István Szita and András Lőrincz. Optimistic initialization and greediness lead to polynomial time learning in factored MDPs. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009. arXiv:0904.3352.
