Optimistic Initial Model in Factored MDPs
- OIM is a reinforcement learning technique that initializes models with inflated reward estimates to encourage thorough exploration in structured state spaces.
- FOIM leverages factored value iteration and per-factor count updates to achieve near-optimal policies with provable polynomial-time and sample efficiency guarantees.
- The method guarantees that, except for a polynomial number of steps, the selected actions remain within an epsilon range of the AVI-optimal value, as established by rigorous theoretical bounds.
An Optimistic Initial Model (OIM) is a technique for reinforcement learning (RL) in which the agent’s empirical model is initialized with highly optimistic estimates, systematically biasing initial policy selection toward actions that appear maximally rewarding. In the context of factored Markov Decision Processes (FMDPs), i.e., MDPs with structured state representations, FOIM (Factored Optimistic Initial Model) is the first algorithm that acts purely greedily while guaranteeing polynomial-time learning with respect to the fixed point of approximate value iteration (AVI), as established in (0904.3352).
1. Optimistic Initialization in Factored MDPs
FOIM initializes transition and reward factors to embody maximal optimism. For each state variable $x_i$ with domain $X_i$, a fictitious “Garden-of-Eden” (GOE) substate $x_i^E$ is introduced, expanding each domain to $X_i \cup \{x_i^E\}$. Empirically, for each local context $x[\Gamma_i]$ (the parent scope of variable $i$), action $a$, and possible next variable value $y \in X_i \cup \{x_i^E\}$, two counts are maintained: $N^i(x[\Gamma_i], a)$ (context-action visit count) and $N^i(x[\Gamma_i], a, y)$ (transitions to $y$).
Initialization is as follows:
$$N_0^i(x[\Gamma_i], a) = 1, \qquad N_0^i(x[\Gamma_i], a, y) = \begin{cases} 1 & \text{if } y = x_i^E, \\ 0 & \text{otherwise.} \end{cases}$$
This yields empirical transition factors with $\hat{P}_i(x_i^E \mid x[\Gamma_i], a) = 1$ initially, so every action-context pair appears to deterministically yield the GOE state. Optimism is similarly infused into rewards: an extra reward factor with large constant value $V_E$ for states containing a GOE substate is added,
$$R_E(x) = \begin{cases} V_E & \text{if } x_i = x_i^E \text{ for some } i, \\ 0 & \text{otherwise.} \end{cases}$$
Hence, all unexplored regions promise maximal possible future payoff, incentivizing thorough exploration.
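A minimal Python sketch of this initialization for a single transition factor is given below; the dictionary-based counts, the `GOE` token, and the `reward_bonus` helper are illustrative names under assumed data structures, not the paper's implementation.

```python
from collections import defaultdict
from itertools import product

GOE = "x_E"  # fictitious Garden-of-Eden substate appended to every variable's domain

def init_optimistic_factor(parent_domains, child_domain, actions):
    """Optimistically initialize counts for one factor P_i(x_i' | x[Gamma_i], a).

    Every (context, action) pair starts with a single fictitious observation
    that led to the GOE substate, so the empirical model predicts the GOE
    state (and hence maximal reward) with probability 1 until real data arrives.
    """
    visit = defaultdict(int)   # N(context, a)
    trans = defaultdict(int)   # N(context, a, y)
    child_values = list(child_domain) + [GOE]
    for context in product(*parent_domains):
        for a in actions:
            visit[(context, a)] = 1
            trans[(context, a, GOE)] = 1      # pretend we once jumped to Eden
            for y in child_domain:
                trans[(context, a, y)] = 0
    return visit, trans, child_values

def empirical_prob(visit, trans, context, a, y):
    """Empirical transition probability from the current counts."""
    return trans[(context, a, y)] / visit[(context, a)]

# Optimistic reward bonus: a large constant V_E whenever any variable sits in GOE.
V_E = 1e6  # assumed magnitude; the theory prescribes a problem-dependent value

def reward_bonus(next_state):
    # next_state is assumed to be a tuple of per-variable values
    return V_E if GOE in next_state else 0.0
```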
2. Algorithmic Structure: FOIM Core Loop
FOIM operates by iteratively updating local empirical models and recomputing the greedy policy. At each step:
- Empirical Model Update: Recalculate $\hat{P}_i(y \mid x[\Gamma_i], a) = N^i(x[\Gamma_i], a, y) \,/\, N^i(x[\Gamma_i], a)$ using the current counts.
- Factored Value Iteration (FVI): Solve the empirical factored MDP (using FVI) to tolerance $\epsilon_{\mathrm{VI}}$, producing new basis weights $w_t$.
- Policy Extraction: Greedily select actions,
$$a_t = \arg\max_{a \in \mathcal{A}} \Big[ \hat{R}(x_t, a) + \gamma \sum_{y} \hat{P}(y \mid x_t, a)\, (H w_t)(y) \Big].$$
- Interaction: Execute $a_t$, observe the next state $x_{t+1}$ and reward $r_t$.
- Count Update: For each factor $i$, increment $N^i(x_t[\Gamma_i], a_t)$ and $N^i(x_t[\Gamma_i], a_t, (x_{t+1})_i)$.
- Iterate: Repeat from the model-update step.
Each step’s dominant computational operations (the empirical model update and the planner invocation) incur polynomial cost in all relevant parameters: the number of factors $m$, the maximal local table size $N_\Gamma = \max_i |X[\Gamma_i]|$, the action space size $|\mathcal{A}|$, the number of basis functions $K$, and the accuracy/confidence parameters.
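The loop can be summarized in the following sketch, assuming hypothetical `model`, `factored_value_iteration`, and `greedy_action` helpers that wrap the per-factor counts, the FVI planner, and the greedy policy extraction respectively; `env` is assumed to expose a gym-like interface.

```python
def foim_loop(env, model, planner_tol, gamma, num_steps):
    """Greedy FOIM-style interaction loop (illustrative sketch)."""
    x = env.reset()
    w = None  # basis weights, warm-started between planner calls
    for _ in range(num_steps):
        # 1) Empirical model from the current (optimistically initialized) counts.
        P_hat, R_hat = model.empirical_model()
        # 2) Approximate planning: factored value iteration to the given tolerance.
        w = factored_value_iteration(P_hat, R_hat, gamma, tol=planner_tol, w_init=w)
        # 3) Greedy action w.r.t. the empirical model and the value estimate Hw.
        a = greedy_action(x, P_hat, R_hat, w, gamma)
        # 4) Interact with the environment.
        x_next, reward = env.step(a)
        # 5) Per-factor count update; optimism fades only where data accumulates.
        model.update_counts(x, a, x_next)
        x = x_next
```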
3. Approximate Value Iteration with Factored Structure
FOIM employs Factored Value Iteration (FVI) as its planning subroutine: the agent’s value function is projected onto a basis $\{h_k\}_{k=1}^{K}$ with
$$V(x) \approx \sum_{k=1}^{K} w_k\, h_k(x) = (Hw)(x),$$
and managed through a normalized projection matrix $G$, chosen so that the projected backup remains a max-norm contraction. FVI applies repeated AVI steps,
$$w_{t+1} = G \max_{a \in \mathcal{A}} \big( r^a + \gamma P^a H w_t \big),$$
where $r^a$ is the extended reward vector, $P^a$ the (Kronecker-factorized) transition operator, and $H$ is the matrix of basis functions. The Bellman backup integrates the optimistic GOE-reward bonus at each step.
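The following is a minimal dense-matrix sketch of this projected backup; in an actual factored implementation $H$, $G$, and $P^a$ would be represented implicitly (e.g., via sampled states and the Kronecker factorization), so the explicit NumPy arrays here are purely illustrative.

```python
import numpy as np

def fvi(H, G, R, P, gamma, tol=1e-4, max_iters=10_000):
    """Projected approximate value iteration: w_{t+1} = G max_a (r^a + gamma P^a H w_t).

    H : (S, K) basis matrix,  G : (K, S) normalized projection
        (chosen so the projected backup stays a contraction),
    R : (A, S) reward vectors, P : (A, S, S) transition matrices.
    Dense arrays are used only to keep the sketch short.
    """
    w = np.zeros(H.shape[1])
    for _ in range(max_iters):
        backups = R + gamma * P @ (H @ w)           # shape (A, S): r^a + gamma P^a H w
        w_next = G @ backups.max(axis=0)            # greedy max over actions, then project
        if np.max(np.abs(H @ (w_next - w))) < tol:  # stop when the value change is small
            return w_next
        w = w_next
    return w
```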
4. Polynomial-Time Learning Guarantees
FOIM’s main theoretical result is that, for a suitable setting of the GOE value $V_E$, with high probability, the number of time steps at which the value of the selected action is not $\epsilon$-close to the AVI-optimal value is polynomial in all problem parameters. Specifically, letting $N_\Gamma = \max_i |X[\Gamma_i]|$ for local-scope size $|\Gamma_i|$, and with discount $\gamma$ and accuracy/confidence parameters $\epsilon$, $\delta$, choosing
$$V_E = c \cdot \mathrm{poly}\!\left(m,\; N_\Gamma,\; |\mathcal{A}|,\; R_{\max},\; \tfrac{1}{\epsilon},\; \tfrac{1}{1-\gamma},\; \log\tfrac{1}{\delta}\right)$$
for a suitable constant $c$ ensures, with probability at least $1-\delta$,
$$\#\Big\{\, t \;:\; V^{\pi_t}(x_t) < V^{\mathrm{AVI}}(x_t) - \epsilon \,\Big\} \;\le\; \mathrm{poly}\!\left(m,\; N_\Gamma,\; |\mathcal{A}|,\; R_{\max},\; \tfrac{1}{\epsilon},\; \tfrac{1}{1-\gamma},\; \log\tfrac{1}{\delta}\right).$$
Thus, except for polynomially many steps, FOIM’s policy is nearly as good as the one obtained by running approximate value iteration on the true model (0904.3352).
5. Outline of the Proof Approach
The convergence argument rests on two main pillars:
(a) Persistent Optimism through Initialization and Greediness: When $V_E$ is sufficiently large, induction shows that at every iteration
$$H w_t \;\ge\; V^{\mathrm{AVI}} - \epsilon \qquad \text{(componentwise)}.$$
The optimism established through initialization persists due to the monotonicity of the value-iteration backup and the per-factor maximally optimistic starting conditions.
(b) Bounded Sample Complexity via Factor-Component Counting: Each local component $(i, x[\Gamma_i], a)$ is declared known once visited $m_{\text{known}}$ times. Using uniform Azuma/Hoeffding arguments, all factor-components (at most $m \cdot N_\Gamma \cdot |\mathcal{A}|$ of them) eventually become known, after at most polynomially many visits. A specialized simulation lemma then ensures that once all components are known, the empirical model’s value is within $\epsilon$ of the true (AVI) value; further mistakes are rare and coincide only with component discovery.
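A small sketch of the known-component bookkeeping behind this counting argument, with `m_known` standing for the visit threshold from the analysis and dictionary keys for the per-factor (context, action) pairs:

```python
def known_components(visit_counts, m_known):
    """Return the factor components (i, context, a) visited at least m_known times.

    visit_counts: dict mapping (factor_index, context, action) -> visit count.
    Once every component is known, the simulation-lemma step of the proof
    guarantees the empirical model's value is close to the AVI value.
    """
    return {key for key, n in visit_counts.items() if n >= m_known}

def all_known(visit_counts, m_known):
    return len(known_components(visit_counts, m_known)) == len(visit_counts)
```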
6. Computational Complexity Analysis
Each FOIM iteration involves:
- Empirical model update: $O\!\big(\sum_i |X[\Gamma_i]|\,|\mathcal{A}|\,|X_i|\big)$ operations to refresh the local transition tables from the counts.
- Factored Value Iteration: each Bellman backup costs time polynomial in $m$, $N_\Gamma$, $|\mathcal{A}|$, and $K$, and is repeated $O\!\big(\tfrac{1}{1-\gamma}\log\tfrac{1}{\epsilon_{\mathrm{VI}}(1-\gamma)}\big)$ times, thanks to the $\gamma$-contraction of the backup, to reach tolerance $\epsilon_{\mathrm{VI}}$.
Consequently, the total planner time per step is polynomial in all input and accuracy parameters. This renders FOIM, unlike prior approaches, tractable for high-dimensional structured domains.
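As a rough illustration of why the per-factor representation keeps these costs polynomial, the following back-of-the-envelope comparison contrasts the factored table size with a flat tabular model; all parameter values are hypothetical.

```python
m, A = 30, 4                  # 30 state variables, 4 actions (assumed values)
domain, scope = 2, 3          # each variable binary, each factor depends on 3 parents

factored_entries = m * (domain ** scope) * A * domain    # sum_i |X[Gamma_i]| * |A| * |X_i|
flat_entries = (domain ** m) * A * (domain ** m)          # |X| * |A| * |X| tabular model

print(f"factored model: {factored_entries:,} entries")    # 1,920
print(f"flat model:     {flat_entries:,} entries")        # ~4.6e18
```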
7. Comparison to Tabular OIM and Algorithmic Implications
Tabular OIM applies the same exploration principle by injecting a GOE pseudo-transition for each global state-action pair $(x, a)$ and employs standard, exact value iteration. FOIM diverges in two respects:
- Factorization of Counts: FOIM maintains per-factor transition counts, reducing storage and computation versus flat $|X| \times |\mathcal{A}|$ tables, whose size grows exponentially in the number of state variables.
- Use of Approximate Factored Planning: FVI is used in place of exact value iteration, so the performance benchmark is not the true optimum $V^*$, but the AVI fixed point $V^{\mathrm{AVI}}$.
Despite these differences, the explorative behavior, driven entirely by initial optimism with no explicit exploration bonus, remains the same. As real experience accrues, optimism fades in locally explored regions, focusing planning on the true model. FOIM is distinguished as the first factored-MDP learning algorithm that is greedy, polynomial-time per iteration, and sample-efficient, staying $\epsilon$-close to the best AVI policy except for a polynomial number of steps (0904.3352).