
FrozenLake-Obscure: Stochastic RL Gridworld

Updated 22 June 2026
  • FrozenLake-Obscure is a reinforcement learning benchmark environment featuring a gridworld with variable slip dynamics, hidden states, and partial observability.
  • The methodology employs an optimized Monte Carlo Tree Search that integrates belief-state tracking, rollout simulations, and adaptive exploration to manage uncertainty.
  • Empirical benchmarks demonstrate that optimized MCTS achieves a 70% success rate and rapid convergence, outperforming policy-prior MCTS and Q-learning in stochastic scenarios.

FrozenLake-Obscure refers to a class of reinforcement learning (RL) benchmark environments that extend the classic FrozenLake domain with increased stochasticity and partial observability. These environments present a discrete gridworld with slippery dynamics, where agents must navigate from a start (S) to a goal (G) while avoiding holes (H) and traversing frozen tiles (F). In the Obscure variant, agents face additional uncertainty, such as hidden state information and unknown transition probabilities, aligning the decision-making challenge with partially observable Markov decision processes (POMDPs) (Guerra, 2024).

1. Environment Structure and Obscurity

The FrozenLake-Obscure environment is composed of a 4×4 or 8×8 grid. Each tile is labeled as Start (S), Goal (G), Frozen (F), or Hole (H). The navigation objective is to reach G from S in as few steps as possible, where falling into a hole terminates the episode with zero reward. The dynamics are characterized by stochasticity: when an action (left, right, up, down) is selected, with probability $p_{slip}$ the agent transitions in a random direction (uniformly among the other three), modeling uncertainty and partial controllability in real systems.
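The slip rule described above can be sketched in a few lines of Python. This is a minimal illustration that follows the description in this section rather than any particular library's implementation; the action encoding, the `slip_step` name, and the choice to keep the agent in place when a move would leave the grid are assumptions of the sketch.

```python
import random

# Assumed action encoding for this sketch: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def slip_step(row, col, action, p_slip, size=4, rng=random):
    """Apply one move; with probability p_slip the chosen action is
    replaced by one of the other three, picked uniformly."""
    if rng.random() < p_slip:
        action = rng.choice([a for a in MOVES if a != action])
    d_row, d_col = MOVES[action]
    # Assumption: a move that would leave the grid is clipped at the boundary.
    new_row = min(max(row + d_row, 0), size - 1)
    new_col = min(max(col + d_col, 0), size - 1)
    return new_row, new_col
```

With `p_slip=0.0` the function is deterministic, which makes it convenient for checking the grid logic before slip is switched on.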

In the “Obscure” or partially observable extension:

  • The agent may lack perfect information about its current tile (e.g., due to “fog of war”).
  • Certain tile types (H or F) are undisclosed until visited.
  • The slip probability $p_{slip}$ may differ across regions or be unknown a priori.

These properties escalate the environment’s uncertainty, necessitating methods capable of belief-state reasoning and online exploration.
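One way to picture the hidden-tile property is an observation function that masks every F or H tile the agent has not yet visited. The sketch below is illustrative only: the `observe` helper, the `?` mask symbol, and the use of the standard 4×4 FrozenLake layout are assumptions, not details taken from the source.

```python
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]  # standard 4x4 FrozenLake layout

def observe(grid, visited):
    """Return the map as the agent sees it: unvisited F/H tiles become '?'.

    `visited` is a set of (row, col) positions the agent has stepped on.
    """
    rows = []
    for r, line in enumerate(grid):
        cells = []
        for c, tile in enumerate(line):
            hidden = tile in "FH" and (r, c) not in visited
            cells.append("?" if hidden else tile)
        rows.append("".join(cells))
    return rows
```

For example, `observe(GRID, {(0, 0), (0, 1)})` reveals only the start tile, the goal tile, and the single frozen tile the agent has stepped on.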

2. Optimized Monte Carlo Tree Search in FrozenLake-Obscure

The control strategy prominently studied in this context is an optimized Monte Carlo Tree Search (MCTS), built on cumulative reward and visit count tables and the Upper Confidence Bound for Trees (UCT) formula. The algorithm cycles through four interleaved phases:

  1. Selection: At each non-terminal node (state $s$), select the action $a$ maximizing

$$\operatorname{UCT}(s,a) = \frac{Q(s,a)}{N(s,a)} + c \cdot \sqrt{\frac{\ln N(s)}{N(s,a)}}$$

where $Q(s,a)$ is the cumulative reward, $N(s,a)$ is the number of times action $a$ has been taken from $s$, $N(s)$ is the total number of visits to $s$, and $c$ is the exploration constant (default: [value missing]).

  2. Expansion: On encountering a state-action pair $(s,a)$ with $N(s,a) = 0$, expand the tree and initialize statistics for the new node.
  3. Simulation (Rollout): From the new node, follow a randomized policy (subject to slip) until termination. The episode yields $R = 1$ for reaching G and 0 otherwise.
  4. Backpropagation: Update the tables along the visited path:

$$Q(s,a) \leftarrow Q(s,a) + R, \qquad N(s,a) \leftarrow N(s,a) + 1, \qquad N(s) \leftarrow N(s) + 1$$

The key statistics $Q(s,a)$, $N(s,a)$, and $N(s)$ are maintained in lookup tables for rapid access and update. The average value estimate used in UCT is $Q(s,a)/N(s,a)$.
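The four phases can be sketched as a compact table-based planner. This is a hedged illustration, not the source's implementation: the `MCTS` class, the `step` helper, the standard 4×4 map, and the defaults `c=1.4`, `p_slip=0.1`, and `horizon=50` are all assumptions made for the example.

```python
import math
import random
from collections import defaultdict

GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]      # assumed standard 4x4 layout
SIZE = 4
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]   # left, down, right, up

def step(state, action, p_slip, rng):
    """One slippery transition; returns (next_state, reward, done)."""
    if rng.random() < p_slip:
        action = rng.choice([a for a in range(4) if a != action])
    row, col = state
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    tile = GRID[row][col]
    return (row, col), (1.0 if tile == "G" else 0.0), tile in "GH"

class MCTS:
    def __init__(self, c=1.4, p_slip=0.1, horizon=50, seed=0):
        self.c, self.p_slip, self.horizon = c, p_slip, horizon
        self.rng = random.Random(seed)
        self.Q = defaultdict(float)    # cumulative reward per (state, action)
        self.N_sa = defaultdict(int)   # visits per (state, action)
        self.N_s = defaultdict(int)    # visits per state

    def uct(self, s, a):
        n = self.N_sa[(s, a)]
        return self.Q[(s, a)] / n + self.c * math.sqrt(math.log(self.N_s[s]) / n)

    def rollout(self, state):
        """Uniformly random policy until termination or the horizon."""
        for _ in range(self.horizon):
            state, reward, done = step(state, self.rng.randrange(4),
                                       self.p_slip, self.rng)
            if done:
                return reward
        return 0.0

    def simulate(self, root):
        path, state, reward, done = [], root, 0.0, False
        while not done and len(path) < self.horizon:
            untried = [a for a in range(4) if self.N_sa[(state, a)] == 0]
            if untried:                                  # expansion
                action = self.rng.choice(untried)
                path.append((state, action))
                state, reward, done = step(state, action, self.p_slip, self.rng)
                if not done:
                    reward = self.rollout(state)         # simulation
                break
            action = max(range(4), key=lambda a: self.uct(state, a))  # selection
            path.append((state, action))
            state, reward, done = step(state, action, self.p_slip, self.rng)
        for s, a in path:                                # backpropagation
            self.Q[(s, a)] += reward
            self.N_sa[(s, a)] += 1
            self.N_s[s] += 1

    def best_action(self, root, n_sim=100):
        for _ in range(n_sim):
            self.simulate(root)
        return max(range(4), key=lambda a: self.N_sa[(root, a)])
```

A call such as `MCTS().best_action((0, 0), n_sim=100)` runs 100 simulations from the start tile and returns the most-visited root action. Returning the most-visited action rather than the highest-mean action is a common choice and is also an assumption here.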

3. Performance Benchmarks and Metrics

Empirical evaluation was conducted on OpenAI Gym FrozenLake-v0 (4×4 slippery) with 100 simulations per move and 100,000 episodes, comparing:

  • Optimized MCTS
  • MCTS with Policy Priors
  • Q-Learning (ε-greedy with ε-decay; learning rate and discount factor: [values missing])

Performance was assessed by success rate (proportion reaching G), average reward per episode, convergence rate (average steps to G), and wall-clock execution time. Table 1 summarizes the key findings:

| Algorithm | Avg. Reward | Success Rate (%) | Convergence (steps) | Time (sec) |
|---|---|---|---|---|
| Optimized MCTS | 0.80 | 70 | 40 | 48.4 |
| MCTS with Policy Priors | 0.40 | 35 | 30 | 1758.5 |
| Q-Learning | 0.80 | 60 | 50 | 42.7 |

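Optimized MCTS achieves the highest success rate (70%, versus 60% for Q-Learning and 35% for MCTS with Policy Priors). Its runtime (48.4 s) is comparable to Q-Learning (42.7 s) and far below MCTS with Policy Priors (1758.5 s). Reward and success rate curves indicate that Optimized MCTS converges fastest to high performance.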
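For readers reproducing these metrics, the per-episode bookkeeping might look like the sketch below. The `summarize` function and the `(reward, steps, reached_goal)` record layout are hypothetical; the source does not specify how its metrics were computed beyond the definitions above.

```python
from statistics import mean

def summarize(episodes):
    """Aggregate episode records into the reported metrics.

    `episodes` is a list of (reward, steps, reached_goal) tuples.
    """
    steps_on_success = [steps for _, steps, reached in episodes if reached]
    return {
        "success_rate_pct": 100.0 * len(steps_on_success) / len(episodes),
        "avg_reward": mean(reward for reward, _, _ in episodes),
        # Assumption: convergence is averaged over successful episodes only.
        "avg_steps_to_goal": mean(steps_on_success) if steps_on_success else float("nan"),
    }
```

Wall-clock time is not covered here; it could be measured separately, for instance by wrapping the full evaluation loop with `time.perf_counter()`.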

4. Hyperparameter Considerations

The exploration constant $c$ modulates the trade-off in UCT between exploration and exploitation:

  • Lower $c$ ([range missing]): promotes exploitation at the risk of converging on suboptimal paths; underperforms in environments with high $p_{slip}$.
  • Higher $c$ ([range missing]): encourages broad exploration; converges more slowly and incurs higher sample complexity, but performs better in high-uncertainty scenarios.

For standard slippery grids ($p_{slip}$: [value missing]), a setting of $c$ = [value missing] balances exploration and exploitation effectively. In environments with variable or hidden $p_{slip}$, as is typical of FrozenLake-Obscure, an adaptive $c$ (annealed from high to low) or a state-dependent $c(s)$ further enhances learning by promoting exploration under uncertainty before shifting to exploitation as learning progresses.
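The two adaptive schemes mentioned here, annealing and state dependence, could take many forms. The sketch below shows one simple possibility for each; the linear schedule, the inverse-square-root visit bonus, the function names, and every default value are assumptions chosen for illustration.

```python
def annealed_c(step, total_steps, c_start=2.0, c_end=0.5):
    """Linearly anneal the exploration constant from c_start to c_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return c_start + frac * (c_end - c_start)

def state_dependent_c(visits, c_base=1.0, bonus=1.0):
    """Use a larger constant in rarely visited states; the extra term
    shrinks toward zero as the visit count grows."""
    return c_base + bonus / (1 + visits) ** 0.5
```

Either function can replace a fixed constant at selection time, for example by passing `annealed_c(episode, n_episodes)` or `state_dependent_c(N_s[s])` into the UCT score.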

5. Extensions to Obscure and Partially Observable Environments

FrozenLake-Obscure introduces partial observability and hidden transition probabilities, necessitating enhancements to standard MCTS:

  • Belief-State MCTS: The tree is expanded in the space of belief states $b$ (distributions over true states), with statistics $Q(b,a)$ and $N(b,a)$ supplanting $Q(s,a)$ and $N(s,a)$.
  • Augmented UCT: The selection criterion becomes

$$\operatorname{UCT}(b,a) = \frac{Q(b,a)}{N(b,a)} + c \cdot \sqrt{\frac{\ln N(b)}{N(b,a)}} + \lambda \cdot \operatorname{IG}(b,a)$$

where $\operatorname{IG}(b,a)$ is the expected information gain, and $\lambda$ is its weight.

  • Particle Filter Rollouts: Simulations sample possible underlying maps/configurations, enabling the agent to reason about both environment structure and transition uncertainty.
  • Information Gain and Curiosity Heuristics: Bonus terms are provided for actions that reduce uncertainty (e.g., visiting unobserved tiles), facilitating “curiosity-driven exploration.”
  • Transition Probability Uncertainty: Upper and lower confidence bounds are maintained on transition probabilities to guide rollouts more robustly.

If full POMDP planning becomes computationally infeasible, lightweight heuristics—such as prioritizing rarely observed tile types or state observations—can approximate uncertainty-resolving behavior.
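As a rough illustration of an information-gain bonus in the selection score, the sketch below uses the entropy of a per-tile "is this a hole?" belief as the gain from visiting that tile, on the reasoning that a visit resolves that tile's uncertainty completely. This entropy proxy, the function names, and the default constant and weight are assumptions, not the source's definitions.

```python
import math

def bernoulli_entropy(p):
    """Entropy in bits of a belief that a tile is a hole with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def augmented_uct(q_sum, n_ba, n_b, info_gain, c=1.4, weight=0.5):
    """UCT score on belief-node statistics plus a weighted information bonus.

    q_sum: cumulative reward for (belief, action); n_ba: its visit count;
    n_b: visit count of the belief node; info_gain: expected gain in bits.
    """
    exploit = q_sum / n_ba
    explore = c * math.sqrt(math.log(n_b) / n_ba)
    return exploit + explore + weight * info_gain
```

Under this proxy, an action leading to a tile with belief 0.5 earns the maximum bonus of one bit times the weight, while an action leading to an already revealed tile earns none.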

6. Practical Recommendations and Implementation Strategies

Empirical findings recommend initializing Q and N tables in the fully observable environment using Optimized MCTS, then gradually complicating the scenario by introducing partial observability and leveraging uncertainty-driven exploration heuristics. Hyperparameters such as the exploration constant $c$, the information gain weight $\lambda$, and the rollout horizon should be adjusted in response to observed convergence metrics. This incremental approach supports effective transfer of learned statistics and manages the increased sample complexity introduced by obscurity.

7. Conclusion and Actionable Insights

The optimized MCTS framework, anchored by cumulative Q and N tables and an exploration constant $c$, demonstrates robust performance (success rate 70%, average reward 0.80) and superior convergence properties in standard slippery FrozenLake. For the FrozenLake-Obscure class, integrating belief-state tracking, information gain bonuses, adaptive exploration/exploitation balance, and heuristics for uncertainty resolution extends the method’s applicability to POMDP-like scenarios (Guerra, 2024). With these extensions, Optimized MCTS establishes a flexible and resilient decision-making paradigm for both fully observable and obscure, partially observable gridworld domains.

References (1)
