FrozenLake-Obscure: Stochastic RL Gridworld
- FrozenLake-Obscure is a reinforcement learning benchmark environment featuring a gridworld with variable slip dynamics, hidden states, and partial observability.
- The methodology employs an optimized Monte Carlo Tree Search that integrates belief-state tracking, rollout simulations, and adaptive exploration to manage uncertainty.
- Empirical benchmarks demonstrate that optimized MCTS achieves a 70% success rate and rapid convergence, outperforming policy-prior MCTS and Q-learning in stochastic scenarios.
FrozenLake-Obscure refers to a class of reinforcement learning (RL) benchmark environments that extend the classic FrozenLake domain with increased stochasticity and partial observability. These environments present a discrete gridworld with slippery dynamics, where agents must navigate from a start (S) to a goal (G) while avoiding holes (H) and traversing frozen tiles (F). In the Obscure variant, agents face additional uncertainty, such as hidden state information and unknown transition probabilities, aligning the decision-making challenge with partially observable Markov decision processes (POMDPs) (Guerra, 2024).
1. Environment Structure and Obscurity
The FrozenLake-Obscure environment is composed of a 4×4 or 8×8 grid. Each tile is labeled as Start (S), Goal (G), Frozen (F), or Hole (H). The objective is to reach G from S in as few steps as possible; falling into a hole terminates the episode with zero reward. The dynamics are stochastic: when an action (left, right, up, down) is selected, with slip probability $p_{\text{slip}}$ the agent instead moves in a random direction (chosen uniformly among the other three), modeling uncertainty and partial controllability in real systems.
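The slip rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation: the function name `slip_step`, the parameter `p_slip`, and the border-clamping behavior are assumptions made for the example, and the action encoding follows Gym's FrozenLake convention.

```python
import random

# Action encoding as in Gym's FrozenLake: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def slip_step(pos, action, p_slip, size=4, rng=random):
    """Return the next (row, col) after one move with slip probability p_slip.

    Hypothetical helper: names and clamping rule are assumptions.
    """
    if rng.random() < p_slip:
        # Slip: replace the intended action with one of the other three, uniformly.
        action = rng.choice([a for a in MOVES if a != action])
    dr, dc = MOVES[action]
    # Moves that would leave the grid are clamped to the border (assumed).
    row = min(max(pos[0] + dr, 0), size - 1)
    col = min(max(pos[1] + dc, 0), size - 1)
    return (row, col)
```

With `p_slip=0.0` the move is deterministic; with `p_slip=1.0` the intended direction is never taken.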
In the “Obscure” or partially observable extension:
- The agent may lack perfect information about its current tile (e.g., due to “fog of war”).
- Certain tile types (H or F) are undisclosed until visited.
- The slip probability may differ across regions or be unknown a priori.
These properties escalate the environment’s uncertainty, necessitating methods capable of belief-state reasoning and online exploration.
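One way to picture the "undisclosed until visited" property is a function that masks the true map. The sketch below is only illustrative: the `reveal` helper, the `?` marker for unknown tiles, and the example map are assumptions, not part of any published environment.

```python
def reveal(grid, visited):
    """Return the map as the agent knows it.

    Hypothetical helper: H and F tiles stay hidden ('?') until visited,
    while S and G are assumed to be known from the start.
    """
    known = []
    for r, row in enumerate(grid):
        line = ""
        for c, tile in enumerate(row):
            hidden = tile in "HF" and (r, c) not in visited
            line += "?" if hidden else tile
        known.append(line)
    return known

# Example 4x4 layout, chosen for illustration only.
grid = ["SFFF", "FHFH", "FFFH", "HFFG"]
agent_view = reveal(grid, visited={(0, 0), (0, 1), (1, 1)})
```

A planner working on `agent_view` must treat each `?` as either a frozen tile or a hole, which is what motivates the belief-state reasoning mentioned above.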
2. Optimized Monte Carlo Tree Search in FrozenLake-Obscure
The control strategy prominently studied in this context is an optimized Monte Carlo Tree Search (MCTS), built on cumulative reward and visit count tables and the Upper Confidence Bound for Trees (UCT) formula. The algorithm cycles through four interleaved phases:
- Selection: At each non-terminal node (state $s$), select the action $a$ maximizing
$$a^* = \arg\max_a \left[ \frac{Q(s,a)}{N(s,a)} + c \sqrt{\frac{\ln N(s)}{N(s,a)}} \right]$$
where $Q(s,a)$ is the cumulative reward, $N(s,a)$ is the number of times action $a$ has been taken from $s$, $N(s)$ is the total number of visits to $s$, and $c$ is the exploration constant, set to a fixed default.
- Expansion: Expand upon encountering a state-action pair $(s,a)$ with $N(s,a) = 0$, initializing statistics for the new node.
- Simulation (Rollout): From the new node, follow a random policy (subject to slip) until termination. The episode yields a return $R = 1$ for reaching G, 0 otherwise.
- Backpropagation: Update tables along the path:
$$N(s) \leftarrow N(s) + 1, \qquad N(s,a) \leftarrow N(s,a) + 1, \qquad Q(s,a) \leftarrow Q(s,a) + R$$
The key statistics $Q(s,a)$, $N(s,a)$, and $N(s)$ are maintained in lookup tables for rapid access and update. The average value estimate used in UCT is $Q(s,a)/N(s,a)$.
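The four phases can be sketched with plain lookup tables as follows. This is a minimal sketch under stated assumptions, not the reference implementation: the class name `TableMCTS`, the `step(state, action) -> (next_state, reward, done)` simulator interface, the default `c=1.4`, and the depth limit are all choices made for the example.

```python
import math
import random
from collections import defaultdict

class TableMCTS:
    """UCT planner using cumulative-reward and visit-count tables (sketch)."""

    def __init__(self, step, actions, c=1.4, max_depth=50, rng=None):
        self.step, self.actions, self.c = step, actions, c
        self.max_depth = max_depth
        self.rng = rng or random.Random()
        self.Q = defaultdict(float)  # cumulative reward per (s, a)
        self.N = defaultdict(int)    # visit count per (s, a)
        self.Ns = defaultdict(int)   # visit count per s

    def uct_action(self, s):
        """Return (action, expanded); untried actions are expanded first."""
        untried = [a for a in self.actions if self.N[(s, a)] == 0]
        if untried:
            return self.rng.choice(untried), True

        def score(a):
            n = self.N[(s, a)]
            return self.Q[(s, a)] / n + self.c * math.sqrt(math.log(self.Ns[s]) / n)

        return max(self.actions, key=score), False

    def rollout(self, s, depth):
        """Random policy until termination or the depth limit."""
        for _ in range(depth):
            s, r, done = self.step(s, self.rng.choice(self.actions))
            if done:
                return r
        return 0.0

    def simulate(self, root):
        path, s, reward = [], root, 0.0
        for depth in range(self.max_depth):
            a, expanded = self.uct_action(s)       # selection / expansion
            path.append((s, a))
            s, reward, done = self.step(s, a)
            if done:
                break
            if expanded:                            # simulation from the new node
                reward = self.rollout(s, self.max_depth - depth - 1)
                break
        for s_i, a_i in path:                       # backpropagation
            self.Ns[s_i] += 1
            self.N[(s_i, a_i)] += 1
            self.Q[(s_i, a_i)] += reward

    def plan(self, root, n_sims=100):
        for _ in range(n_sims):
            self.simulate(root)
        # Act on the most visited root action.
        return max(self.actions, key=lambda a: self.N[(root, a)])

# Toy two-armed simulator (hypothetical): action 1 reaches the goal, action 0 a hole.
def bandit(state, action):
    return ("goal", 1.0, True) if action == 1 else ("hole", 0.0, True)

planner = TableMCTS(bandit, actions=[0, 1], rng=random.Random(0))
best = planner.plan("start", n_sims=100)
```

Choosing the most visited root action rather than the highest average is a common design choice because visit counts are less noisy than value estimates after few simulations.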
3. Performance Benchmarks and Metrics
Empirical evaluation was conducted on OpenAI Gym FrozenLake-v0 (4×4 slippery) with 100 simulations per move and 100,000 episodes, comparing:
- Optimized MCTS
- MCTS with Policy Priors
- Q-Learning (ε-greedy, learning rate $\alpha$, discount factor $\gamma$, ε-decay)
Performance was assessed by success rate (proportion reaching G), average reward per episode, convergence rate (average steps to G), and wall-clock execution time. Table 1 summarizes the key findings:
| Algorithm | Avg. Reward | Success Rate (%) | Convergence (steps) | Time (sec) |
|---|---|---|---|---|
| Optimized MCTS | 0.80 | 70 | 40 | 48.4 |
| MCTS with Policy Priors | 0.40 | 35 | 30 | 1758.5 |
| Q-Learning | 0.80 | 60 | 50 | 42.7 |
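Optimized MCTS achieves the highest success rate (70%) at a run time (48.4 s) close to that of Q-learning (42.7 s) and roughly 36 times lower than that of MCTS with policy priors (1758.5 s). MCTS with policy priors needs fewer steps on average (30 versus 40) but succeeds only half as often. Reward and success rate curves indicate that Optimized MCTS converges fastest to high performance.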
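For reference, the tabular Q-learning baseline in this comparison follows the standard ε-greedy update, sketched below. The function names and the values `alpha=0.1` and `gamma=0.99` are placeholders chosen for illustration; the settings used in the benchmark are not reproduced here.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(Q, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Two hand-written transitions on a 4x4 grid (states numbered 0..15, assumed).
actions = [0, 1, 2, 3]
Q = defaultdict(float)
q_update(Q, 14, 2, 1.0, 15, True, actions)    # stepping right from 14 reaches G
q_update(Q, 13, 2, 0.0, 14, False, actions)   # value propagates one tile back
```

Unlike MCTS, this baseline learns a single table across episodes and does no lookahead at decision time.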
4. Hyperparameter Considerations
The exploration constant $c$ modulates the trade-off in UCT between exploration and exploitation:
- Lower $c$: promotes exploitation, risks converging on suboptimal paths, and underperforms in environments with a high slip probability $p_{\text{slip}}$.
- Higher $c$: encourages broad exploration; convergence is slower and sample complexity higher, but performance is better in high-uncertainty scenarios.
For standard slippery grids, the default $c$ balances exploration and exploitation effectively. In environments with variable or hidden $p_{\text{slip}}$, typical of FrozenLake-Obscure, an adaptive $c$ (annealed from high to low) or a state-dependent $c(s)$ further enhances learning by promoting exploration under uncertainty before shifting to exploitation as learning progresses.
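The two adaptive schemes mentioned above could look like the following. Both schedules are assumptions made for illustration: the exponential form, the visit-count bonus, and every numeric default are hypothetical and would need tuning against observed convergence.

```python
import math

def annealed_c(episode, c_start=2.0, c_end=0.5, decay=1e-3):
    """Exponentially anneal the exploration constant from c_start to c_end."""
    return c_end + (c_start - c_end) * math.exp(-decay * episode)

def state_dependent_c(n_visits, c_base=1.0, bonus=1.0):
    """Use a larger constant in rarely visited states, shrinking toward c_base."""
    return c_base + bonus / math.sqrt(1 + n_visits)

schedule = [annealed_c(e) for e in (0, 1000, 5000)]
```

The annealed variant depends only on training progress, whereas the state-dependent variant keeps exploring in regions the agent has seldom seen even late in training.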
5. Extensions to Obscure and Partially Observable Environments
FrozenLake-Obscure introduces partial observability and hidden transition probabilities, necessitating enhancements to standard MCTS:
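- Belief-State MCTS: The tree is expanded in the space of belief states $b$ (distributions over true states), with statistics $Q(b,a)$ and $N(b,a)$ supplanting $Q(s,a)$ and $N(s,a)$.
- Augmented UCT: The selection criterion becomes
$$a^* = \arg\max_a \left[ \frac{Q(b,a)}{N(b,a)} + c \sqrt{\frac{\ln N(b)}{N(b,a)}} + \lambda \, \mathrm{IG}(b,a) \right]$$
where $c$ is the exploration constant, $\mathrm{IG}(b,a)$ is the expected information gain, and $\lambda$ is its weight.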
- Particle Filter Rollouts: Simulations sample possible underlying maps/configurations, enabling the agent to reason about both environment structure and transition uncertainty.
- Information Gain and Curiosity Heuristics: Bonus terms are provided for actions that reduce uncertainty (e.g., visiting unobserved tiles), facilitating “curiosity-driven exploration.”
- Transition Probability Uncertainty: Upper and lower confidence bounds are maintained on transition probabilities to guide rollouts more robustly.
If full POMDP planning becomes computationally infeasible, lightweight heuristics—such as prioritizing rarely observed tile types or state observations—can approximate uncertainty-resolving behavior.
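A compact sketch of three of these ingredients, a particle-based belief update, an entropy measure of uncertainty, and the augmented selection score, is given below. It is an illustration under assumptions rather than the published method: the function names, the rejection-style particle filter, the use of entropy as the uncertainty measure, and the defaults `c=1.4` and `lam=0.5` are all hypothetical.

```python
import math
import random
from collections import Counter

def belief_entropy(particles):
    """Shannon entropy (nats) of the empirical distribution over particles."""
    total = len(particles)
    return -sum((n / total) * math.log(n / total)
                for n in Counter(particles).values())

def update_belief(particles, action, observation, transition, observe, rng):
    """Propagate particles, keep those matching the observation, resample."""
    moved = [transition(p, action, rng) for p in particles]
    kept = [p for p in moved if observe(p) == observation]
    if not kept:
        # Particle depletion: fall back to the propagated set (assumed policy).
        return moved
    return [rng.choice(kept) for _ in particles]

def augmented_uct(q, n_ba, n_b, info_gain, c=1.4, lam=0.5):
    """Average value + UCT exploration term + weighted information gain."""
    if n_ba == 0:
        return math.inf  # untried actions are selected first
    return q / n_ba + c * math.sqrt(math.log(n_b) / n_ba) + lam * info_gain

# Toy example: particles are (row, col) guesses; the agent observes its column.
rng = random.Random(0)
prior = [(0, 0), (1, 0), (0, 1)]
move_right = lambda p, a, rng: (p[0], p[1] + 1)
posterior = update_belief(prior, "right", 1, move_right, lambda p: p[1], rng)
gain = belief_entropy(prior) - belief_entropy(posterior)
```

Here `gain` is the entropy reduction from one observation, which is one plausible stand-in for the information gain term in the augmented criterion.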
6. Practical Recommendations and Implementation Strategies
Empirical findings recommend initializing Q and N tables in the fully observable environment using Optimized MCTS, then gradually complicating the scenario by introducing partial observability and leveraging uncertainty-driven exploration heuristics. Hyperparameters such as the exploration constant $c$, the information gain weight $\lambda$, and the rollout horizon should be adjusted in response to observed convergence metrics. This incremental approach supports effective transfer of learned statistics and manages the increased sample complexity introduced by obscurity.
7. Conclusion and Actionable Insights
The optimized MCTS framework, anchored by cumulative Q and N tables and an exploration constant $c$, demonstrates robust performance (success rate 70%, average reward 0.80) and superior convergence properties in standard slippery FrozenLake. For the FrozenLake-Obscure class, integrating belief-state tracking, information gain bonuses, adaptive exploration/exploitation balance, and heuristics for uncertainty resolution extends the method’s applicability to POMDP-like scenarios (Guerra, 2024). With these extensions, Optimized MCTS establishes a flexible and resilient decision-making paradigm for both fully observable and obscure, partially observable gridworld domains.