Multi-Stage Temporal Difference Learning for 2048-like Games
(1606.07374v2)
Published 23 Jun 2016 in cs.LG
Abstract: Szubert and Jaskowski successfully used temporal difference (TD) learning together with n-tuple networks for playing the game 2048. However, we observed that programs based on TD learning still rarely reach large tiles. In this paper, we propose multi-stage TD (MS-TD) learning, a kind of hierarchical reinforcement learning method, to effectively improve the rates of reaching large tiles, which are good metrics for analyzing the strength of 2048 programs. Our experiments showed significant improvements over the program without MS-TD learning. Namely, using 3-ply expectimax search, the program with MS-TD learning reached 32768-tiles at a rate of 18.31%, while the one with TD learning did not reach any. After further tuning, our 2048 program reached 32768-tiles at a rate of 31.75% over 10,000 games, and one of these games even reached a 65536-tile, which is, to our knowledge, the first time a 65536-tile has been reached. In addition, the MS-TD learning method can easily be applied to other 2048-like games, such as Threes. Our experiments for Threes demonstrated a similar improvement: the program with MS-TD learning reached 6144-tiles at a rate of 7.83%, while the one with TD learning reached them at a rate of only 0.45%.
The paper demonstrates that splitting the learning into stages significantly improves the rate of reaching large tiles compared to standard TD methods.
It employs event-triggered stage transitions with stage-specific feature weights, adapting the value function as game states evolve.
Integrating the stage-aware value functions with expectimax search enhances decision-making while balancing computational cost.
This paper (Yeh et al., 2016) addresses a limitation of standard Temporal Difference (TD) learning when applied to 2048-like games: while effective at maximizing average scores, it often struggles to consistently reach very large tiles (such as 32768 or 65536 in 2048), which are indicators of strong play. The authors propose Multi-Stage Temporal Difference (MS-TD) learning, a hierarchical reinforcement learning approach, to overcome this.
The core idea of MS-TD learning is to divide the learning process into multiple stages, where each stage focuses on improving play after a certain level of game progression has been reached. This progression is defined by the creation of specific large tiles. For 2048, a simple 3-stage strategy uses splitting points based on the first time a 16384-tile (T16k) is created and the first time both a 16384-tile and an 8192-tile (T16+8k) are present simultaneously.
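As a minimal illustration of such event-triggered splitting, the sketch below encodes the simple 3-stage strategy in Python; the flat board of tile exponents (0 for empty, k for a 2^k tile) and the function names are assumptions made for the sketch, not the paper's code.

```python
# Minimal sketch of the simple 3-stage splitting criteria for 2048.
# Boards are assumed to be flat tuples of 16 tile exponents (0 = empty,
# k = tile 2**k, so 14 = the 16384-tile); this encoding is an illustrative choice.

def has_tile(board, exponent):
    """True if a tile of value 2**exponent is on the board."""
    return any(c == exponent for c in board)

def next_stage(board, current_stage):
    """Event-triggered transitions of the simple 3-stage strategy: a game never
    falls back to an earlier stage, so only the next splitting point is checked."""
    if current_stage == 0 and has_tile(board, 14):                          # T16k: first 16384-tile
        return 1
    if current_stage == 1 and has_tile(board, 14) and has_tile(board, 13):  # T16+8k: 16384 and 8192 together
        return 2
    return current_stage
```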
Implementation of MS-TD Learning:
Define Stages and Splitting Criteria: Determine the thresholds (specific large tile combinations) that define the transitions between stages. These are event-based, triggered by the game state reaching a certain configuration of large tiles. For 2048, examples include T16k (first 16384-tile) and T16+8k (first 16384-tile and 8192-tile). For Threes, thresholds like T1536 and T3072 are used.
Stage-Specific Feature Weights: Each stage utilizes its own independent set of feature weights. These weights are typically initialized to zero for each new stage.
Training Procedure:
Stage 1: Train a TD learning agent using a standard approach (like TD(0) evaluating afterstates with n-tuple networks) on many games from the initial state until the learning saturates (average scores stabilize).
Collecting Data for Next Stage: After Stage 1 training is complete, continue playing games using the Stage 1 weights. Whenever the game state first meets the criterion for the next stage's splitting point (e.g., T16k), save this board state and the total score accumulated up to that point. Collect a sufficient number of such board samples (e.g., 100,000).
Subsequent Stages (Stage 2, 3, etc.): Initialize a new set of feature weights for this stage. Train a TD learning agent starting from the boards collected at the previous stage's splitting point: each training game begins from one of the collected boards, and the score accumulated before the splitting point in the original game is added to the rewards obtained during this stage's training. The collected boards are typically reused in a round-robin manner. This process is repeated for each subsequent stage (the pipeline is sketched after this list).
Playing Procedure: When playing a live game, the program tracks which stage the current board state corresponds to based on the defined splitting criteria. It then uses the specific set of feature weights trained for that stage to evaluate board positions and guide move selection.
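The staging logic of this pipeline can be sketched in Python. In the sketch below, NTupleNetwork, play_td_game, and new_game are hypothetical stand-ins for the paper's TD(0) afterstate learner and the game environment (an n-tuple sketch follows under Feature Representation); only the stage handling, board collection, and score carry-over follow the procedure described above.

```python
# Hedged sketch of the MS-TD training pipeline described in the steps above.
# NTupleNetwork, play_td_game and new_game are hypothetical stand-ins for the
# paper's TD(0) afterstate learner and the 2048 environment; only the staging
# logic (separate weights per stage, boards collected at splitting points,
# carried-over scores) reflects the method itself.

def train_ms_td(num_stages, splitting_reached, games_per_stage,
                samples_per_split=100_000):
    """splitting_reached(stage, board) -> True once the splitting point that
    opens `stage` has been reached (e.g. T16k for stage 1)."""
    weights = [NTupleNetwork() for _ in range(num_stages)]   # fresh weights per stage

    # Stage 1: ordinary TD(0) afterstate learning from the initial position.
    for _ in range(games_per_stage):
        play_td_game(weights[0], start_board=new_game(), start_score=0, learn=True)

    for stage in range(1, num_stages):
        # Collect boards the first time games under the previous stage's weights
        # hit this stage's splitting point, together with the score so far.
        pool = []
        while len(pool) < samples_per_split:
            result = play_td_game(weights[stage - 1], start_board=new_game(),
                                  start_score=0, learn=False,
                                  stop_when=lambda b: splitting_reached(stage, b))
            if result is not None:            # this game actually reached the split
                pool.append(result)           # (board at split, score before split)

        # Train this stage from the collected boards (reused round-robin),
        # carrying the pre-split score into the rewards seen during training.
        for i in range(games_per_stage):
            board, score = pool[i % len(pool)]
            play_td_game(weights[stage], start_board=board, start_score=score, learn=True)

    return weights
```

At play time, the agent would track its stage with something like next_stage above and evaluate positions with weights[stage], as illustrated in the expectimax sketch further below.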
Feature Representation:
The paper utilizes n-tuple networks, which evaluate a state by summing up weights associated with specific configurations of tiles within defined tuples (groups of cells) on the board. The authors modified the n-tuple configurations from previous work [21], using larger 6-tuples and covering different patterns (Fig 4(b)). Critically, they also introduced large-tile features. These are explicit features representing the counts or presence of specific large tiles (e.g., number of 2048s, 4096s, 8192s, etc., on the board). These large-tile features are crucial because they directly signal the "difficulty" or stage of the game, allowing the different stages' feature weights to learn values appropriate for states containing these large tiles.
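Purely as an illustration of this kind of evaluation function, the sketch below combines a small n-tuple network with large-tile count features. The tuple coordinates, table sizes, exponent encoding, and learning rate are assumptions made for the sketch, not the paper's configuration (which uses its own 6-tuple patterns, Fig 4(b)).

```python
import numpy as np

# Illustrative n-tuple value function with large-tile features.
# Boards are flat tuples of 16 tile exponents (0 = empty, k = tile 2**k).

class NTupleNetwork:
    # Each tuple is a set of board indices (0..15 on a flattened 4x4 board).
    # Two example 6-tuples only; the paper's patterns differ.
    TUPLES = [
        (0, 1, 2, 3, 4, 5),
        (4, 5, 6, 7, 8, 9),
    ]
    MAX_EXPONENT = 16            # exponents 0..15 cover empty up to the 32768-tile

    def __init__(self, alpha=0.0025):
        self.alpha = alpha
        # One lookup table per tuple: 16**6 float32 entries (~64 MB each).
        self.tables = [np.zeros(self.MAX_EXPONENT ** len(t), dtype=np.float32)
                       for t in self.TUPLES]
        # Large-tile features: one weight per count (0..4) of each large tile.
        self.large_tiles = (11, 12, 13, 14, 15)   # exponents of 2048 .. 32768
        self.large_table = np.zeros((len(self.large_tiles), 5), dtype=np.float32)

    def _index(self, board, tup):
        idx = 0
        for cell in tup:
            idx = idx * self.MAX_EXPONENT + board[cell]
        return idx

    def value(self, board):
        v = sum(table[self._index(board, tup)]
                for table, tup in zip(self.tables, self.TUPLES))
        for i, e in enumerate(self.large_tiles):
            count = min(sum(1 for c in board if c == e), 4)
            v += self.large_table[i, count]
        return float(v)

    def update(self, board, td_error):
        # TD(0)-style update: nudge every active weight by alpha * error.
        step = self.alpha * td_error
        for table, tup in zip(self.tables, self.TUPLES):
            table[self._index(board, tup)] += step
        for i, e in enumerate(self.large_tiles):
            count = min(sum(1 for c in board if c == e), 4)
            self.large_table[i, count] += step
```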
Integrating with Expectimax Search:
MS-TD learned values represent the expected future score from a given afterstate (a board state after a player move but before a new tile appears). These learned values V(s′) are used as the heuristic evaluation function for the leaf nodes in an expectimax search tree.
Max Nodes: Represent states where the player makes a move. The value is the maximum of the values of the resulting afterstates.
Chance Nodes: Represent states where a new tile appears randomly. The value is the weighted average of the values of the resulting states, based on the probability of each new tile/position combination (e.g., 90% chance of a '2' tile, 10% chance of a '4' tile, placed randomly in an empty cell).
Evaluation: For a search depth of k ply, the search explores k player moves and the intervening chance-node outcomes. At the search horizon, the board state is evaluated with the learned MS-TD value function V(s′), taken from the set of feature weights corresponding to the game's current stage. Expectimax backs these values up the tree, and the player chooses the move at the root max node with the highest expectimax value.
A 1-ply expectimax search is equivalent to simply choosing the move that leads to the afterstate with the highest learned MS-TD value. Deeper search allows the agent to look ahead further.
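A compact sketch of this search is given below. legal_moves and play_move (which returns the afterstate and the merge reward) are hypothetical helpers over the same flat exponent-encoded board as the earlier sketches; the 90%/10% tile probabilities and the leaf evaluation by the learned afterstate value follow the description above.

```python
# Hedged expectimax sketch with a learned afterstate evaluation at the leaves.
# legal_moves and play_move are hypothetical helpers for the 2048 rules.

TWO, FOUR = 1, 2   # exponents of the randomly spawned tiles (2 and 4)

def expectimax_move(board, depth, value_fn):
    """Root max node: pick the move with the highest reward + chance value."""
    best_move, best_value = None, float("-inf")
    for move in legal_moves(board):
        afterstate, reward = play_move(board, move)          # before the random tile
        v = reward + chance_value(afterstate, depth - 1, value_fn)
        if v > best_value:
            best_move, best_value = move, v
    return best_move

def chance_value(afterstate, depth, value_fn):
    """Chance node: a 2 (90%) or a 4 (10%) appears uniformly in an empty cell."""
    if depth == 0:
        return value_fn(afterstate)                          # MS-TD leaf evaluation
    empty = [i for i, c in enumerate(afterstate) if c == 0]
    total = 0.0
    for cell in empty:
        for tile, prob in ((TWO, 0.9), (FOUR, 0.1)):
            child = list(afterstate)
            child[cell] = tile
            total += (prob / len(empty)) * max_value(tuple(child), depth, value_fn)
    return total

def max_value(board, depth, value_fn):
    """Interior max node: best reward + chance value over the legal moves."""
    moves = legal_moves(board)
    if not moves:
        return 0.0                                           # dead position, no future reward
    best = float("-inf")
    for move in moves:
        afterstate, reward = play_move(board, move)
        best = max(best, reward + chance_value(afterstate, depth - 1, value_fn))
    return best
```

Under MS-TD, value_fn would be the value function of the current stage's weight set, with the stage advanced on splitting events as in the next_stage sketch; calling expectimax_move with depth 1 reduces to a greedy choice over immediate reward plus afterstate value, matching the 1-ply case above.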
Practical Results and Considerations:
Improved Large Tile Reaching Rates: MS-TD significantly improved the rate of reaching large tiles compared to standard TD. For 2048, the 32768-tile reaching rate increased from 0% (standard TD) to 13.78% (MS-TD Strategy 3, 3-ply expectimax). After further improvements, this rate reached 31.75%, and a 65536-tile was observed. For Threes, the 6144-tile rate improved from 0.45% to 7.83% (simple 3-stage, 3-ply expectimax). This validates the paper's hypothesis that MS-TD is better suited for optimizing performance metrics tied to difficult, later-game states.
Splitting Strategy: The choice of splitting points matters. Too few stages might not capture the changing difficulty, while too many might offer diminishing returns or increase complexity without significant performance gains. Empirically, Strategy 3 (splitting at T16k, T16+8k, T16+8+4k, T16+8+4+2k) performed well for 2048. The optimal strategy is likely game-dependent and requires tuning.
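For concreteness, Strategy 3's splitting points can be written as ordered predicates, reusing the hypothetical has_tile helper and exponent encoding from the earlier sketch.

```python
# Strategy 3's four splitting points (giving five stages), with exponents
# 14=16384, 13=8192, 12=4096, 11=2048.  A new weight set is trained for the
# stage entered at each of these events.
STRATEGY_3_SPLITS = [
    lambda b: has_tile(b, 14),                                # T16k
    lambda b: has_tile(b, 14) and has_tile(b, 13),            # T16+8k
    lambda b: all(has_tile(b, e) for e in (14, 13, 12)),      # T16+8+4k
    lambda b: all(has_tile(b, e) for e in (14, 13, 12, 11)),  # T16+8+4+2k
]
```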
Computational Cost: MS-TD requires training multiple sets of feature weights, one for each stage. This increases the total training time and memory required to store the weights. Playing a game also involves accessing multiple weight tables and determining the current stage, introducing some slight overhead compared to a single-stage TD system. Collecting the samples for subsequent stages also requires playing a large number of games in the preceding stage.
Feature Engineering: The inclusion of explicit large-tile features was important for the performance of both the standard TD and MS-TD agents. This highlights the continued relevance of domain-specific feature engineering even in learning-based approaches.
Further Improvements: Techniques like adding more general game features (empty cells, mergeable pairs), carefully adjusting the learning rate (decaying or using smaller values after saturation), and using variants of TD(λ) (like the described TD(2) variant with limited n-step returns and offline updates) can further boost performance. The choice to use offline updates for TD(2) was a practical consideration given the sparsity of features and the length of games, simplifying implementation compared to online eligibility traces.
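As one possible reading of the multi-step, offline-update variant mentioned above (the paper's exact formulation may differ), a minimal sketch of an offline n-step afterstate update:

```python
# Hedged sketch of an offline n-step afterstate update (an assumption about the
# variant's form, not the paper's code).  `network` is an afterstate value
# function with value()/update() methods, e.g. the n-tuple sketch earlier.

def offline_nstep_update(network, afterstates, rewards, n=2):
    """afterstates[t] is the afterstate produced by move t and rewards[t] the
    merge reward of move t.  All targets are computed from the frozen episode
    first, then every update is applied (offline)."""
    T = len(afterstates)
    targets = []
    for t in range(T):
        g = sum(rewards[t + 1:t + 1 + n])               # up to n future rewards
        if t + n < T:
            g += network.value(afterstates[t + n])      # bootstrap inside the episode
        targets.append(g)                               # beyond the end, terminal value is 0
    for s, g in zip(afterstates, targets):
        network.update(s, g - network.value(s))
```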
Expectimax Depth: Deeper expectimax search generally improves performance but significantly decreases the speed (moves/sec). There's a clear trade-off between playing strength (higher ply) and computational feasibility.
In summary, MS-TD learning provides a practical framework for applying TD learning to games like 2048 and Threes where the nature of states changes significantly as the game progresses and larger tiles appear. By segmenting the learning process based on game milestones and training stage-specific value functions, it effectively addresses the challenge of optimizing for rare, high-value outcomes like reaching maximum tiles, complementing the average-score optimization of standard TD. Implementing MS-TD involves careful definition of stages, managing multiple sets of weights, collecting representative game states at stage transitions, and integrating the stage-aware value function into a search algorithm like expectimax.