- The paper introduces AssistanceZero, a scalable algorithm that leverages learned models and MCTS to solve assistance games in a complex 3D grid world.
- It demonstrates significant improvements over PPO, achieving 79.8% goal completion and drastically reducing human intervention in the MBAG benchmark.
- It outlines critical considerations for human modeling and simulation design, offering a robust alternative to RLHF for training collaborative AI assistants.
Assistance games offer a promising alternative to the standard reinforcement learning from human feedback (RLHF) paradigm for training AI assistants. Unlike RLHF, which trains an assistant to maximize potentially manipulable human feedback, assistance games model the interaction between a human user and an AI assistant as a cooperative two-player game where the assistant is initially uncertain about the shared goal. This framework naturally incentivizes the assistant to learn about the user's objective through interaction and encourages complementary actions rather than mere prediction or replacement of human actions.
However, applying assistance games to complex, real-world-like scenarios has been challenging due to two main difficulties:
- Intractable Decision Making: Solving assistance games requires the assistant to reason and plan under uncertainty over a vast space of possible goals, which is computationally demanding.
- Accurate Human Modeling: Effective assistance depends on the assistant accurately modeling the human user's behavior, which is complex and can deviate from simple optimality assumptions.
This paper tackles these challenges by introducing AssistanceZero, a scalable algorithm for solving assistance games, and evaluating it in the Minecraft Building Assistance Game (MBAG), a new, complex benchmark.
The Minecraft Building Assistance Game (MBAG)
MBAG is designed to be a more realistic testbed for assistance games.
- Environment: A 3D grid-world where human and assistant agents can move, place blocks, and break blocks.
- State: Includes the block configuration of the grid, player locations, and player inventories.
- Action Space: Consists of navigation and parameterized place/break actions, resulting in a very large action space (over 20,000 possible actions in an 11×10×10 grid).
- Reward Function: Shared between the human and assistant, based on the reduction in edit distance to a target goal structure. Correct actions yield +1, incorrect actions -1 (see the sketch below).
- Goals: Sampled from a complex distribution of house structures derived from the CraftAssist dataset [gray_craftassist_2019]. The number of possible goals is extremely large (more than 10^400). The human knows the goal, but the assistant does not.
- Implementation: The environment is implemented as a high-speed Python/C simulator capable of syncing with a real Minecraft instance via the Malmo mod [johnson_malmo_2016] for visualization and human studies. The version used in the paper provides unlimited blocks to simplify resource management.
MBAG fulfills the desiderata of having a complex, structured goal distribution and requiring different levels of goal information for effective assistance (e.g., basic foundation work needs less info than complex details).
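To make the shared reward concrete, here is a minimal sketch (not the paper's implementation) of a reduction-in-edit-distance reward over the block grid, treating edit distance simply as the number of cells whose block type differs from the goal; the block encoding, grid layout, and function names are assumptions for illustration.

```python
import numpy as np

AIR = 0  # assumed encoding for an empty cell

def edit_distance(blocks: np.ndarray, goal: np.ndarray) -> int:
    """Number of grid cells whose block type differs from the goal structure."""
    return int(np.sum(blocks != goal))

def shared_reward(blocks_before: np.ndarray,
                  blocks_after: np.ndarray,
                  goal: np.ndarray) -> int:
    """Shared human/assistant reward: the reduction in edit distance to the goal.

    An action that moves the grid closer to the goal yields +1, an action that
    moves it further away yields -1, and actions that change nothing yield 0.
    """
    return edit_distance(blocks_before, goal) - edit_distance(blocks_after, goal)

# Toy example on an 11 x 10 x 10 grid (the size used in MBAG):
goal = np.full((11, 10, 10), AIR, dtype=np.int8)
goal[0:3, 0, 0:3] = 1                        # a small 3 x 3 foundation of block type 1 (assumed)

before = np.full_like(goal, AIR)
correct = before.copy()
correct[0, 0, 0] = 1                         # placing a block that is part of the goal...
print(shared_reward(before, correct, goal))  # ...yields +1

wrong = before.copy()
wrong[5, 5, 5] = 1                           # placing a block that is not in the goal...
print(shared_reward(before, wrong, goal))    # ...yields -1
```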
AssistanceZero: Scalable Assistance Game Solving
The paper demonstrates that standard model-free deep RL algorithms like PPO struggle in MBAG. PPO, even with modifications like only rewarding the assistant's own actions or adding auxiliary losses (see Appendix Table 5), shows minimal helpfulness: the assistant completes less than 8% of the goal and reduces human actions by only ~4-5. This is attributed to the highly noisy and delayed reward signal in the assistance game setting, making it difficult for PPO to simultaneously learn about the hidden goal and plan effectively.
AssistanceZero addresses this by extending the AlphaZero framework [silver_mastering_2017] to handle the partial observability and multi-agent nature of assistance games.
- Core Idea: Separate goal prediction from action selection using a learned model and Monte Carlo Tree Search (MCTS).
- Neural Network: A recurrent neural network takes the state-action history h as input and has four heads (see the sketch after this list):
- Policy head π_ϕ(a^R | h) for suggesting assistant actions.
- Value head V̂_ϕ(h) for estimating the value of a state/history.
- Reward Parameter Prediction head p̂_ϕ(θ | h) for predicting the distribution over the hidden goal parameters (block types at each location in MBAG).
- Human Action Prediction head p̂_ϕ(a^H | h) for predicting the human's next action.
- MCTS: Uses the learned predictions (p̂_ϕ(θ | h), p̂_ϕ(a^H | h)) to simulate future trajectories under uncertainty. It estimates rewards by marginalizing over the predicted goal distribution and simulates human actions using the predicted human policy. MCTS guides the assistant's action selection.
- Training: Alternates between collecting trajectories using MCTS with the current network and training the network using a multi-component loss function (Equation 1):
- KL divergence between MCTS policy output and network policy head.
- Squared error between MCTS value estimate and network value head.
- Negative log-likelihood for predicting the true goal parameters.
- A KL penalty between current and previous goal predictions to prevent overfitting to recent history.
- Negative log-likelihood for predicting the human's action.
- Efficiency: Uses a variant of MCTS similar to POMCP [silver_monte-carlo_2010] but with a learned model. Employs a "bi-level" action selection in MCTS for structured action spaces like MBAG (Appendix E).
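Below is a minimal PyTorch-style sketch of the four-headed recurrent network and the multi-component loss described above; the layer sizes, tensor shapes, head names, and λ weights are placeholders rather than the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssistantNetwork(nn.Module):
    """Recurrent network over the state-action history h with four heads."""

    def __init__(self, obs_dim, num_actions, num_goal_cells, num_block_types, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)    # pi_phi(a^R | h)
        self.value_head = nn.Linear(hidden, 1)                # V_phi(h)
        self.goal_head = nn.Linear(hidden, num_goal_cells * num_block_types)  # p_phi(theta | h)
        self.human_head = nn.Linear(hidden, num_actions)      # p_phi(a^H | h)
        self.num_goal_cells = num_goal_cells
        self.num_block_types = num_block_types

    def forward(self, history):
        # history: (batch, time, obs_dim); the final hidden state summarizes h
        _, h_n = self.encoder(history)
        z = h_n[-1]
        goal_logits = self.goal_head(z).view(-1, self.num_goal_cells, self.num_block_types)
        return {
            "policy_logits": self.policy_head(z),
            "value": self.value_head(z).squeeze(-1),
            "goal_logits": goal_logits,
            "human_logits": self.human_head(z),
        }

def training_loss(out, prev_goal_logits, mcts_policy, mcts_value,
                  true_goal, human_action, lambdas):
    """Sketch of the multi-component loss (cf. Equation 1 in the paper)."""
    # 1. KL divergence between the MCTS policy output and the policy head.
    policy_loss = F.kl_div(F.log_softmax(out["policy_logits"], dim=-1),
                           mcts_policy, reduction="batchmean")
    # 2. Squared error between the MCTS value estimate and the value head.
    value_loss = F.mse_loss(out["value"], mcts_value)
    # 3. Negative log-likelihood of the true goal parameters (block type per cell).
    goal_loss = F.cross_entropy(out["goal_logits"].flatten(0, 1), true_goal.flatten())
    # 4. KL penalty between current and previous goal predictions.
    goal_kl = F.kl_div(F.log_softmax(out["goal_logits"], dim=-1),
                       F.softmax(prev_goal_logits, dim=-1), reduction="batchmean")
    # 5. Negative log-likelihood of the human's next action.
    human_loss = F.cross_entropy(out["human_logits"], human_action)
    return (policy_loss + lambdas["value"] * value_loss + lambdas["goal"] * goal_loss
            + lambdas["goal_kl"] * goal_kl + lambdas["human"] * human_loss)
```

During search, the goal head's distribution can then be used to estimate expected rewards by marginalizing over predicted block types, while the human-action head supplies the simulated human moves.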
AssistanceZero significantly outperforms PPO in simulation (Table 1), achieving higher overall goal completion (79.8% vs 71.6%) and drastically reducing human actions (158 vs 203) while building a substantial portion of the structure itself (27% vs <8%).
Human Modeling for Assistance Games
The choice of human model used during training is crucial for an assistant's performance with real users. The paper explores several modeling approaches:
- Reward-based: Training agents (PPO, AlphaZero) to solve the task alone, assuming humans act optimally or Boltzmann-rationally.
- Data-based (Behavior Cloning - BC): Training a model to imitate human actions recorded in the environment. BC models were trained on human data building alone, with an assistant, and a combination of both. Recurrent networks, data augmentation, and dropout were found important for BC performance (Appendix Table 7).
- Combined (piKL): Using MCTS with a BC-trained policy as a prior [jacob_modeling_2022], balancing reward maximization and human-likeness.
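As a rough illustration of the piKL idea, the sketch below combines per-action value estimates with a BC prior through a KL-regularized softmax, so a single parameter trades off reward maximization against human-likeness; the exact formulation in [jacob_modeling_2022] differs in detail, and the numbers here are toy values.

```python
import numpy as np

def pikl_policy(q_values: np.ndarray, bc_probs: np.ndarray, lam: float) -> np.ndarray:
    """KL-regularized policy: pi(a) proportional to bc_probs(a) * exp(q_values(a) / lam).

    Small lam -> close to pure reward maximization (argmax of q_values);
    large lam -> close to the behavior-cloned (human-like) prior.
    """
    logits = np.log(bc_probs + 1e-12) + q_values / lam
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example with three actions:
q = np.array([1.0, 0.2, 0.0])               # value/search estimates
bc = np.array([0.1, 0.8, 0.1])              # BC prior: humans usually pick action 1
print(pikl_policy(q, bc, lam=0.1))          # ~reward-maximizing: most mass on action 0
print(pikl_policy(q, bc, lam=10.0))         # ~human-like: most mass on action 1
```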
Evaluation showed that pure reward-based models poorly predict human actions and perform the task much faster than humans. BC models predict actions well but tend to suffer from compounding errors when playing alone. piKL models offer the best balance, predicting human actions well and matching human performance in the solo task (Table 2). The piKL model trained on the combined human dataset performed best overall and was chosen for training the main AssistanceZero assistant used in comparisons.
Comparing Assistance Paradigms
The AssistanceZero-trained assistant (using the piKL-combined human model) was compared to analogues of other common AI assistant training pipelines in MBAG:
- Pretraining: Training a recurrent network to predict human actions on a large dataset of simulated human play (like GitHub Copilot/OpenAI Codex pretraining).
- Supervised Fine-Tuning (SFT): Fine-tuning the pretrained model on data of an expert human acting as the assistant (like the SFT stage of RLHF).
In simulation (Table 3), the AssistanceZero assistant substantially outperformed both Pretraining and SFT baselines, achieving higher goal completion and requiring significantly fewer human actions.
A human study with 16 participants building houses in Minecraft confirmed these findings (Figure 1). Participants played alone, with the SFT assistant, with the AssistanceZero assistant, and with an expert human assistant.
- Objective Results: The AssistanceZero assistant significantly reduced the number of place/break actions humans took compared to playing alone (p<0.05). The SFT assistant showed only a minor reduction.
- Subjective Results: On a 5-point helpfulness scale, AssistanceZero was rated 3.1 ± 0.4, significantly higher than the SFT assistant (1.7 ± 0.3). The human expert was rated highest (4.0 ± 0.5), indicating room for improvement.
- Qualitative: Participants noted AssistanceZero's ability to understand implicit intentions and learn from corrections (e.g., correcting incorrectly placed blocks).
Practical Implementation Considerations:
- Computational Resources: Training AssistanceZero requires significant resources, involving parallel environments, MCTS simulations, and large neural networks.
- Environment Design: A custom MBAG simulator that runs much faster than real Minecraft is crucial for making training feasible.
- Human Data: Collecting sufficient and diverse human interaction data is essential for training effective human models and evaluating assistants. The choice of human model (e.g., piKL vs. pure BC) impacts performance and generalization.
- Hyperparameter Tuning: Effective training, especially for algorithms like PPO in this setting, requires extensive hyperparameter tuning. AssistanceZero's loss components also require careful weighting (λ values).
- MCTS Configuration: Parameters like the number of simulations and the exploration constant c_PUCT are critical for MCTS performance and balancing exploration/exploitation and fidelity to the learned model/prior.
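For reference, c_PUCT enters through an AlphaZero-style PUCT selection rule along the lines of the minimal sketch below; the exact variant used alongside bi-level action selection and the learned human/goal models may differ.

```python
import math

def puct_score(q: float, prior: float, visits: int, parent_visits: int,
               c_puct: float) -> float:
    """AlphaZero-style PUCT: exploit the value estimate q, but explore actions
    the policy prior favors and that have been tried rarely so far."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return q + exploration

def select_action(children: dict, parent_visits: int, c_puct: float):
    """Pick the child action with the highest PUCT score.

    `children` maps action -> (q, prior, visits); a hypothetical node layout."""
    return max(children,
               key=lambda a: puct_score(*children[a], parent_visits, c_puct))
```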
Conclusion
The paper successfully demonstrates that complex assistance games can be solved scalably using the proposed AssistanceZero algorithm in the challenging MBAG environment. AssistanceZero, which combines MCTS with learned models of human actions and goals, yields significantly more effective AI assistants compared to methods analogous to current LLM pipelines like SFT. The human study validates these findings and highlights the potential for assistance games to train helpful and collaborative AI agents that can learn from interaction and correction. The authors propose future work on applying assistance games to LLM post-training as a replacement for RLHF, aiming to develop assistants that are more robust, less prone to deception, and better at handling uncertainty and multi-turn interactions.