- The paper introduces AssistanceZero, a scalable algorithm that leverages learned models and MCTS to solve assistance games in a complex 3D grid world.
- It demonstrates significant improvements over PPO, achieving 79.8% goal completion and drastically reducing human intervention in the MBAG benchmark.
- It outlines critical considerations for human modeling and simulation design, offering a robust alternative to RLHF for training collaborative AI assistants.
Assistance games offer a promising alternative to the standard reinforcement learning from human feedback (RLHF) paradigm for training AI assistants. Unlike RLHF, which trains an assistant to maximize potentially manipulable human feedback, assistance games model the interaction between a human user and an AI assistant as a cooperative two-player game where the assistant is initially uncertain about the shared goal. This framework naturally incentivizes the assistant to learn about the user's objective through interaction and encourages complementary actions rather than mere prediction or replacement of human actions.
However, applying assistance games to complex, real-world-like scenarios has been challenging due to two main difficulties:
- Intractable Decision Making: Solving assistance games requires the assistant to reason and plan under uncertainty over a vast space of possible goals, which is computationally demanding.
- Accurate Human Modeling: Effective assistance depends on the assistant accurately modeling the human user's behavior, which is complex and can deviate from simple optimality assumptions.
This paper tackles these challenges by introducing AssistanceZero, a scalable algorithm for solving assistance games, and evaluating it in the Minecraft Building Assistance Game (MBAG), a new, complex benchmark.
The Minecraft Building Assistance Game (MBAG)
MBAG is designed to be a more realistic testbed for assistance games.
- Environment: A 3D grid-world where human and assistant agents can move, place blocks, and break blocks.
- State: Includes the block configuration of the grid, player locations, and player inventories.
- Action Space: Consists of navigation and parameterized place/break actions, resulting in a very large action space (over 20,000 possible actions in an 11×10×10 grid).
- Reward Function: Shared between the human and assistant, based on the reduction in edit distance to a target goal structure. Correct actions yield +1, incorrect actions -1 (see the sketch below).
- Goals: Sampled from a complex distribution of house structures derived from the CraftAssist dataset [gray_craftassist_2019]. The number of possible goals is extremely large (more than 10^400). The human knows the goal, but the assistant does not.
- Implementation: The environment is implemented as a high-speed Python/C simulator capable of syncing with a real Minecraft instance via the Malmo mod [johnson_malmo_2016] for visualization and human studies. The version used in the paper provides unlimited blocks to simplify resource management.
MBAG fulfills the desiderata of having a complex, structured goal distribution and requiring different levels of goal information for effective assistance (e.g., basic foundation work needs less info than complex details).
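To make the shared reward concrete, here is a minimal sketch (not the paper's implementation) of a reduction-in-edit-distance reward over the block grid, treating edit distance simply as the number of cells whose block type differs from the goal; the block encoding, grid layout, and function names are assumptions for illustration.

```python
import numpy as np

AIR = 0  # assumed encoding for an empty cell

def edit_distance(blocks: np.ndarray, goal: np.ndarray) -> int:
    """Number of grid cells whose block type differs from the goal structure."""
    return int(np.sum(blocks != goal))

def shared_reward(blocks_before: np.ndarray,
                  blocks_after: np.ndarray,
                  goal: np.ndarray) -> int:
    """Shared human/assistant reward: the reduction in edit distance to the goal.

    An action that moves the grid closer to the goal yields +1, an action that
    moves it further away yields -1, and actions that change nothing yield 0.
    """
    return edit_distance(blocks_before, goal) - edit_distance(blocks_after, goal)

# Toy example on an 11 x 10 x 10 grid (the size used in MBAG):
goal = np.full((11, 10, 10), AIR, dtype=np.int8)
goal[0:3, 0, 0:3] = 1                        # a small 3 x 3 foundation of block type 1 (assumed)

before = np.full_like(goal, AIR)
correct = before.copy()
correct[0, 0, 0] = 1                         # placing a block that is part of the goal...
print(shared_reward(before, correct, goal))  # ...yields +1

wrong = before.copy()
wrong[5, 5, 5] = 1                           # placing a block that is not in the goal...
print(shared_reward(before, wrong, goal))    # ...yields -1
```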
AssistanceZero: Scalable Assistance Game Solving
The paper demonstrates that standard model-free deep RL algorithms like PPO struggle in MBAG. PPO, even with modifications like only rewarding the assistant's own actions or adding auxiliary losses (see Appendix Table 5), shows minimal helpfulness: the assistant completes less than 8% of the goal and reduces human actions by only ~4-5. This is attributed to the highly noisy and delayed reward signal in the assistance game setting, making it difficult for PPO to simultaneously learn about the hidden goal and plan effectively.
AssistanceZero addresses this by extending the AlphaZero framework [silver_mastering_2017] to handle the partial observability and multi-agent nature of assistance games.
- Core Idea: Separate goal prediction from action selection using a learned model and Monte Carlo Tree Search (MCTS).
- Neural Network: A recurrent neural network takes the state-action history h as input and has four heads (see the sketch after this list):
- Policy head π_ϕ(a^R | h) for suggesting assistant actions.
- Value head V̂_ϕ(h) for estimating the value of a state/history.
- Reward Parameter Prediction head p̂_ϕ(θ | h) for predicting the distribution over the hidden goal parameters (block types at each location in MBAG).
- Human Action Prediction head p̂_ϕ(a^H | h) for predicting the human's next action.
- MCTS: Uses the learned predictions (p̂_ϕ(θ | h), p̂_ϕ(a^H | h)) to simulate future trajectories under uncertainty. It estimates rewards by marginalizing over the predicted goal distribution and simulates human actions using the predicted human policy. MCTS guides the assistant's action selection.
- Training: Alternates between collecting trajectories using MCTS with the current network and training the network using a multi-component loss function (Equation 1):
- KL divergence between MCTS policy output and network policy head.
- Squared error between MCTS value estimate and network value head.
- Negative log-likelihood for predicting the true goal parameters.
- A KL penalty between current and previous goal predictions to prevent overfitting to recent history.
- Negative log-likelihood for predicting the human's action.
- Efficiency: Uses a variant of MCTS similar to POMCP [silver_monte-carlo_2010] but with a learned model. Employs a "bi-level" action selection in MCTS for structured action spaces like MBAG (Appendix E).
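Below is a minimal PyTorch-style sketch of the four-headed recurrent network and the multi-component loss described above; the layer sizes, tensor shapes, head names, and λ weights are placeholders rather than the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssistantNetwork(nn.Module):
    """Recurrent network over the state-action history h with four heads."""

    def __init__(self, obs_dim, num_actions, num_goal_cells, num_block_types, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, num_actions)    # pi_phi(a^R | h)
        self.value_head = nn.Linear(hidden, 1)                # V_phi(h)
        self.goal_head = nn.Linear(hidden, num_goal_cells * num_block_types)  # p_phi(theta | h)
        self.human_head = nn.Linear(hidden, num_actions)      # p_phi(a^H | h)
        self.num_goal_cells = num_goal_cells
        self.num_block_types = num_block_types

    def forward(self, history):
        # history: (batch, time, obs_dim); the final hidden state summarizes h
        _, h_n = self.encoder(history)
        z = h_n[-1]
        goal_logits = self.goal_head(z).view(-1, self.num_goal_cells, self.num_block_types)
        return {
            "policy_logits": self.policy_head(z),
            "value": self.value_head(z).squeeze(-1),
            "goal_logits": goal_logits,
            "human_logits": self.human_head(z),
        }

def training_loss(out, prev_goal_logits, mcts_policy, mcts_value,
                  true_goal, human_action, lambdas):
    """Sketch of the multi-component loss (cf. Equation 1 in the paper)."""
    # 1. KL divergence between the MCTS policy output and the policy head.
    policy_loss = F.kl_div(F.log_softmax(out["policy_logits"], dim=-1),
                           mcts_policy, reduction="batchmean")
    # 2. Squared error between the MCTS value estimate and the value head.
    value_loss = F.mse_loss(out["value"], mcts_value)
    # 3. Negative log-likelihood of the true goal parameters (block type per cell).
    goal_loss = F.cross_entropy(out["goal_logits"].flatten(0, 1), true_goal.flatten())
    # 4. KL penalty between current and previous goal predictions.
    goal_kl = F.kl_div(F.log_softmax(out["goal_logits"], dim=-1),
                       F.softmax(prev_goal_logits, dim=-1), reduction="batchmean")
    # 5. Negative log-likelihood of the human's next action.
    human_loss = F.cross_entropy(out["human_logits"], human_action)
    return (policy_loss + lambdas["value"] * value_loss + lambdas["goal"] * goal_loss
            + lambdas["goal_kl"] * goal_kl + lambdas["human"] * human_loss)
```

During search, the goal head's distribution can then be used to estimate expected rewards by marginalizing over predicted block types, while the human-action head supplies the simulated human moves.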
AssistanceZero significantly outperforms PPO in simulation (Table 1), achieving higher overall goal completion (79.8% vs 71.6%) and drastically reducing human actions (158 vs 203) while building a substantial portion of the structure itself (27% vs <8%).
Human Modeling for Assistance Games
The choice of human model used during training is crucial for an assistant's performance with real users. The paper explores several modeling approaches:
- Reward-based: Training agents (PPO, AlphaZero) to solve the task alone, assuming humans act optimally or Boltzmann-rationally.
- Data-based (Behavior Cloning - BC): Training a model to imitate human actions recorded in the environment. BC models were trained on human data building alone, with an assistant, and a combination of both. Recurrent networks, data augmentation, and dropout were found important for BC performance (Appendix Table 7).
- Combined (piKL): Using MCTS with a BC-trained policy as a prior [jacob_modeling_2022], balancing reward maximization and human-likeness.
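As a rough illustration of the piKL idea, the sketch below combines per-action value estimates with a BC prior through a KL-regularized softmax, so a single parameter trades off reward maximization against human-likeness; the exact formulation in [jacob_modeling_2022] differs in detail, and the numbers here are toy values.

```python
import numpy as np

def pikl_policy(q_values: np.ndarray, bc_probs: np.ndarray, lam: float) -> np.ndarray:
    """KL-regularized policy: pi(a) proportional to bc_probs(a) * exp(q_values(a) / lam).

    Small lam -> close to pure reward maximization (argmax of q_values);
    large lam -> close to the behavior-cloned (human-like) prior.
    """
    logits = np.log(bc_probs + 1e-12) + q_values / lam
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example with three actions:
q = np.array([1.0, 0.2, 0.0])               # value/search estimates
bc = np.array([0.1, 0.8, 0.1])              # BC prior: humans usually pick action 1
print(pikl_policy(q, bc, lam=0.1))          # ~reward-maximizing: most mass on action 0
print(pikl_policy(q, bc, lam=10.0))         # ~human-like: most mass on action 1
```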
Evaluation showed that pure reward-based models poorly predict human actions and perform the task much faster than humans. BC models predict actions well but tend to suffer from compounding errors when playing alone. piKL models offer the best balance, predicting human actions well and matching human performance in the solo task (Table 2). The piKL model trained on the combined human dataset performed best overall and was chosen for training the main AssistanceZero assistant used in comparisons.
Comparing Assistance Paradigms
The AssistanceZero-trained assistant (using the piKL-combined human model) was compared to analogues of other common AI assistant training pipelines in MBAG:
- Pretraining: Training a recurrent network to predict human actions on a large dataset of simulated human play (like GitHub Copilot/OpenAI Codex pretraining).
- Supervised Fine-Tuning (SFT): Fine-tuning the pretrained model on data of an expert human acting as the assistant (like the SFT stage of RLHF).
In simulation (Table 3), the AssistanceZero assistant substantially outperformed both Pretraining and SFT baselines, achieving higher goal completion and requiring significantly fewer human actions.
A human study with 16 participants building houses in Minecraft confirmed these findings (Figure 1). Participants played alone, with the SFT assistant, with the AssistanceZero assistant, and with an expert human assistant.
- Objective Results: The AssistanceZero assistant significantly reduced the number of place/break actions humans took compared to playing alone (p<0.05). The SFT assistant showed only a minor reduction.
- Subjective Results: On a 5-point helpfulness scale, AssistanceZero was rated 3.1 ± 0.4, significantly higher than the SFT assistant (1.7 ± 0.3). The human expert was rated highest (4.0 ± 0.5), indicating room for improvement.
- Qualitative: Participants noted AssistanceZero's ability to understand implicit intentions and learn from corrections (e.g., correcting incorrectly placed blocks).
Practical Implementation Considerations:
- Computational Resources: Training AssistanceZero requires significant resources, involving parallel environments, MCTS simulations, and large neural networks.
- Environment Design: A custom MBAG simulator that runs much faster than real Minecraft is crucial for making training feasible.
- Human Data: Collecting sufficient and diverse human interaction data is essential for training effective human models and evaluating assistants. The choice of human model (e.g., piKL vs. pure BC) impacts performance and generalization.
- Hyperparameter Tuning: Effective training, especially for algorithms like PPO in this setting, requires extensive hyperparameter tuning. AssistanceZero's loss components also require careful weighting (λ values).
- MCTS Configuration: Parameters like the number of simulations and the exploration constant c_PUCT are critical for MCTS performance and balancing exploration/exploitation and fidelity to the learned model/prior.
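For reference, c_PUCT enters through an AlphaZero-style PUCT selection rule along the lines of the minimal sketch below; the exact variant used alongside bi-level action selection and the learned human/goal models may differ.

```python
import math

def puct_score(q: float, prior: float, visits: int, parent_visits: int,
               c_puct: float) -> float:
    """AlphaZero-style PUCT: exploit the value estimate q, but explore actions
    the policy prior favors and that have been tried rarely so far."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return q + exploration

def select_action(children: dict, parent_visits: int, c_puct: float):
    """Pick the child action with the highest PUCT score.

    `children` maps action -> (q, prior, visits); a hypothetical node layout."""
    return max(children,
               key=lambda a: puct_score(*children[a], parent_visits, c_puct))
```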
Conclusion
The paper successfully demonstrates that complex assistance games can be solved scalably using the proposed AssistanceZero algorithm in the challenging MBAG environment. AssistanceZero, which combines MCTS with learned models of human actions and goals, yields significantly more effective AI assistants compared to methods analogous to current LLM pipelines like SFT. The human study validates these findings and highlights the potential for assistance games to train helpful and collaborative AI agents that can learn from interaction and correction. The authors propose future work on applying assistance games to LLM post-training as a replacement for RLHF, aiming to develop assistants that are more robust, less prone to deception, and better at handling uncertainty and multi-turn interactions.