The paper "Training LLMs for Social Deduction with Multi-Agent Reinforcement Learning" (Sarkar et al., 9 Feb 2025 ) presents a framework for enabling LLM based agents to learn effective natural language communication strategies within multi-agent settings, specifically focusing on social deduction games, without reliance on human demonstration data. The methodology employs Multi-Agent Reinforcement Learning (MARL) and decomposes the communication problem into distinct listening and speaking components, guided by a dense reward signal derived from predicting critical game state information.
MARL Framework for Social Deduction
The environment is modeled as a Partially Observable Markov Game (POMG), augmented with a social deduction objective: identifying a specific element from a set (e.g., the imposter's identity $z^*$). Each agent $i$'s policy $\pi_i$ is parameterized by an LLM, specifically RWKV, selected for its recurrent architecture suited to processing the long action-observation histories $\tau_t$ prevalent in MARL and for its efficiency during RL fine-tuning. The agent's history is formatted as a sequence of text tokens. Action selection involves the LLM predicting the next token, constrained to the set of valid actions provided by the environment simulator.
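A minimal sketch of this constrained decoding step, assuming each legal action corresponds to a single token id; the function name and tensor shapes are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def select_action(next_token_logits: torch.Tensor, valid_action_token_ids: list[int]) -> int:
    """Sample the next action token, restricted to the simulator's legal action set.

    next_token_logits: [vocab_size] logits from the policy LLM at the current step.
    valid_action_token_ids: token ids of the currently valid actions.
    """
    mask = torch.full_like(next_token_logits, float("-inf"))
    mask[valid_action_token_ids] = 0.0                    # keep only legal action tokens
    probs = F.softmax(next_token_logits + mask, dim=-1)   # renormalize over the legal set
    return int(torch.multinomial(probs, num_samples=1))
```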
A baseline MARL approach utilizes Proximal Policy Optimization (PPO) to optimize the agent policies $\pi_\theta$. The objective maximizes the expected discounted sum of sparse environmental rewards (e.g., +1 for winning, -1 for losing, small rewards for task completion) via the clipped surrogate

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid \tau_t)}{\pi_{\theta_{\text{old}}}(a_t \mid \tau_t)},$$

where $\hat{A}_t$ is the advantage estimate.
To mitigate catastrophic forgetting of language capabilities and prevent the policy from diverging into non-linguistic action sequences during RL optimization, a KL-divergence regularization term is added to the PPO loss. This penalizes deviations from the base pre-trained LLM $\pi_{\text{base}}$:

$$\mathcal{L}_{\text{KL}} = D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid \tau_t)\ \big\|\ \pi_{\text{base}}(\cdot \mid \tau_t)\right).$$
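A compact sketch of how such a KL-regularized PPO update might be computed per token; the function name, the sample-based KL estimator, and the coefficient value are assumptions, not taken from the paper:

```python
import torch

def ppo_kl_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                logp_base: torch.Tensor, advantages: torch.Tensor,
                clip_eps: float = 0.2, kl_coef: float = 0.1) -> torch.Tensor:
    """Clipped PPO surrogate plus a KL penalty toward the frozen pre-trained LLM.

    All inputs are 1-D tensors over sampled action tokens: log-probabilities under
    the current policy, the behavior (old) policy, and the frozen base LLM.
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    # Sample-based estimate of KL(pi_theta || pi_base) on the visited tokens.
    kl_penalty = (logp_new - logp_base).mean()
    return -surrogate.mean() + kl_coef * kl_penalty
```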
Training employs iterated self-play, where agents are trained against frozen policies from previous iterations. Specifically, crewmates train against past imposter policies, while imposters train adversarially (using an inverted speaking reward, discussed later) against crewmate policies. To enhance robustness and prevent convergence to exploitable conventions (e.g., all agents remaining silent), heterogeneous team training is used: one crewmate agent is kept frozen at the listening-only policy $\pi_{\text{RL+listen}}$ (detailed below), simulating ad-hoc teamwork scenarios.
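A high-level sketch of this alternation; `train_crew` and `train_imp` stand in for full PPO training phases against a frozen opponent pool, and all names are illustrative rather than taken from the paper's codebase:

```python
def iterated_self_play(init_crew, init_imp, train_crew, train_imp, n_iters: int = 5):
    """Alternate crewmate and imposter training against frozen opponents.

    init_crew / init_imp return initial policies; train_crew / train_imp are
    callables that run a full RL training phase and return the updated policy.
    """
    crew_pool, imp_pool = [init_crew()], [init_imp()]
    for _ in range(n_iters):
        # Crewmates train against frozen past imposters, then imposters train
        # adversarially against the updated crewmate pool.
        crew_pool.append(train_crew(frozen_imposters=list(imp_pool)))
        imp_pool.append(train_imp(frozen_crewmates=list(crew_pool)))
    return crew_pool[-1], imp_pool[-1]
```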
Furthermore, a world modeling loss $\mathcal{L}_{\text{world}}$ is incorporated. This auxiliary loss trains the LLM policy to predict the next observation token $o_{t+1}$ based on the current history $\tau_t$ and action $a_t$. This objective aids in stabilizing training for the recurrent RWKV architecture, preserves the model's understanding of environmental dynamics encoded in observations, and discourages the policy from over-utilizing action tokens during discussion phases.
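As a sketch, this auxiliary objective can be implemented as next-token cross-entropy restricted to observation positions; the helper name and tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def world_model_loss(logits: torch.Tensor, targets: torch.Tensor,
                     obs_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on next-token prediction, counted only where the target
    token belongs to an environment observation.

    logits: [T, vocab_size] policy LLM outputs over a trajectory.
    targets: [T] next tokens in the trajectory.
    obs_mask: [T] boolean mask marking observation tokens.
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (per_token * obs_mask.float()).sum() / obs_mask.float().sum().clamp(min=1.0)
```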
Decomposed Communication Learning: Listening and Speaking
The core idea is to decouple the complex communication learning problem into two sub-problems: understanding incoming information (listening) and generating useful outgoing information (speaking).
Listening via Imposter Prediction
The listening skill is framed as a supervised learning task focused on the core deduction objective: predicting the imposter's identity $z^*$. During specific points in the game (e.g., between discussion messages), the environment queries each crewmate agent for its belief about the imposter. The agent is trained to maximize the probability assigned to the true imposter given its current history $\tau_t$. The loss function is the negative log-likelihood:

$$\mathcal{L}_{\text{listen}} = -\log \pi_\theta\!\left(z^* \mid \tau_t\right).$$

This loss is computed only at query steps and utilizes the ground truth imposter identity provided by the simulator. This provides a direct, dense signal for interpreting observations and messages in the context of the primary game goal, effectively grounding the communication. The resulting policy optimized solely with RL and this listening loss is denoted $\pi_{\text{RL+listen}}$.
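A minimal sketch of the query-step loss, assuming the environment exposes the candidate players and the ground-truth imposter index; the names are illustrative:

```python
import torch

def listening_loss(belief_logits: torch.Tensor, true_imposter_idx: int) -> torch.Tensor:
    """Negative log-likelihood of the true imposter at a belief-query step.

    belief_logits: [num_players] scores the policy LLM assigns to each candidate
    when the environment asks "who is the imposter?".
    true_imposter_idx: index of the ground-truth imposter from the simulator.
    """
    log_probs = torch.log_softmax(belief_logits, dim=-1)
    return -log_probs[true_imposter_idx]
```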
Speaking via Reinforced Discussion Learning
The speaking skill is trained using RL, guided by a novel dense reward signal that measures the immediate causal impact of an agent's message on the beliefs of its teammates. When agent $i$ sends a message $m_t$ at time $t$, resulting in updated histories $\tau^{j}_{t+1}$ for the other agents, the speaking reward is calculated as the change in the sum of probabilities assigned to the true imposter $z^*$ by all other living crewmates $j \neq i$:

$$r^{\text{speak}}_t = \sum_{j \in \mathcal{C}_t \setminus \{i\}} P_{\pi_j}\!\left(z^* \mid \tau^{j}_{t+1}\right) \;-\; \sum_{j \in \mathcal{C}_t \setminus \{i\}} P_{\pi_j}\!\left(z^* \mid \tau^{j}_{t}\right),$$

where $\mathcal{C}_t$ denotes the set of living crewmates, the first sum is evaluated after the message has been processed, and the second is the corresponding sum before the message (at time $t$).

This reward directly incentivizes messages that positively influence the team's collective certainty about the imposter. For imposter training, the adversarial reward $-r^{\text{speak}}_t$ is used to encourage messages that confuse or mislead crewmates. This reward term is incorporated into the PPO objective for message-generation actions.
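A sketch of the reward computation, assuming each listener's belief in the true imposter is queried immediately before and after the message is processed; the dict-based interface is an assumption:

```python
def speaking_reward(beliefs_before: dict[str, float],
                    beliefs_after: dict[str, float],
                    speaker_is_imposter: bool = False) -> float:
    """Belief-shift reward for a single discussion message.

    beliefs_before / beliefs_after map each other living crewmate's id to the
    probability it assigns to the true imposter just before / after processing
    the speaker's message.
    """
    delta = sum(beliefs_after.values()) - sum(beliefs_before.values())
    # Crewmates are rewarded for raising teammates' certainty about the imposter;
    # imposters receive the inverted (adversarial) reward.
    return -delta if speaker_is_imposter else delta
```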
Integrated Objective
The final policy, $\pi_{\text{full}}$, is trained by combining the standard PPO loss for environmental rewards, the supervised listening loss $\mathcal{L}_{\text{listen}}$, and the PPO loss incorporating the speaking reward $r^{\text{speak}}_t$, along with the KL regularization and world modeling loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{PPO}} + \lambda_{\text{speak}}\,\mathcal{L}_{\text{speak}} + \lambda_{\text{listen}}\,\mathcal{L}_{\text{listen}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{world}}\,\mathcal{L}_{\text{world}},$$

where $\mathcal{L}_{\text{speak}}$ represents the PPO loss component derived from the speaking reward $r^{\text{speak}}_t$, and the $\lambda$ terms are hyperparameters balancing the different objectives.
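As a sketch, the combined objective reduces to a weighted sum of the individual losses; the weight values below are placeholders, not the paper's hyperparameters:

```python
def total_loss(l_ppo_env, l_ppo_speak, l_listen, l_world, l_kl,
               w_speak=1.0, w_listen=1.0, w_world=1.0, w_kl=0.1):
    """Weighted sum of the environmental PPO loss, the speaking-reward PPO loss,
    the supervised listening loss, the world-modeling loss, and the KL penalty."""
    return (l_ppo_env + w_speak * l_ppo_speak + w_listen * l_listen
            + w_world * l_world + w_kl * l_kl)
```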
Role of Dense Reward Signal
The prediction of the imposter's identity serves as a crucial dense reward signal. Standard MARL in such settings often suffers from sparse rewards (win/loss occurs only at the end of an episode), making credit assignment difficult, especially for communicative actions whose benefits are indirect and delayed.
By contrast, the belief-based signals provide immediate feedback:
- Listening: The loss $\mathcal{L}_{\text{listen}}$ provides frequent supervised signals during the game, guiding the agent to extract relevant information from observations and dialogue.
- Speaking: The reward $r^{\text{speak}}_t$ provides immediate reinforcement after each message, directly correlating the message content with its effect on teammates' critical beliefs ($P_{\pi_j}(z^* \mid \tau^{j}_t)$).
This grounds the natural language communication in the underlying game objective, pushing agents to produce strategically relevant utterances rather than just syntactically correct text.
Experimental Validation and Results
The framework was evaluated in a custom 2D grid-world environment inspired by Among Us. Agents navigate, perform tasks, report bodies, engage in free-form text discussions, and vote. The environment simulates partial observability, task dependencies, kill mechanics, and voting dynamics.
Performance Metrics
- Base LLMs (RWKV 1.5B, 7B) without MARL training achieved low win rates (<20%).
- Standard PPO ($\pi_{\text{RL}}$) improved performance but struggled with the deduction aspect.
- Adding the listening loss ($\pi_{\text{RL+listen}}$) significantly boosted the crewmate win rate.
- The full model incorporating the speaking reward ($\pi_{\text{full}}$) achieved the highest crewmate win rate, approximately doubling the win rate of the standard RL baseline ($\pi_{\text{RL}}$) and substantially outperforming even larger base LLMs.
Emergent Behaviors
Qualitative analysis revealed the emergence of sophisticated, human-like communication strategies:
- Accusations: Agents directly accuse others based on observed behavior or inconsistencies.
- Evidence Provision: Agents support accusations with justifications derived from their observations (e.g., "Player Green was seen leaving the room where the body was found"). Agents also learned to sometimes fabricate evidence convincingly.
- Imposter Counter-Play: Imposters learned to deflect suspicion, make counter-accusations, and mimic crewmate communication patterns to blend in.
Robustness and Ablations
- Iterated self-play with adversarial imposter training resulted in robust crewmate policies, evidenced by a rapidly narrowing exploitability gap against evolving imposter strategies.
- Ablation studies confirmed the significant contributions of both the listening loss ($\mathcal{L}_{\text{listen}}$) and the speaking reward ($r^{\text{speak}}_t$) to the final performance. The listening component ($\mathcal{L}_{\text{listen}}$) appeared particularly critical for grounding the agents' understanding and enabling effective deduction.
Conclusion
The research demonstrates a viable method for training LLM agents to use natural language communication effectively in complex MARL environments like social deduction games, bypassing the need for human communication data. By decomposing communication into listening (trained via supervised prediction of game state) and speaking (trained via RL with belief-based rewards), and leveraging the imposter prediction task as a dense reward signal, the framework enables agents to develop robust and strategically relevant communication behaviors, leading to significant performance improvements over standard MARL approaches.