Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning (2502.06060v1)

Published 9 Feb 2025 in cs.AI, cs.CL, cs.LG, and cs.MA

Abstract: Communicating in natural language is a powerful tool in multi-agent settings, as it enables independent agents to share information in partially observable settings and allows zero-shot coordination with humans. However, most prior works are limited as they either rely on training with large amounts of human demonstrations or lack the ability to generate natural and useful communication strategies. In this work, we train LLMs to have productive discussions about their environment in natural language without any human demonstrations. We decompose the communication problem into listening and speaking. Our key idea is to leverage the agent's goal to predict useful information about the world as a dense reward signal that guides communication. Specifically, we improve a model's listening skills by training them to predict information about the environment based on discussions, and we simultaneously improve a model's speaking skills with multi-agent reinforcement learning by rewarding messages based on their influence on other agents. To investigate the role and necessity of communication in complex social settings, we study an embodied social deduction game based on Among Us, where the key question to answer is the identity of an adversarial imposter. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions, doubling the win rates compared to standard RL. We release our code and models at https://socialdeductionLLM.github.io/

The paper "Training LLMs for Social Deduction with Multi-Agent Reinforcement Learning" (Sarkar et al., 9 Feb 2025 ) presents a framework for enabling LLM based agents to learn effective natural language communication strategies within multi-agent settings, specifically focusing on social deduction games, without reliance on human demonstration data. The methodology employs Multi-Agent Reinforcement Learning (MARL) and decomposes the communication problem into distinct listening and speaking components, guided by a dense reward signal derived from predicting critical game state information.

MARL Framework for Social Deduction

The environment is modeled as a Partially Observable Markov Game (POMG), augmented with a social deduction objective: identifying a specific element $q$ from a set $\mathcal{Q}$ (e.g., the imposter's identity). Each agent $i$'s policy $\pi^i$ is parameterized by an LLM, specifically RWKV, selected for its recurrent architecture suited to processing the long action-observation histories $\tau^i$ prevalent in MARL and its efficiency for RL fine-tuning. The agent's history $\tau^i_t = (o^i_0, a^i_0, \dots, o^i_t)$ is formatted as a sequence of text tokens. Action selection $a^i_t \sim \pi^i(\cdot \mid \tau^i_t)$ involves the LLM predicting the next token, constrained by the set of valid actions $\mathcal{A}(\tau^i_t)$ provided by the environment simulator.
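
A minimal sketch of this constrained decoding step in PyTorch, assuming each action corresponds to a single token (the names `sample_action_token` and `valid_action_ids` are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def sample_action_token(next_token_logits: torch.Tensor, valid_action_ids: list[int]) -> int:
    """Restrict the LLM's next-token distribution to the valid-action tokens
    supplied by the environment simulator, then sample one action token."""
    mask = torch.full_like(next_token_logits, float("-inf"))
    mask[valid_action_ids] = 0.0                         # keep only tokens in A(tau)
    probs = F.softmax(next_token_logits + mask, dim=-1)  # renormalize over valid actions
    return int(torch.multinomial(probs, num_samples=1).item())

# Example: a toy 10-token vocabulary where tokens 3, 5, and 7 are the valid actions.
logits = torch.randn(10)
action = sample_action_token(logits, valid_action_ids=[3, 5, 7])
```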

A baseline MARL approach utilizes Proximal Policy Optimization (PPO) to optimize the agent policies ($\pi_\text{RL}$). The objective maximizes the expected discounted sum of sparse environmental rewards $r^i_t$ (e.g., +1 for winning, -1 for losing, small rewards for task completion).

$$L_\text{RL}(\pi, \tau^i_t) = -\mathbb{E}_{\pi}\left[A(\tau^i_t, a^i_t)\right]$$

where $A(\tau^i_t, a^i_t)$ is the advantage estimate.
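
A minimal PyTorch sketch of this objective in the usual clipped-surrogate form (the clipping constant, tensor shapes, and the assumption that advantages are estimated elsewhere, e.g. via GAE, are not stated in the paper):

```python
import torch

def ppo_policy_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                    advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate corresponding to L_RL.
    logp_new / logp_old: log-probabilities of the taken action tokens under the
    current and rollout policies; advantages: estimates of A(tau^i_t, a^i_t)."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```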

To mitigate catastrophic forgetting of language capabilities and prevent the policy from diverging into non-linguistic action sequences during RL optimization, a KL-divergence regularization term is added to the PPO loss. This penalizes deviations from the base pre-trained LLM ($\pi_\text{RWKV}$):

$$\lambda_\text{NL} \log \left( \frac{\pi(a^i_t \mid \tau^i_t)}{\pi_\text{RWKV}(a^i_t \mid \tau^i_t)} \right)$$
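
A sketch of how this per-token penalty could be computed from the sampled tokens' log-probabilities (averaging over tokens and the value of $\lambda_\text{NL}$ are assumptions, not reported values):

```python
import torch

def nl_regularization(logp_policy: torch.Tensor, logp_base: torch.Tensor,
                      lambda_nl: float = 0.1) -> torch.Tensor:
    """Penalty lambda_NL * log(pi / pi_RWKV) evaluated at the sampled tokens.
    logp_policy / logp_base: log-probs of the chosen tokens under the fine-tuned
    policy and the frozen base RWKV model."""
    return (lambda_nl * (logp_policy - logp_base)).mean()
```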

Training employs iterated self-play, where agents are trained against frozen policies from previous iterations. Specifically, crewmates train against past imposter policies, while imposters train adversarially (using an inverted speaking reward, discussed later) against crewmate policies. To enhance robustness and prevent convergence to exploitable conventions (e.g., all agents remaining silent), heterogeneous team training is used: one crewmate agent is kept frozen using only the listening policy ($\pi_L$, detailed below), simulating ad-hoc teamwork scenarios.
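
The training schedule implied by this paragraph can be sketched structurally as follows (all function names and the pool-based bookkeeping are illustrative; the authors' actual loop may differ):

```python
def iterated_self_play(train_crewmate, train_imposter,
                       init_crewmate, init_imposter, listening_policy,
                       n_iterations: int = 5):
    """Structural sketch of iterated self-play. The train_* callables are assumed
    to run one PPO phase against the given frozen opponents and return a frozen
    copy of the updated policy."""
    crewmate_pool = [init_crewmate()]
    imposter_pool = [init_imposter()]
    for _ in range(n_iterations):
        # Crewmates train against past imposter policies; one teammate stays
        # frozen as the listening-only policy pi_L to avoid brittle conventions.
        crewmate_pool.append(
            train_crewmate(opponents=imposter_pool, frozen_teammate=listening_policy))
        # Imposters train adversarially (negated speaking reward) against crewmates.
        imposter_pool.append(train_imposter(opponents=crewmate_pool))
    return crewmate_pool[-1], imposter_pool[-1]
```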

Furthermore, a world modeling loss ($L_\text{WM}$) is incorporated. This auxiliary loss trains the LLM policy to predict the next observation token $o^i_{t+1}$ based on the current history and action $(\tau^i_t, a^i_t)$. This objective aids in stabilizing training for the recurrent RWKV architecture, preserves the model's understanding of environmental dynamics encoded in observations, and discourages the policy from over-utilizing action tokens during discussion phases.

$$L_\text{WM}(\pi, \tau^i_t, a^i_t) = -\log \pi(o^i_{t+1} \mid \tau^i_t, a^i_t)$$
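
Since observations are themselves token sequences, $L_\text{WM}$ reduces to a standard next-token cross-entropy restricted to observation positions; a sketch with assumed tensor shapes:

```python
import torch
import torch.nn.functional as F

def world_model_loss(obs_logits: torch.Tensor, next_obs_tokens: torch.Tensor) -> torch.Tensor:
    """L_WM: negative log-likelihood of the next observation tokens.
    obs_logits: (T, vocab_size) next-token logits at the positions that precede
    observation tokens; next_obs_tokens: (T,) ground-truth observation token ids."""
    return F.cross_entropy(obs_logits, next_obs_tokens)
```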

Decomposed Communication Learning: Listening and Speaking

The core idea is to decouple the complex communication learning problem into two sub-problems: understanding incoming information (listening) and generating useful outgoing information (speaking).

Listening via Imposter Prediction

The listening skill is framed as a supervised learning task focused on the core deduction objective: predicting the imposter's identity $q$. At specific points in the game (e.g., between discussion messages), the environment queries each crewmate agent $i$ for its belief about the imposter. The agent is trained to maximize the probability assigned to the true imposter $q$ given its current history $\tau^i_t$. The loss function is the negative log-likelihood:

$$L_\text{L}(\pi, \tau^i_t) = -\log \pi(q \mid \tau^i_t)$$

This loss is computed only at query steps and utilizes the ground-truth imposter identity $q$ provided by the simulator. It supplies a direct, dense signal for interpreting observations and messages in the context of the primary game goal, effectively grounding the communication. The policy optimized with RL plus this listening loss (but without the speaking reward) is denoted $\pi_\text{RL+L}$.
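
A minimal sketch of the listening loss at a single query step (how the belief logits are read off the LLM's output head is an assumption here):

```python
import torch
import torch.nn.functional as F

def listening_loss(belief_logits: torch.Tensor, true_imposter_idx: int) -> torch.Tensor:
    """L_L: negative log-likelihood of the true imposter q at a belief query step.
    belief_logits: (num_candidates,) logits over candidate imposters derived from
    the LLM at the query; true_imposter_idx: ground truth supplied by the simulator."""
    target = torch.tensor([true_imposter_idx])
    return F.cross_entropy(belief_logits.unsqueeze(0), target)
```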

Speaking via Reinforced Discussion Learning

The speaking skill is trained using RL, guided by a novel dense reward signal $r^s_t$ that measures the immediate causal impact of an agent's message on the beliefs of its teammates. When agent $i$ sends a message at time $t'$, resulting in a new history $\tau^k_t$ for each other agent $k$ at time $t$, the speaking reward is calculated as the change in the sum of probabilities assigned to the true imposter $q$ by all other living crewmates $C_t$:

$$r^s_t = B_t - B_{t'}$$

where $B_t = \sum_{k \in C_t, k \neq i} \pi^k(q \mid \tau^k_t)$ and $B_{t'}$ is the corresponding sum before the message was processed (at time $t'$).

This reward directly incentivizes messages that positively influence the team's collective certainty about the imposter. For imposter training, an adversarial reward $-r^s_t$ is used to encourage messages that confuse or mislead crewmates. This reward term is incorporated into the PPO objective for message-generation actions.
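
Given the per-listener belief probabilities $\pi^k(q \mid \tau^k)$, the reward itself is a simple difference of sums; a sketch (extracting the listener beliefs is assumed to reuse the listening head above):

```python
def speaking_reward(beliefs_before: list[float], beliefs_after: list[float],
                    is_imposter: bool = False) -> float:
    """r^s_t = B_t - B_{t'}: change in the summed probability the living crewmate
    listeners (k != i) assign to the true imposter q, measured before and after
    they process the speaker's message. Imposters receive the negated reward."""
    reward = sum(beliefs_after) - sum(beliefs_before)
    return -reward if is_imposter else reward

# Example: three listeners raise their belief in the true imposter from
# (0.2, 0.3, 0.1) to (0.5, 0.6, 0.4), yielding a speaking reward of +0.9.
r = speaking_reward([0.2, 0.3, 0.1], [0.5, 0.6, 0.4])
```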

Integrated Objective

The final policy, $\pi_\text{RL+L+S}$, is trained by combining the standard PPO loss for environmental rewards, the supervised listening loss $L_\text{L}$, and the PPO loss incorporating the speaking reward $r^s_t$, along with the KL regularization and world modeling losses.

$$L_\text{RL+L+S} = L_\text{RL} + \lambda_\text{L} L_\text{L} + \lambda_\text{S} L_\text{S} + \lambda_\text{NL} L_\text{NL} + \lambda_\text{WM} L_\text{WM}$$

where $L_\text{S}$ represents the PPO loss component derived from the speaking reward $r^s_t$, and the $\lambda$ terms are hyperparameters balancing the different objectives.
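
Putting the pieces together, the total objective is a weighted sum of the terms sketched above (the weights here are placeholders, not the paper's reported hyperparameters):

```python
def combined_loss(l_rl, l_listen, l_speak, l_nl, l_wm,
                  lam_l=1.0, lam_s=1.0, lam_nl=0.1, lam_wm=1.0):
    """Weighted sum L_RL+L+S of the individual loss terms (lambda values illustrative)."""
    return l_rl + lam_l * l_listen + lam_s * l_speak + lam_nl * l_nl + lam_wm * l_wm
```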

Role of Dense Reward Signal

The prediction of the imposter's identity serves as a crucial dense reward signal. Standard MARL in such settings often suffers from sparse rewards (win/loss occurs only at the end of an episode), making credit assignment difficult, especially for communicative actions whose benefits are indirect and delayed.

By contrast, the belief-based signals provide immediate feedback:

  1. Listening: The $L_\text{L}$ loss provides frequent supervised signals during the game, guiding the agent to extract relevant information from observations and dialogue.
  2. Speaking: The $r^s_t$ reward provides immediate reinforcement after each message, directly correlating the message content with its effect on teammates' critical beliefs $p(q \mid \tau)$.

This grounds the natural language communication in the underlying game objective, pushing agents to produce strategically relevant utterances rather than just syntactically correct text.

Experimental Validation and Results

The framework was evaluated in a custom 2D grid-world environment inspired by Among Us. Agents navigate, perform tasks, report bodies, engage in free-form text discussions, and vote. The environment simulates partial observability, task dependencies, kill mechanics, and voting dynamics.

Performance Metrics

  • Base LLMs (RWKV 1.5B, 7B) without MARL training achieved low win rates (<20%).
  • Standard PPO ($\pi_\text{RL}$) improved performance but struggled with the deduction aspect.
  • Adding the listening loss ($\pi_\text{RL+L}$) significantly boosted the crewmate win rate.
  • The full model incorporating the speaking reward ($\pi_\text{RL+L+S}$) achieved the highest crewmate win rate, approximately doubling the win rate of the standard RL baseline ($\pi_\text{RL}$) and substantially outperforming even larger base LLMs.

Emergent Behaviors

Qualitative analysis revealed the emergence of sophisticated, human-like communication strategies:

  • Accusations: Agents directly accuse others based on observed behavior or inconsistencies.
  • Evidence Provision: Agents support accusations with justifications derived from their observations (e.g., "Player Green was seen leaving the room where the body was found"). Agents also learned to sometimes fabricate evidence convincingly.
  • Imposter Counter-Play: Imposters learned to deflect suspicion, make counter-accusations, and mimic crewmate communication patterns to blend in.

Robustness and Ablations

  • Iterated self-play with adversarial imposter training resulted in robust crewmate policies, evidenced by a rapidly narrowing exploitability gap against evolving imposter strategies.
  • Ablation studies confirmed the significant contributions of both the listening loss ($L_\text{L}$) and the speaking reward ($r^s_t$) to the final performance. The listening component appeared particularly critical for grounding the agents' understanding and enabling effective deduction.

Conclusion

The research demonstrates a viable method for training LLM agents to use natural language communication effectively in complex MARL environments like social deduction games, bypassing the need for human communication data. By decomposing communication into listening (trained via supervised prediction of game state) and speaking (trained via RL with belief-based rewards), and leveraging the imposter prediction task as a dense reward signal, the framework enables agents to develop robust and strategically relevant communication behaviors, leading to significant performance improvements over standard MARL approaches.

Authors (4)
  1. Bidipta Sarkar (9 papers)
  2. Warren Xia (1 paper)
  3. C. Karen Liu (93 papers)
  4. Dorsa Sadigh (162 papers)