Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization (2502.14496v2)

Published 20 Feb 2025 in cs.CL

Abstract: LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and other domains beyond computer use. Current multi-agent systems universally excel in performance compared to single agents, but struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi-agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi-agent reinforcement learning framework with a novel multi-agent credit re-assignment (CR) strategy, assigning process rewards with LLMs rather than environment-specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role-free agents' policies. Empirical results show that our framework improves both performance and cross-environment generalizability of multi-agent systems. Moreover, our 7B-parameter system achieves results on par with or exceeding strong closed-source models, and the LLM that guides the CR. We also provide insights into using granular CR rewards effectively for environment generalization, and accommodating trained LLMs in multi-agent systems.

Summary

  • The paper introduces CollabUIAgents, a language multi-agent framework using Multi-Agent Credit Re-Assignment (CR) to enhance performance and cross-environment generalization in interactive tasks.
  • The CR strategy leverages an LLM critic to provide granular, process-level rewards which are then used to generate preference data for policy optimization via Direct Preference Optimization (DPO).
  • Empirical results show CollabUIAgents achieves state-of-the-art performance on mobile benchmarks and demonstrates significant cross-environment generalization, even rivaling larger models like GPT-4.

The CollabUIAgents framework (2502.14496) addresses a key limitation in language-based multi-agent systems (MAS) for interactive environments: the difficulty of concurrently achieving high task performance and robust generalization across diverse environments (e.g., mobile vs. web UI). While MAS often outperform single agents through collaboration, their generalization capabilities are frequently hampered by factors like predefined roles or reward structures tied too closely to specific training environments. CollabUIAgents introduces a multi-agent reinforcement learning (MARL) approach centered around a novel Multi-Agent Credit Re-Assignment (CR) strategy to foster generalizable collaborative behaviors.

CollabUIAgents Framework Architecture

CollabUIAgents employs a set of $N$ language agents, $\{\pi_1, ..., \pi_N\}$, initialized from the same base UIAgent model (e.g., an agentic fine-tuned LLM). Crucially, these agents are role-free, meaning they do not have predetermined specializations. Heterogeneity in their behavior arises dynamically through a communication network, modeled as a Directed Acyclic Graph (DAG), where edges $E_G$ dictate message passing. At each timestep $t$, given the environment observation $o_t$ and task query $q$, the system generates an action matrix $A_t = \{a_t^{i,j}\}_{i \in [1, N], j \in [1, K]}$, where $K$ is the number of communication rounds. Agent $\pi_i$ produces actions $a_t^{i,j}$ based on its received messages and the shared observation/query. The final action executed in the environment is determined by aggregating the agents' proposals, typically via majority voting over the final round's actions $\{a_t^{i,K}\}$.
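
A minimal sketch of this decision loop is shown below, assuming each agent policy is a callable mapping (observation, query, incoming messages) to a proposed action string, and that messages in round $j$ are the predecessors' proposals from round $j-1$; the helper names (`run_timestep`, `majority vote` via `Counter`) are illustrative and not the paper's exact message-passing schedule.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# An agent policy maps (observation, query, incoming messages) -> proposed action.
AgentPolicy = Callable[[str, str, List[str]], str]

def run_timestep(
    agents: List[AgentPolicy],
    edges: List[Tuple[int, int]],   # DAG edges (sender -> receiver) over agent indices
    observation: str,
    query: str,
    num_rounds: int,
) -> Tuple[str, Dict[Tuple[int, int], str]]:
    """Generate the action matrix A_t over K communication rounds and pick a final action."""
    n = len(agents)
    action_matrix: Dict[Tuple[int, int], str] = {}   # (agent i, round j) -> a_t^{i,j}
    last_round = [""] * n

    for j in range(num_rounds):
        current = []
        for i, policy in enumerate(agents):
            # Messages are the previous-round proposals of this agent's DAG predecessors.
            inbox = [last_round[s] for (s, r) in edges if r == i and last_round[s]]
            action = policy(observation, query, inbox)
            action_matrix[(i, j)] = action
            current.append(action)
        last_round = current

    # Aggregate the final round's proposals by majority vote.
    final_action = Counter(last_round).most_common(1)[0][0]
    return final_action, action_matrix

# Toy usage: three copies of a trivial policy that always proposes the same action.
if __name__ == "__main__":
    dummy: AgentPolicy = lambda obs, q, msgs: "CLICK(search_button)"
    edges = [(0, 1), (0, 2), (1, 2)]   # a small DAG over 3 agents
    action, A_t = run_timestep([dummy] * 3, edges, "home_screen", "open settings", num_rounds=2)
    print(action)
```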

A key component for enabling adaptable collaboration is the "edge update trick." During MARL training, the edges $E_G$ of the communication DAG are randomly updated (e.g., re-sampled) independently of the agent policy parameter updates. This mechanism forces agents to learn robust collaborative strategies that are not overly reliant on a fixed communication topology, thereby enhancing adaptability and potentially improving generalization to novel scenarios or team compositions.
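
One simple way to realize such random edge updates is to re-sample a DAG by drawing a random topological order over the agents and keeping each forward edge with some probability. The snippet below is a hedged sketch of that idea, not the paper's exact sampling procedure.

```python
import random
from typing import List, Tuple

def sample_random_dag(num_agents: int, edge_prob: float = 0.5, seed=None) -> List[Tuple[int, int]]:
    """Sample a random DAG: shuffle an agent ordering, then keep each forward edge with probability edge_prob."""
    rng = random.Random(seed)
    order = list(range(num_agents))
    rng.shuffle(order)  # a random topological order guarantees acyclicity
    edges = []
    for s in range(num_agents):
        for r in range(s + 1, num_agents):
            if rng.random() < edge_prob:
                edges.append((order[s], order[r]))
    return edges

# During MARL training, the topology would be re-sampled independently of policy updates, e.g.:
# edges = sample_random_dag(num_agents=4)   # then run the communication rounds with the new DAG
```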

Multi-Agent Credit Re-Assignment (CR) Strategy

The core innovation for improving both performance and generalization is the Multi-Agent Credit Re-Assignment (CR) strategy. This strategy circumvents the limitations of sparse, environment-specific outcome rewards (e.g., binary success/failure) by leveraging an LLM critic to provide dense, process-level feedback.

  1. LLM Critic for Process Rewards: An LLM-based critic agent, $\pi_{critic}$ (e.g., GPT-4), assesses the quality of the proposed action matrix $A_t$ at each timestep $t$. The critic considers the current observation $o_t$, interaction history $H_{t-1}$, task query $q$, and the proposed actions $A_t$.
  2. Granular Reward Assignment: Instead of a single reward for the timestep or episode, the critic assigns fine-grained binary rewards $r_t^{i,j} \in \{0, 1\}$ for each agent $i$ and each communication round $j$. A reward $r_t^{i,j} = 1$ indicates the critic deems action $a_t^{i,j}$ appropriate given the context. This provides a significantly denser and more informative reward signal compared to typical MARL reward schemes.
  3. Leveraging LLM World Knowledge: The critic's evaluation relies on the general world knowledge embedded within the LLM. This makes the reward signal less dependent on the specifics of the training environment's reward function and promotes the learning of behaviors that are broadly sensible, thus enhancing generalizability. The paper suggests this can "restore generalizable behaviors from failed trajectories" by rewarding sensible intermediate steps even if the final outcome is unsuccessful.
  4. Synthesized Preference Data Generation: Directly using potentially noisy LLM rewards for value-based RL can be problematic. CollabUIAgents instead uses the CR rewards to generate preference data for policy optimization. An "adversarial agent," $\pi_{adv}$, is employed. For each action $a_t^{i,j}$ deemed good by the critic ($r_t^{i,j} = 1$), $\pi_{adv}$ generates a corresponding low-quality action $a_t^{i,j,-}$. This creates a preference pair $(a_t^{i,j}, a_t^{i,j,-})$ where $a_t^{i,j}$ is preferred over $a_t^{i,j,-}$.
  5. Preference Optimization: The agent policies $\pi_i$ are optimized using Direct Preference Optimization (DPO) on the large corpus of synthesized preference data. The DPO objective aims to increase the likelihood of preferred actions and decrease the likelihood of dispreferred actions, relative to a reference policy $\pi_{ref}$ (typically the initial base UIAgent):

    $$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(a_w, a_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(a_w \mid x)}{\pi_{ref}(a_w \mid x)} - \beta \log \frac{\pi_\theta(a_l \mid x)}{\pi_{ref}(a_l \mid x)} \right) \right]$$

    where $x$ represents the context (observation, query, history, messages), $(a_w, a_l)$ is a preferred/dispreferred action pair (i.e., $(a_t^{i,j}, a_t^{i,j,-})$), $\mathcal{D}$ is the dataset of synthesized preferences, $\sigma$ is the sigmoid function, and $\beta$ is a temperature parameter. This approach avoids explicit value function estimation and leverages the robustness of preference learning signals, even with imperfect critic assessments.
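
The sketch below ties steps 2, 4, and 5 together: binary critic rewards over the action matrix select winning actions, a stand-in adversarial generator produces the dispreferred counterpart, and the DPO objective is evaluated from per-action log-probabilities under the policy and reference model. The function names and the log-probability interface are assumptions for illustration, not the paper's implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class PreferencePair:
    context: str    # x: observation, query, history, and received messages
    chosen: str     # a_w = a_t^{i,j}, judged appropriate by the critic
    rejected: str   # a_l = a_t^{i,j,-}, produced by the adversarial agent

def build_preference_pairs(
    action_matrix: Dict[Tuple[int, int], str],     # (agent i, round j) -> a_t^{i,j}
    critic_rewards: Dict[Tuple[int, int], int],    # (i, j) -> r_t^{i,j} in {0, 1} from the LLM critic
    context: str,
    adversarial_agent: Callable[[str, str], str],  # (context, good action) -> low-quality action
) -> List[PreferencePair]:
    """For every action the critic rewarded, synthesize a dispreferred alternative."""
    pairs = []
    for key, action in action_matrix.items():
        if critic_rewards.get(key, 0) == 1:
            pairs.append(PreferencePair(context, action, adversarial_agent(context, action)))
    return pairs

def dpo_loss(
    logp_policy_chosen: float, logp_policy_rejected: float,
    logp_ref_chosen: float, logp_ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair, matching the displayed objective."""
    margin = beta * (logp_policy_chosen - logp_ref_chosen) \
             - beta * (logp_policy_rejected - logp_ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy usage with made-up log-probabilities and a trivial adversarial generator.
if __name__ == "__main__":
    A_t = {(0, 0): "CLICK(settings)", (1, 0): "SCROLL(down)"}
    rewards = {(0, 0): 1, (1, 0): 0}                    # critic approves only the first action
    adversary = lambda ctx, good: "CLICK(random_icon)"  # stand-in for the adversarial agent
    pairs = build_preference_pairs(A_t, rewards, "home_screen | 'open settings'", adversary)
    print(pairs)
    print(dpo_loss(-1.0, -3.0, -1.5, -2.5))  # loss shrinks as the policy favors a_w more than the reference does
```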

Enhancing Cross-Environment Generalization

Several aspects of CollabUIAgents contribute to its improved generalization capabilities:

  1. Role-Free Agents: The lack of fixed roles allows the multi-agent system to potentially adapt more readily to new environments or tasks where predefined specializations might be suboptimal.
  2. Generalizable Reward Signal: The CR strategy's reliance on LLM world knowledge provides a supervisory signal that encourages learning interaction principles applicable beyond the training environment.
  3. Edge Updates: Randomly varying the communication structure during training prevents overfitting to specific collaborative patterns and promotes robustness.
  4. Cross-Environment Adaptation:
    • Direct Transfer: Trained CollabUIAgents models can be directly applied to unseen environments, leveraging learned general UI interaction patterns. Performance improvements over the base single agent are observed even with direct transfer (e.g., mobile-trained agents on web tasks).
    • Continual MARL: For more significant domain shifts or maximizing performance in a new environment, the framework supports continual MARL. The agents undergo further training using the CR strategy on data collected (potentially autonomously) from the new environment. This allows specialization without catastrophic forgetting, as demonstrated by maintained performance on original mobile tasks after continual training on web tasks.
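
A compact sketch of these two adaptation modes is given below. `run_episode`, `critic_rewards_for`, `synthesize_pairs`, and `dpo_update` are hypothetical stand-ins for the rollout, LLM-critic scoring, preference synthesis, and DPO steps described above, not APIs from the paper.

```python
from typing import Callable, List

def adapt_to_new_environment(
    agents: List[Callable],
    new_env_tasks: List[str],
    continual: bool,
    run_episode: Callable,         # rolls out the multi-agent system on one task, returns a trajectory
    critic_rewards_for: Callable,  # LLM critic -> binary rewards r_t^{i,j} for each trajectory step
    synthesize_pairs: Callable,    # builds (chosen, rejected) pairs via the adversarial agent
    dpo_update: Callable,          # one DPO optimization step on a batch of preference pairs
):
    """Direct transfer just runs the trained agents; continual MARL repeats the CR + DPO loop on new-env data."""
    if not continual:
        # Direct transfer: rely on learned general UI interaction patterns, no weight updates.
        return [run_episode(agents, task) for task in new_env_tasks]

    # Continual MARL: collect data in the new environment and keep training with the same CR strategy.
    for task in new_env_tasks:
        trajectory = run_episode(agents, task)
        rewards = critic_rewards_for(trajectory)
        dpo_update(agents, synthesize_pairs(trajectory, rewards))
    return agents
```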

Empirical Validation and Insights

Empirical results demonstrate the effectiveness of CollabUIAgents.

  • Performance: On mobile benchmarks (AndroidWorld, MobileMiniWoB++), the 7B parameter CollabUIAgents_mobile achieves state-of-the-art results among open-source models, notably outperforming Gemini 1.5 Pro and matching or exceeding the performance of the GPT-4 critic used for CR. This highlights the framework's ability to distill strong capabilities from the critic into smaller, specialized agents.
  • Generalization: CollabUIAgents shows significantly higher generalization uplift (Δ_Generalization) compared to baselines when tested on unseen tasks within the same domain. When transferring across domains (mobile to web), direct transfer shows gains over the base agent, and continual MARL (CollabUIAgents_web) achieves substantial performance on web benchmarks (Mind2Web, AutoWebBench), rivaling GPT-4 despite lacking large-scale web pre-training.
  • Ablation Studies:
    • Removing the CR strategy (w/o CR) drastically reduces both performance and generalization, confirming its critical role.
    • Replacing DPO with rejection sampling fine-tuning (RFT) based on the CR rewards (w/ PO -> RFT) is less effective, emphasizing that the combination of granular CR rewards and preference optimization (DPO) is key (Takeaway 3).
    • Removing edge updates (w/o Edge Update) results in a minor decrease in performance and generalization, supporting its contribution to adaptability (Takeaway 4).

In conclusion, the CollabUIAgents framework effectively enhances language multi-agent learning by introducing a multi-agent credit re-assignment strategy. This strategy utilizes an LLM critic for granular, process-level rewards and optimizes policies via DPO on synthesized preference data. Combined with role-free agents and adaptive communication structures via edge updates, this approach yields significant improvements in both task performance and, notably, cross-environment generalization for interactive agents, demonstrating strong results even compared to powerful closed-source models.