- The paper introduces CollabUIAgents, a language multi-agent framework using Multi-Agent Credit Re-Assignment (CR) to enhance performance and cross-environment generalization in interactive tasks.
- The CR strategy leverages an LLM critic to provide granular, process-level rewards which are then used to generate preference data for policy optimization via Direct Preference Optimization (DPO).
- Empirical results show CollabUIAgents achieves state-of-the-art performance on mobile benchmarks and demonstrates significant cross-environment generalization, even rivaling larger models like GPT-4.
The CollabUIAgents framework (2502.14496) addresses a key limitation in language-based multi-agent systems (MAS) for interactive environments: the difficulty of concurrently achieving high task performance and robust generalization across diverse environments (e.g., mobile vs. web UI). While MAS often outperform single agents through collaboration, their generalization capabilities are frequently hampered by factors like predefined roles or reward structures tied too closely to specific training environments. CollabUIAgents introduces a multi-agent reinforcement learning (MARL) approach centered around a novel Multi-Agent Credit Re-Assignment (CR) strategy to foster generalizable collaborative behaviors.
CollabUIAgents Framework Architecture
CollabUIAgents employs a set of $N$ language agents, $\{\pi_1, \dots, \pi_N\}$, initialized from the same base UIAgent model (e.g., an agentic fine-tuned LLM). Crucially, these agents are role-free, meaning they do not have predetermined specializations. Heterogeneity in their behavior arises dynamically through a communication network, modeled as a Directed Acyclic Graph (DAG), where edges $E_G$ dictate message passing. At each timestep $t$, given the environment observation $o_t$ and task query $q$, the system generates an action matrix $A_t = \{a_t^{i,j}\}_{i \in [1,N],\, j \in [1,K]}$, where $K$ is the number of communication rounds. Agent $\pi_i$ produces actions $a_t^{i,j}$ based on its received messages and the shared observation/query. The final action executed in the environment is determined by aggregating the agents' proposals, typically via majority voting over the final round's actions $\{a_t^{i,K}\}$.
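A minimal sketch of this aggregation step, assuming string-valued actions and a plain list-of-lists representation of $A_t$ (function and variable names here are illustrative, not the paper's interface):

```python
from collections import Counter

def aggregate_final_action(action_matrix: list[list[str]]) -> str:
    """Pick the executed action by majority vote over the final round.

    `action_matrix[i][j]` holds agent i's proposed action in communication
    round j, i.e., the entries a_t^{i,j} described above.
    """
    final_round = [rounds[-1] for rounds in action_matrix]  # {a_t^{i,K}}
    votes = Counter(final_round)
    # Ties go to the action encountered first; a real system may want an
    # explicit tie-breaking rule (e.g., deferring to a designated agent).
    action, _count = votes.most_common(1)[0]
    return action

# Example: three agents, two communication rounds (K = 2).
A_t = [
    ["scroll_down", "tap(login)"],
    ["tap(login)",  "tap(login)"],
    ["tap(signup)", "tap(signup)"],
]
print(aggregate_final_action(A_t))  # -> "tap(login)"
```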
A key component for enabling adaptable collaboration is the "edge update trick." During MARL training, the edges $E_G$ of the communication DAG are randomly updated (e.g., re-sampled) independently of the agent policy parameter updates. This mechanism forces agents to learn robust collaborative strategies that are not overly reliant on a fixed communication topology, thereby enhancing adaptability and potentially improving generalization to novel scenarios or team compositions.
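One way to realize the edge update is to re-sample a random DAG over a fixed agent ordering at every training iteration; sampling only forward edges guarantees acyclicity. The Bernoulli edge probability below is an assumption, as the paper specifies only that edges are randomly updated independently of the policy updates:

```python
import random

def sample_random_dag_edges(num_agents: int, edge_prob: float = 0.5,
                            rng: random.Random | None = None) -> set[tuple[int, int]]:
    """Re-sample the communication edges E_G as a random DAG.

    Only edges (i -> j) with i < j under a fixed agent ordering are
    considered, so the resulting graph is acyclic by construction.
    """
    rng = rng or random.Random()
    return {
        (i, j)
        for i in range(num_agents)
        for j in range(i + 1, num_agents)
        if rng.random() < edge_prob
    }

# Re-sample the topology each iteration, independently of policy updates.
for step in range(3):
    edges = sample_random_dag_edges(num_agents=4, rng=random.Random(step))
    print(step, sorted(edges))
```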
Multi-Agent Credit Re-Assignment (CR) Strategy
The core innovation for improving both performance and generalization is the Multi-Agent Credit Re-Assignment (CR) strategy. This strategy circumvents the limitations of sparse, environment-specific outcome rewards (e.g., binary success/failure) by leveraging an LLM critic to provide dense, process-level feedback.
- LLM Critic for Process Rewards: An LLM-based critic agent, $\pi_{\text{critic}}$ (e.g., GPT-4), assesses the quality of the proposed action matrix $A_t$ at each timestep $t$. The critic considers the current observation $o_t$, interaction history $H_{t-1}$, task query $q$, and the proposed actions $A_t$.
- Granular Reward Assignment: Instead of a single reward for the timestep or episode, the critic assigns fine-grained binary rewards $r_t^{i,j} \in \{0, 1\}$ for each agent $i$ and each communication round $j$. A reward $r_t^{i,j} = 1$ indicates the critic deems action $a_t^{i,j}$ appropriate given the context. This provides a significantly denser and more informative reward signal compared to typical MARL reward schemes.
- Leveraging LLM World Knowledge: The critic's evaluation relies on the general world knowledge embedded within the LLM. This makes the reward signal less dependent on the specifics of the training environment's reward function and promotes the learning of behaviors that are broadly sensible, thus enhancing generalizability. The paper suggests this can "restore generalizable behaviors from failed trajectories" by rewarding sensible intermediate steps even if the final outcome is unsuccessful.
- Synthesized Preference Data Generation: Directly using potentially noisy LLM rewards for value-based RL can be problematic. CollabUIAgents instead uses the CR rewards to generate preference data for policy optimization. An "adversarial agent," $\pi_{\text{adv}}$, is employed. For each action $a_t^{i,j}$ deemed good by the critic ($r_t^{i,j} = 1$), $\pi_{\text{adv}}$ generates a corresponding low-quality action $a_t^{i,j,-}$. This creates a preference pair $(a_t^{i,j}, a_t^{i,j,-})$ where $a_t^{i,j}$ is preferred over $a_t^{i,j,-}$ (see the preference-synthesis sketch after this list).
- Preference Optimization: The agent policies $\pi_i$ are optimized using Direct Preference Optimization (DPO) on the large corpus of synthesized preference data. The DPO objective aims to increase the likelihood of preferred actions and decrease the likelihood of dispreferred actions, relative to a reference policy $\pi_{\text{ref}}$ (typically the initial base UIAgent):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(a_w, a_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(a_w \mid x)}{\pi_{\mathrm{ref}}(a_w \mid x)} - \beta \log \frac{\pi_\theta(a_l \mid x)}{\pi_{\mathrm{ref}}(a_l \mid x)}\right)\right]$$

where $x$ represents the context (observation, query, history, messages), $(a_w, a_l)$ is a preferred/dispreferred action pair (i.e., $(a_t^{i,j}, a_t^{i,j,-})$), $\mathcal{D}$ is the dataset of synthesized preferences, $\sigma$ is the sigmoid function, and $\beta$ is a temperature parameter. This approach avoids explicit value function estimation and leverages the robustness of preference learning signals, even with imperfect critic assessments. A minimal computation of this loss is sketched below.
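A sketch of the preference-synthesis step referenced above; `adversarial_agent` is a hypothetical callable standing in for $\pi_{\text{adv}}$, and the list-of-lists layout of contexts, actions, and critic rewards is an assumed data structure, not the paper's exact format:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    context: str   # x: observation, query, history, received messages
    chosen: str    # a_t^{i,j}, judged appropriate by the critic
    rejected: str  # a_t^{i,j,-}, produced by the adversarial agent

def synthesize_preferences(contexts, actions, critic_rewards, adversarial_agent):
    """Turn granular critic rewards r_t^{i,j} into DPO preference pairs.

    For every action the critic scored 1, the adversarial agent is asked
    for a deliberately low-quality alternative to pair it against.
    """
    pairs = []
    for i, rounds in enumerate(actions):
        for j, action in enumerate(rounds):
            if critic_rewards[i][j] == 1:
                bad_action = adversarial_agent(contexts[i][j], action)
                pairs.append(PreferencePair(contexts[i][j], action, bad_action))
    return pairs

# Toy usage: one agent, two rounds, critic approves only the second action.
pairs = synthesize_preferences(
    contexts=[["ctx_round_1", "ctx_round_2"]],
    actions=[["scroll_down", "tap(login)"]],
    critic_rewards=[[0, 1]],
    adversarial_agent=lambda ctx, a: "tap(wrong_button)",
)
print(pairs)  # one pair: chosen="tap(login)", rejected="tap(wrong_button)"
```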
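The DPO loss itself can then be computed directly from per-example sequence log-probabilities; a minimal, framework-agnostic sketch assuming $\log \pi_\theta(a \mid x)$ and $\log \pi_{\text{ref}}(a \mid x)$ have already been gathered for each pair:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over a batch of synthesized preference pairs."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for a_w
    rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for a_l
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()  # -E[log sigma(.)]

# Toy check with made-up log-probabilities for two preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-1.0, -0.8]),
    policy_rejected_logps=torch.tensor([-2.5, -1.9]),
    ref_chosen_logps=torch.tensor([-1.2, -1.0]),
    ref_rejected_logps=torch.tensor([-2.0, -1.8]),
)
print(loss.item())
```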
Enhancing Cross-Environment Generalization
Several aspects of CollabUIAgents contribute to its improved generalization capabilities:
- Role-Free Agents: The lack of fixed roles allows the multi-agent system to potentially adapt more readily to new environments or tasks where predefined specializations might be suboptimal.
- Generalizable Reward Signal: The CR strategy's reliance on LLM world knowledge provides a supervisory signal that encourages learning interaction principles applicable beyond the training environment.
- Edge Updates: Randomly varying the communication structure during training prevents overfitting to specific collaborative patterns and promotes robustness.
- Cross-Environment Adaptation:
  - Direct Transfer: Trained CollabUIAgents models can be directly applied to unseen environments, leveraging learned general UI interaction patterns. Performance improvements over the base single agent are observed even with direct transfer (e.g., mobile-trained agents on web tasks).
  - Continual MARL: For more significant domain shifts or maximizing performance in a new environment, the framework supports continual MARL. The agents undergo further training using the CR strategy on data collected (potentially autonomously) from the new environment. This allows specialization without catastrophic forgetting, as demonstrated by maintained performance on original mobile tasks after continual training on web tasks.
Empirical Validation and Insights
Empirical results demonstrate the effectiveness of CollabUIAgents.
- Performance: On mobile benchmarks (AndroidWorld, MobileMiniWoB++), the 7B-parameter `CollabUIAgents_mobile` achieves state-of-the-art results among open-source models, notably outperforming Gemini 1.5 Pro and matching or exceeding the performance of the GPT-4 critic used for CR. This highlights the framework's ability to distill strong capabilities from the critic into smaller, specialized agents.
- Generalization: CollabUIAgents shows significantly higher generalization uplift (`Δ_Generalization`) compared to baselines when tested on unseen tasks within the same domain. When transferring across domains (mobile to web), direct transfer shows gains over the base agent, and continual MARL (`CollabUIAgents_web`) achieves substantial performance on web benchmarks (Mind2Web, AutoWebBench), rivaling GPT-4 despite lacking large-scale web pre-training.
- Ablation Studies:
  - Removing the CR strategy (`w/o CR`) drastically reduces both performance and generalization, confirming its critical role.
  - Replacing DPO with rejection sampling fine-tuning (RFT) based on the CR rewards (`w/ PO -> RFT`) is less effective, emphasizing that the combination of granular CR rewards and preference optimization (DPO) is key (Takeaway 3).
  - Removing edge updates (`w/o Edge Update`) results in a minor decrease in performance and generalization, supporting their contribution to adaptability (Takeaway 4).
In conclusion, the CollabUIAgents framework effectively enhances language multi-agent learning by introducing a multi-agent credit re-assignment strategy. This strategy utilizes an LLM critic for granular, process-level rewards and optimizes policies via DPO on synthesized preference data. Combined with role-free agents and adaptive communication structures via edge updates, this approach yields significant improvements in both task performance and, notably, cross-environment generalization for interactive agents, demonstrating strong results even compared to powerful closed-source models.