A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning (2510.12838v3)

Published 13 Oct 2025 in cs.CL and cs.AI

Abstract: LLMs split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present the Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting a new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost-of-pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.

Summary

  • The paper presents A²FM, which integrates instant, reasoning, and agentic modes using a route-then-align strategy to optimize task processing.
  • It employs Adaptive Policy Optimization to dynamically fine-tune mode-specific sampling and cost-regularized rewards, improving both efficiency and accuracy.
  • Experimental results demonstrate that A²FM outperforms conventional models in agentic, reasoning, and general benchmarks by balancing resource use with high task performance.

"A²FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning"

The paper "A²FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning" presents a framework for enhancing the efficacy of LLMs through the integration of three complementary modes: agentic, reasoning, and instant. These modes are underpinned by a unified routing architecture that addresses specific inefficiencies of conventional LLMs, especially in balancing internal reasoning with external tool use.

Framework Overview

A²FM employs a route-then-align strategy: the model first identifies the optimal mode for a given task and then aligns the execution of that mode within a shared model backbone. This allows the model to adapt its processing strategy dynamically to task demands, optimizing both accuracy and computational efficiency (Figure 1).

Figure 1: Overview of A²FM. Left: the framework integrates three execution modes—instant, reasoning, and agentic—under a unified backbone with task-aware routing.
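
To make the route-then-align dispatch concrete, here is a minimal sketch of how a shared backbone might first predict a mode and then generate a trajectory conditioned on it. The method names (`predict_mode`, `generate`, `run_tool_loop`) are illustrative assumptions, not an API from the paper.

```python
from enum import Enum

class Mode(Enum):
    INSTANT = "instant"      # answer directly, no reasoning trace or tools
    REASONING = "reasoning"  # internal chain-of-thought
    AGENTIC = "agentic"      # tool-calling loop (search, code execution, ...)

def route_then_generate(model, query: str) -> str:
    """Hypothetical route-then-align dispatch: the shared backbone first
    emits a mode decision, then decodes a trajectory for that mode."""
    mode = model.predict_mode(query)  # task-aware routing step
    if mode is Mode.INSTANT:
        return model.generate(query)  # direct answer, minimal tokens
    if mode is Mode.REASONING:
        return model.generate(query, cot=True)  # long-form reasoning trace
    return model.run_tool_loop(query)  # interleave tool calls with generation
```

The point of the shared backbone is that all three paths reuse the same weights; only the decoded trajectory format differs.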

Components of A²FM

  • Instant Mode: Handles simple queries without unnecessary complexity, reducing resource requirements.
  • Reasoning Mode: Engages complex problem-solving tasks, deploying chain-of-thought (CoT) reasoning when needed.
  • Agentic Mode: Leverages external tools for environment-dependent tasks, such as those requiring web search or code execution.

A self-adaptive router orchestrates these modes and is trained with a reinforcement learning strategy termed Adaptive Policy Optimization (APO). APO enforces adaptive sampling across modes and applies a cost-regularized reward, tuning execution paths for both accuracy and efficiency.
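
The cost-regularized reward can be pictured as a correctness pay-off minus a cost penalty. The sketch below is an assumption about its general shape; the paper's exact functional form and coefficients are not reproduced here.

```python
def cost_regularized_reward(correct: bool, cost_usd: float, lam: float = 0.5) -> float:
    """Illustrative cost-regularized reward: reward a correct answer, then
    subtract a penalty proportional to execution cost (tokens + tool calls).
    `lam` trades accuracy against efficiency; the paper's actual form may differ."""
    task_reward = 1.0 if correct else 0.0
    return task_reward - lam * cost_usd

# A cheap correct answer scores higher than an expensive correct one, pushing
# the router toward the lightest mode that still solves the task.
print(cost_regularized_reward(True, 0.005))  # 0.9975
print(cost_regularized_reward(True, 0.050))  # 0.975
```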

Experimental Results

The empirical evaluation underscores A²FM's strong performance across agentic, reasoning, and general domains, where it consistently outperforms comparable LLM baselines. In key benchmarks:

  • On agentic benchmarks such as BrowseComp (13.4%), A²FM reduces operational cost while maintaining high accuracy, demonstrating effective mode allocation and tool use.
  • On reasoning tasks, the model is competitive with dedicated reasoning LLMs, evidenced by strong results on complex mathematical benchmarks such as AIME25 (70.4%).
  • Performance on general benchmarks illustrates A²FM's ability to balance diverse task requirements while significantly reducing token costs (Figures 2 and 3).

Figure 2: Comparison of adaptive mode (red) against forced single modes across four benchmarks.

Figure 3: Relation between task difficulty, allocation ratio, and accuracy for instant and non-instant modes.

Implementation and Adaptability

Adopting A²FM in practice involves deploying the model's routing system, which is fine-tuned on task-specific datasets through successive supervised and reinforcement learning stages:

  • Stage 1 (Route-then-Align Fine-Tuning): Supervised fine-tuning on diverse datasets that teaches task-aware routing and aligns mode-specific trajectory generation under the shared backbone.
  • Stage 2 (Adaptive Policy Optimization): Reinforcement learning that refines the model's mode selection, maximizing cost efficiency without compromising task accuracy.

This two-stage process yields a robust adaptive framework that incorporates explicit guidance from output verification and tool interaction.
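
A condensed sketch of the two-stage pipeline, reusing the reward function sketched above, might look like the following. The trainer interface (`sft_step`, `sample_per_mode`, `policy_update`) and data shapes are assumptions for illustration, not the paper's released code.

```python
def train_a2fm(backbone, routed_sft_data, rl_tasks):
    """Hypothetical two-stage pipeline: route-then-align SFT, then APO-style RL."""
    # Stage 1: supervised fine-tuning on trajectories labeled with their mode
    # (instant / reasoning / agentic), so routing and execution share one backbone.
    for query, mode_label, trajectory in routed_sft_data:
        backbone.sft_step(prompt=query, target=(mode_label, trajectory))

    # Stage 2: RL with adaptive sampling across modes and a cost-regularized reward.
    for task in rl_tasks:
        rollouts = backbone.sample_per_mode(task, n_per_mode=4)  # adaptive sampling
        for r in rollouts:
            r.reward = cost_regularized_reward(r.correct, r.cost_usd)
        backbone.policy_update(rollouts)  # e.g., a PPO/GRPO-style update
    return backbone
```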

Conclusion

The A²FM framework represents a significant stride toward adaptable LLMs that serve diverse functional requirements. By marrying reasoning strengths with the tactical use of external tools within a unified model, A²FM delivers gains in both efficiency and robustness across heterogeneous task landscapes. Future research could focus on enhancing the model's real-time adaptivity and on exploring additional tools and domains to broaden practical applicability.
