
Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

Published 29 Jun 2026 in cs.CL | (2606.30616v1)

Abstract: We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model that reaches trillion-parameter-level performance by scaling the agent horizon. We investigate agent-horizon scaling from two perspectives: scaling long-horizon trajectories and scaling heterogeneous agent abilities. To support this goal, we build a long-horizon knowledge-action infrastructure that connects external knowledge, actions, observations, and verifier outcomes, producing agentic trajectories with an average length of 45K tokens. Based on this, we train Agents-A1 with a three-stage recipe. First, we perform full-domain supervised fine-tuning to align the base model with broad agentic behaviors. Second, we train domain-level teacher models to capture specialized expertise in each domain. Third, we propose a multi-teacher domain-routed on-policy distillation with salient vocabulary alignment to improve knowledge transfer efficiency across different domains, unifying six heterogeneous domains into one deployable student model. Agents-A1 achieves strong and broad performance for long-horizon agent benchmarks. Compared with 1T-parameter models such as Kimi-K2.6 and DeepSeek-V4-pro, Agents-A1 achieves leading results on SEAL-0 (56.4), IFBench (80.6), HiPhO (46.4), FrontierScience-Olympiad (79.0), and MolBench-Bind (56.8), and remains highly competitive on SciCode (44.3), HLE (47.6) and BrowseComp (75.5). We hope this work provides the community with a practical path for scaling the horizon using a 35B agent that can reach or match the performance of 1T models on long-horizon tasks.

Summary

  • The paper introduces an agent model that achieves trillion-parameter performance by scaling long-horizon planning instead of increasing model size.
  • It employs Knowledge-Action Graphs and a three-stage training pipeline to enable explicit, process-level supervision across multiple domains.
  • Empirical results reveal that the 35B model outperforms same-scale baselines and rivals larger systems on complex, long-horizon agentic tasks.

Scaling Agentic Horizon: Trillion-parameter Performance with a 35B Agentic Model

Motivation and Problem Formulation

The paper "Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent" (2606.30616) addresses a fundamental challenge in modern agentic AI: achieving high competence in long-horizon, tool-rich, and multi-domain tasks without resorting to extreme model parameter scaling. The dominant paradigm in recent LLM research has been parameter scaling (e.g., trillion-parameter models), enabling memorization and implicit reasoning across diverse domains but imposing prohibitive computational and reproducibility requirements. This paper argues for an alternative, horizon-centric agent scaling, focusing on agentic infrastructure, multi-domain compositional abilities, and explicit process-level supervision.

Knowledge-Action Infrastructure for Long-horizon Supervision

A centerpiece of the approach is the design and deployment of a long-horizon Knowledge-Action Infrastructure. The infrastructure decomposes heterogeneous corpora into domain-specific Knowledge-Action Graphs (KAGs), where evidence, actions, observations, and verifier outcomes are modeled as linked objects. Agentic abilities are atomized into information acquisition, tool invocation, iterative computation, evidence verification, and constraint tracking. Each ability is supervised via KAGs and expanded through self-play proposer-solver-verifier loops. This explicit grounding exposes actionable feedback, records process trajectories, and facilitates cross-step credit assignment, moving supervision beyond isolated textual answers. Figure 1

Figure 1: KAG architecture converts diverse corpora into joint evidence-action-verification chains for supervision and curriculum augmentation.
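To make the "linked objects" idea concrete, here is a minimal sketch of how such a graph might be stored. The four node kinds come from the description above; every class name, field name, and the example payloads are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Node kinds follow the four object types named in the text; all class and
# field names here are illustrative assumptions, not the paper's schema.
NODE_KINDS = {"evidence", "action", "observation", "verifier_outcome"}


@dataclass
class Node:
    node_id: str
    kind: str
    payload: str

    def __post_init__(self) -> None:
        if self.kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {self.kind}")


@dataclass
class KnowledgeActionGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, node: Node, parents: Tuple[str, ...] = ()) -> None:
        """Insert a node and link it from each already-stored parent."""
        for parent in parents:
            if parent not in self.nodes:
                raise KeyError(f"missing parent: {parent}")
        self.nodes[node.node_id] = node
        self.edges.extend((parent, node.node_id) for parent in parents)

    def chain_to(self, node_id: str) -> List[str]:
        """Walk back along first parents to recover one trajectory.

        Assumes the graph is acyclic.
        """
        chain = [node_id]
        while True:
            parents = [src for src, dst in self.edges if dst == chain[-1]]
            if not parents:
                return list(reversed(chain))
            chain.append(parents[0])


kag = KnowledgeActionGraph()
kag.add(Node("e1", "evidence", "page states X was founded in 1901"))
kag.add(Node("a1", "action", "search('X founding year')"), parents=("e1",))
kag.add(Node("o1", "observation", "second source also says 1901"), parents=("a1",))
kag.add(Node("v1", "verifier_outcome", "answer '1901' accepted"), parents=("o1",))
print(kag.chain_to("v1"))  # ['e1', 'a1', 'o1', 'v1']
```

Because the verifier outcome is linked back through the observation, action, and evidence that led to it, a training signal can in principle be attached to any step of the chain rather than only to the final answer.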

Multi-domain Data Collection and Task Construction

The data pipeline leverages the KAG for six major agentic domains: long-horizon search, machine learning engineering, scientific research, instruction following, tool-calling, and general agentic tasks. In each domain, tasks are constructed by synthesizing question-answer pairs, constraining solution space, and providing process-level supervision. For example, search-task data consists of controlled random walks on wiki graphs, recording intermediate entity transitions, evidence chains, and verification for indirect answer recovery. Engineering and scientific reasoning tasks use executable trees, tool-augmented trajectories, and evaluation via grader or rubric modules. The instruction-following domain leverages verifiable constraints and long-context evidence, while tool-calling tasks are generated via graph search over dependency graphs, providing process and outcome rewards.
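The search-task construction can be pictured with a small sketch. It assumes a toy link graph and interprets "controlled" as a fixed hop count with no revisited entities; both are assumptions for illustration, and the entity names, `controlled_walk`, and `to_task` are hypothetical.

```python
import random
from typing import Dict, List, Optional

# Toy link graph standing in for a wiki; entities and links are made up.
WIKI: Dict[str, List[str]] = {
    "Marie Curie": ["Radium", "Sorbonne"],
    "Radium": ["Periodic table", "Marie Curie"],
    "Sorbonne": ["Paris"],
    "Periodic table": ["Dmitri Mendeleev"],
    "Paris": [],
    "Dmitri Mendeleev": [],
}


def controlled_walk(
    graph: Dict[str, List[str]], start: str, hops: int, rng: random.Random
) -> Optional[List[str]]:
    """Random walk that never revisits an entity; None if it dead-ends early."""
    path = [start]
    for _ in range(hops):
        candidates = [n for n in graph[path[-1]] if n not in path]
        if not candidates:
            return None
        path.append(rng.choice(candidates))
    return path


def to_task(path: List[str]) -> dict:
    """Record hop-by-hop transitions; the final entity is the hidden answer."""
    return {
        "start": path[0],
        "transitions": list(zip(path, path[1:])),
        "answer": path[-1],
    }


rng = random.Random(0)
path = None
while path is None:  # resample until a full-length walk is found
    path = controlled_walk(WIKI, "Marie Curie", hops=3, rng=rng)
task = to_task(path)
print(task["answer"])  # Dmitri Mendeleev (the only 3-hop walk in this toy graph)
```

The recorded transitions play the role of the evidence chain: a verifier can replay them against the graph to confirm that the answer is reachable only through the intended intermediate entities.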

Three-stage Training Pipeline

Agents-A1 employs a three-stage training protocol:

  1. Full-domain Supervised Fine-Tuning: Initializes from a generic MoE baseline (Qwen3.5-35B-A3B) and aligns behaviors across domains through SFT on long-horizon agentic trajectories (average 45K tokens per sample).
  2. Domain-level Teacher Training: Each domain is trained via SFT and/or RL protocols. Search, scientific reasoning, instruction following, and tool-calling domains each employ domain-optimized feedback loops:
    • Search: GRPO-based RL with correctness, efficiency, and repetition penalties.
    • Science: Two-stage SFT for intrinsic reasoning and extrinsic interaction.
    • Instruction: RL with rule-based reward and dynamic sampling.
    • Tool-calling: SFT followed by RL using outcome and process rewards with PAPO-style asymmetric advantage.
  3. Multi-teacher Domain-routed On-Policy Distillation (OPD): Specialized teachers are consolidated into a single deployable student model using OPD with Salient Vocabulary Alignment (SVA). Each sample is routed to its domain-specific teacher, with the SVA loss calculated over teacher-selected token sets and aggregated via domain-normalized objectives to prevent domain imbalance and collapse. Figure 2

    Figure 2: Three-stage pipeline: SFT, domain teacher training, and domain-routed OPD with SVA, integrating heterogeneous agentic abilities.
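The following sketch shows one plausible reading of the stage-3 objective. It assumes the teacher-selected token set is the teacher's top-k tokens, that both distributions are renormalized over that shortlist, that the divergence is KL(student || teacher), and that "domain-normalized" means averaging within each domain before averaging across domains. None of these details are confirmed by the summary above, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sva_token_loss(
    teacher_logits: Sequence[float], student_logits: Sequence[float], k: int
) -> Tuple[float, float]:
    """Return (loss, coverage) for one token position.

    Assumed form: take the teacher's top-k tokens, renormalize both
    distributions over that shortlist, and compute KL(student || teacher).
    Coverage is the teacher probability mass captured by the shortlist.
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    shortlist = sorted(range(len(p_teacher)), key=lambda i: -p_teacher[i])[:k]
    coverage = sum(p_teacher[i] for i in shortlist)
    student_mass = sum(p_student[i] for i in shortlist)
    loss = 0.0
    for i in shortlist:
        q = p_student[i] / student_mass
        p = p_teacher[i] / coverage
        if q > 0.0:  # skip terms that underflowed to zero
            loss += q * math.log(q / p)
    return loss, coverage


def domain_normalized(losses: List[Tuple[str, float]]) -> float:
    """Average within each domain first, then across domains."""
    by_domain: Dict[str, List[float]] = defaultdict(list)
    for domain, loss in losses:
        by_domain[domain].append(loss)
    return sum(sum(v) / len(v) for v in by_domain.values()) / len(by_domain)


teacher = [4.0, 2.0, 0.5, -1.0, -3.0]
same, cov = sva_token_loss(teacher, teacher, k=2)
shifted, _ = sva_token_loss(teacher, [0.5, 4.0, 2.0, -1.0, -3.0], k=2)
print(same, shifted > same, round(cov, 3))

# Three search samples and one science sample: the plain mean would be 0.65,
# but each domain contributes equally under domain normalization.
batch = [("search", 0.9), ("search", 0.7), ("search", 0.8), ("science", 0.2)]
print(round(domain_normalized(batch), 3))  # 0.5
```

In a full pipeline the teacher logits for each sample would come from that sample's domain teacher, evaluated on the student's own rollout; the sketch only covers the per-token loss and the aggregation step.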

Empirical Results and Comparative Evaluation

Agents-A1 is evaluated across a comprehensive suite of benchmarks: long-horizon search (BrowseComp, SEAL-0, GAIA), machine learning engineering (SciCode, MLE-Bench-Lite), scientific reasoning (HLE, HiPhO, FrontierScience-Olympiad/Research), instruction following (IFBench, LongBench V2), general agentic tasks (τ²-Bench, VitaBench), and specialized scientific agentic tasks (MatTools, MolBench-Bind).

Notably, Agents-A1 achieves:

  • SEAL-0: 56.4
  • IFBench: 80.6
  • HiPhO: 46.4
  • FrontierScience-Olympiad: 79.0
  • MolBench-Bind: 56.8

These results are highly competitive with, and in several cases surpass, leading 1T-parameter models such as Kimi-K2.6 and DeepSeek-V4-Pro. For BrowseComp, XBench-DS-2510, and GAIA, Agents-A1 either matches or exceeds the performance of larger frontier models. On engineering and scientific tasks, Agents-A1 outperforms all same-scale (35B) baselines and narrows the gap to state-of-the-art trillion-parameter systems. Figure 3

Figure 3: Agents-A1 benchmark scores, showing parity with or superiority over both same-scale and 1T-parameter models on long-horizon agentic tasks.

Optimization Trajectories and Case Studies

The practical agentic competence of Agents-A1 is demonstrated via extended optimization runs (e.g., 12-hour MLE optimization on whale call detection), revealing non-trivial sequences of interventions, robust goal-tracking, and agile integration of augmented representations and cross-domain tools. Across thousands of turns, the agent executes planning, reflection, iterative evidence acquisition, and persistent memory utilization, driving validation metrics to gold-medal benchmarks. Figure 4

Figure 4: Agents-A1 optimization trajectory on MLE task, exhibiting sustained AUC improvement and distinct algorithmic breakthroughs.

In Earth science cases, Agents-A1 reconstructs cyclone tracks from IBTrACS data, generates domain-aligned diagnostics, and articulates cross-validations between conflicting operational convention estimates, showcasing advanced closed-loop planning and analysis. Figure 5

Figure 5: Agents-A1-generated cyclone analysis, integrating physical trajectory, diagnostic computation, and stepwise annotation.

Implications, Limitations, and Future Directions

The primary implication is a technical path toward unified agentic competence—scaling actionable horizon rather than parameters, by deploying explicit process supervision and multi-domain teacher distillation. This enables practical agentic deployment at reproducible compute, democratizing advanced reasoning and interactive ability.

A key limitation remains in engineering tasks that demand persistent goal memory, lengthy experiments, and non-static optimization: there, Agents-A1 outperforms same-scale baselines but still trails larger models. The paper underscores the importance of basic atomic abilities for horizon scaling: planning, reflection, summarization, and adaptive recall. The approach’s modularity leaves room for future research into horizon-centric atomic abilities, agentic curriculum design, and scaling persistent process memory.

Conclusion

Agents-A1 demonstrates that scaling agentic horizon via process-structured supervision, multi-domain teacher policies, and OPD with SVA delivers robust, deployable competence across long-horizon agentic tasks—matching or exceeding trillion-parameter models at only 35B scale. The Knowledge-Action Infrastructure and the domain-routed OPD pipeline collectively chart a reproducible, actionable direction for agentic AI research, supporting scalable agentic deployment and further horizon-centric advances.

Explain it Like I'm 14

What this paper is about

This paper introduces Agents-A1, an AI “agent” that’s good at solving long, complicated tasks by planning many steps ahead and using tools like web search, code execution, and document reading. Instead of making the AI much bigger (with more parameters), the authors show you can make it much smarter at long tasks by improving how far it can plan and how well it learns from real actions and feedback. Their 35-billion-parameter model matches or beats the performance of some models with around a trillion parameters on tests that require long, careful reasoning.

The big questions the authors asked

  • Can a smaller AI agent (35B) reach the performance of much larger ones (≈1T) on tasks that need many steps, planning, and tool use?
  • What kind of training “world” (data and environment) does an AI need to really learn long, reliable decision-making?
  • How can we combine many different skills (like searching the web, writing and running code, following strict instructions, doing science problems) into one model without the skills clashing?

How they did it (methods)

The authors’ approach has two main ideas: build the right practice world, then teach the model in three stages.

Building a “knowledge–action” world

The team created a training setup where the AI doesn’t just read text—it takes actions, observes what happens, and gets checked. Think of it like a detailed lab notebook or a game replay that records every step:

  • What the agent knows so far (evidence it found)
  • What action it took (searching, clicking a link, calling a tool, editing code)
  • What it saw next (webpage content, tool output, error messages)
  • Whether a “verifier” says the step or final answer is correct

They store this as a “Knowledge–Action Graph” (KAG). It’s like a map showing not only the final answer but the whole journey, including mistakes and fixes. This helps the agent learn how to plan, recover from errors, and verify evidence—skills that normal text datasets don’t teach well.

To make lots of good practice problems, they use a “proposer–solver–verifier” loop (self-play):

  • A proposer creates tasks.
  • A solver (the AI) tries to solve them using tools.
  • A verifier checks if the steps and final answer are correct and non-cheaty (no shortcuts). Good tasks and solution paths are saved; bad ones are fixed or thrown out.
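The loop above can be sketched in a few lines. This is a toy version under stated assumptions: the tasks are simple multiplications, the "solver" is a fixed function rather than a model, and rejected items are only thrown out (the fixing step is omitted). All function names are hypothetical.

```python
import random
from typing import Callable, Dict, List

Task = Dict[str, int]


def propose(rng: random.Random) -> Task:
    """Proposer: emit a task with a checkable ground truth (toy arithmetic)."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return {"a": a, "b": b, "truth": a * b}


def solve(task: Task) -> List[int]:
    """Solver: return a step-by-step trajectory (repeated addition)."""
    steps, total = [], 0
    for _ in range(task["b"]):
        total += task["a"]
        steps.append(total)
    return steps


def verify(task: Task, steps: List[int]) -> bool:
    """Verifier: check the final answer and that every step adds exactly `a`,
    so a trajectory that jumps straight to the answer is rejected."""
    if not steps or steps[-1] != task["truth"]:
        return False
    previous = [0] + steps[:-1]
    return all(cur - prev == task["a"] for prev, cur in zip(previous, steps))


def self_play(
    rounds: int,
    rng: random.Random,
    solver: Callable[[Task], List[int]] = solve,
) -> List[dict]:
    """Keep only the tasks whose trajectories pass verification."""
    kept = []
    for _ in range(rounds):
        task = propose(rng)
        steps = solver(task)
        if verify(task, steps):
            kept.append({"task": task, "trajectory": steps})
    return kept


dataset = self_play(rounds=5, rng=random.Random(0))
# A solver that skips the work and emits only the final answer:
shortcut = self_play(5, random.Random(0), solver=lambda t: [t["truth"]])
print(len(dataset), len(shortcut))
```

The honest solver keeps all five trajectories, while the shortcut solver survives only on tasks where a single step is genuinely enough (here, when `b` is 1). That is the point of checking the process as well as the outcome.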

They build this for several domains, for example:

  • Search over a wiki-like web graph to find answers via multi-hop links, keeping track of the pages and evidence used.
  • Machine Learning Engineering (MLE): write, run, and improve code to get better scores on competitions, with a “tree” of code versions and results.
  • Scientific reasoning: step-by-step math/physics reasoning, with and without tools (search, code, scholar).
  • Instruction following: obey strict, checkable rules (format, length, language) and find answers in long documents while ignoring distractors.
  • General tool calling: solve tasks by using a chain of compatible tools in the right order, with checks that the tools were used correctly.

In short, the KAG ensures the agent learns from realistic, verifiable, multi-step practice, not just final answers.
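For the tool-calling domain in particular, "a chain of compatible tools in the right order" can be illustrated as a search over a small dependency graph. The sketch assumes each tool has exactly one input type and one output type; the tool names and types are invented for the example.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical tools: name -> (input type, output type).
TOOLS: Dict[str, Tuple[str, str]] = {
    "geocode": ("address", "coords"),
    "weather": ("coords", "forecast"),
    "summarize": ("forecast", "text"),
    "translate": ("text", "text_fr"),
    "timezone": ("coords", "tz"),
}


def find_chain(have: str, want: str) -> Optional[List[str]]:
    """Breadth-first search over data types; returns a shortest tool chain."""
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        current, chain = queue.popleft()
        if current == want:
            return chain
        for name, (src, dst) in TOOLS.items():
            if src == current and dst not in seen:
                seen.add(dst)
                queue.append((dst, chain + [name]))
    return None


def chain_is_valid(chain: List[str], have: str, want: str) -> bool:
    """Process-level check: each tool must accept the previous tool's output."""
    current = have
    for name in chain:
        src, dst = TOOLS[name]
        if src != current:
            return False
        current = dst
    return current == want


chain = find_chain("address", "text_fr")
print(chain)  # ['geocode', 'weather', 'summarize', 'translate']
print(chain_is_valid(chain, "address", "text_fr"))  # True
print(chain_is_valid(["weather", "geocode"], "address", "coords"))  # False
```

The same check can score an agent's own trajectory: the outcome reward asks whether the goal type was reached, and the process check asks whether every call along the way was fed a compatible input.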

Training in three steps

The training recipe has three stages:

  • Stage 1: Supervised fine-tuning (SFT) across all domains. The model (a 35B Mixture-of-Experts model; think of it as a team of specialist mini-models) learns general “agent” behavior from many long trajectories (on average about 45,000 tokens each). It learns to plan, use tools, verify, and summarize.
  • Stage 2: Train domain-specific “teacher” models. For each domain (like web search or scientific reasoning), they specialize a teacher model. For example, the search teacher is further trained with reinforcement learning (RL), where the model gets rewards for correct answers and for efficient, non-repetitive searching.
  • Stage 3: Combine the skills into one student (multi-teacher distillation)
    • Domain routing: each training example is guided by the teacher from its own domain, so signals don’t conflict.
    • Salient Vocabulary Alignment (SVA): instead of comparing the entire vocabulary at each step, the student focuses on the teacher’s most important likely next tokens (a “local shortlist”), making learning more stable and efficient.
    • Balanced updates: they normalize training so no single domain overwhelms the others.

This three-step process unifies many different, sometimes clashing skills into one deployable agent.
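As an illustration of the Stage 2 search reward (correct answers, efficient and non-repetitive searching), here is a hedged sketch. The penalty values, the number of free rounds, and the exact way repeats are counted are all assumptions chosen for the example, and the group-relative normalization is written in the general style of GRPO rather than taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def search_reward(
    correct: bool,
    queries: List[str],
    free_rounds: int = 3,         # assumed: rounds allowed before any cost
    step_penalty: float = 0.05,   # assumed penalty per extra round
    repeat_penalty: float = 0.1,  # assumed penalty per repeated query
) -> float:
    """Correctness reward minus efficiency and repetition penalties."""
    reward = 1.0 if correct else 0.0
    extra_rounds = max(0, len(queries) - free_rounds)
    normalized = [q.strip().lower() for q in queries]
    repeats = len(normalized) - len(set(normalized))
    return reward - step_penalty * extra_rounds - repeat_penalty * repeats


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative normalization: each rollout's reward is centered and
    scaled by the statistics of the rollouts sampled for the same task."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts for the same question (made-up queries).
rollouts = [
    (True, ["curie birthplace", "warsaw history"]),
    (True, ["curie", "curie", "curie birthplace", "warsaw", "warsaw history"]),
    (False, ["curie"]),
]
rewards = [search_reward(ok, qs) for ok, qs in rollouts]
advantages = group_advantages(rewards)
print([round(r, 2) for r in rewards])  # [1.0, 0.8, 0.0]
print(advantages[0] > advantages[1] > advantages[2])  # True
```

Both correct rollouts beat the incorrect one, but the one that wandered and repeated a query earns less, which is the behavior the reward is meant to encourage.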

What they found

Agents-A1, with just 35B parameters, achieves top or leading scores on several “long-horizon” benchmarks—tests that require many steps, tool use, and careful reasoning. It beats or matches some ≈1T-parameter models (like Kimi-K2.6 and DeepSeek-V4-pro) on:

  • SEAL-0 (open-ended answer finding with evidence)
  • IFBench (strict instruction following)
  • HiPhO (scientific reasoning)
  • FrontierScience-Olympiad (challenging science problems)
  • MolBench-Bind (molecular binding reasoning)

It also performs strongly on:

  • SciCode (coding)
  • HLE (Humanity's Last Exam, a set of very hard expert-level questions)
  • BrowseComp (web browsing competition)

Why this matters: It shows that “thinking longer and better” (scaling the horizon) can rival “being bigger” (scaling parameters), especially for real-world tasks where you must plan, search, verify, and adapt.

Why it matters (implications)

  • A practical path to powerful agents without massive size: Smaller models can reach big-model performance if trained to plan far ahead, use tools well, and learn from step-by-step, verifiable feedback.
  • More accessible and cost-effective: Training and running a 35B model is far cheaper than a 1T model, lowering barriers for labs, startups, and researchers.
  • Better real-world reliability: By grounding decisions in evidence and verification, the agent is more likely to stay accurate over long tasks and recover from mistakes.
  • Broad, reusable foundation: The knowledge–action infrastructure and multi-teacher distillation recipe can be reused to add new domains, tools, or skills over time.

In simple terms: the authors show that teaching an AI how to plan carefully, use tools, and check its work—over long stretches—is a powerful alternative to just making it bigger. This could make advanced AI agents more affordable, reliable, and widely available.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of the main uncertainties and omissions that future work could address:

  • MoE architecture and routing are unspecified (number of experts, layer placement, gating, load balancing), making it unclear how parameter- and compute-efficiency are achieved and how inference latency scales.
  • Lack of ablations quantifying the contribution of each component (KAG infrastructure, SFT, domain-level teachers, RL on search, OPD, Salient Vocabulary Alignment) to overall gains.
  • No sensitivity analysis for SVA hyperparameters (e.g., top‑k size, renormalization choices) or for the domain-normalized objective (e.g., domain weighting, batch composition).
  • Domain routing at inference is undefined: how tasks are assigned to teachers or routed policies when domain labels are ambiguous or multi-domain, and how the system handles unseen domains.
  • On-policy distillation may amplify student errors due to conditioning teachers on student prefixes; there is no analysis of error propagation or mitigation (e.g., hybrid off-policy targets).
  • SVA monitors coverage ρ but does not use it for training control (e.g., regularization, adaptive k); how low coverage affects convergence and generalization remains open.
  • Comparisons to 1T-parameter baselines lack details on evaluation parity (identical tools, prompts, context lengths, temperature, decoding, retriever quality) and do not report variance or statistical significance.
  • Compute/latency/memory costs for training and inference with 45K–256K contexts and many tool calls are not reported; cost–performance trade-offs and energy efficiency are unknown.
  • Long-horizon robustness is unmeasured beyond final accuracy (e.g., path correctness, recovery from early mistakes, compounding errors, and efficiency metrics like tool-call count and wall-clock time).
  • KAG quality and verifier trustworthiness rely on LLM judges and automatic checks; calibration, bias, false positives/negatives, and susceptibility to reward hacking are not evaluated.
  • Potential data contamination/leakage between KAG-derived data (e.g., Wikipedia/web content, MLE tasks) and evaluation benchmarks is not audited; overlap checks and decontamination protocols are missing.
  • Tool-sandbox safety and isolation are under-specified (resource limits, network/FS access policies, side-effect containment, dependency control), posing reproducibility and security questions.
  • Generalization to unseen or changing tools/APIs and web environments (dynamic content, cookie walls, anti-bot measures) is not studied; robustness to schema drift remains unclear.
  • The “read_page” step relies on LLM summarization; the impact of summarizer quality and information loss on downstream performance is unquantified.
  • Context management strategy (KV-cache policies, compaction/summarization criteria, retrieval vs. full-context usage) and its effect on accuracy over very long horizons are not detailed.
  • Handling of tool failures (timeouts, malformed returns, partial data) and fallback policies (retries, backoff, degraded modes) at inference time are not specified.
  • Cross-domain interference and negative transfer are not analyzed; whether domain-normalized aggregation prevents catastrophic forgetting or harms rare domains is unknown.
  • Multilingual capability and cross-lingual generalization are unreported (data language mix, performance across languages, tool behavior in non-English contexts).
  • MLE domain reproducibility (dataset licensing, environment pinning, seeding, leakage from public Kaggle solutions) and generalization to new competitions are not assessed.
  • Instruction-following verifiers focus on surface constraints; generalization to semantic, ambiguous, or conflicting constraints (and trade-offs under conflict) is untested.
  • Reward shaping for search RL (free rounds K, penalty scales, repetition windows) lacks ablations; the accuracy–efficiency trade-off and potential for gaming the reward are unexamined.
  • Horizon-scaling limits are unclear: behavior beyond 256K context, diminishing returns with longer trajectories, and the comparative value of horizon vs parameter scaling are unquantified.
  • Alternative distillation objectives (full-vocabulary KL, temperature-scaled KD, sequence-level objectives) are not compared; convergence and stability trade-offs remain open.
  • Inference-time planning/reflection routines (self-critique, verifier-guided adjustments, rollback) are not described; whether training-time verifiers are used at inference is unclear.
  • Benchmark reporting lacks per-task breakdowns, multiple seeds, confidence intervals, and robustness checks (distribution shifts, OOD tasks).
  • Release scope is ambiguous: availability of KAG datasets, verifiers, tool harnesses, and teacher checkpoints needed for reproducibility is not clarified.
  • Safety and alignment are not addressed (harmful content handling, prompt injection, tool misuse, code execution risks, browsing safety); no red-teaming or guardrail evaluation is provided.
  • Ethical/legal aspects of web data and competition data usage (licensing, consent, derivative works) are not discussed.
  • Interpretability is absent: there is no analysis linking “atomic abilities” to emergent behaviors or model internals, nor diagnostics tying KAG structure to learned policies.
  • Domain definitions and routing granularity are unclear (how the six domains are delineated, how overlapping tasks are handled, whether routing is token-, turn-, or task-level).
  • Persistence and integrity of agent memory (write_notes/read_notes) under context compaction are not evaluated for drift, hallucinated memory, or long-run consistency.
  • Adversarial/OOD robustness is untested (poisoned web pages, schema poisoning, adversarial prompts, misleading evidence); defenses and detection mechanisms are unspecified.

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging the paper’s released methods and demonstrated performance on long‑horizon benchmarks (SEAL‑0, IFBench, HiPhO, FrontierScience‑Olympiad, MolBench‑Bind, SciCode, HLE, BrowseComp) and the described infrastructure (Knowledge‑Action Graphs, tool sandbox, multi‑teacher on‑policy distillation with salient vocabulary alignment).

  • Evidence‑grounded research assistant for long‑form investigations
    • Sector: software, media, finance, policy analysis, enterprise R&D
    • What it does: Conducts multi‑step web research (search, read_page, code summarization), cites provenance, integrates multi‑source evidence, and verifies findings. Useful for analyst briefings, due‑diligence reports, competitive analysis, and literature reviews.
    • Tools/products/workflows: “Deep Research” copilot with KAG-based provenance; workspace plugin that attaches source chains to every claim; report templates with verifier checks (correctness, coverage).
    • Assumptions/dependencies: Access to commercial search APIs and scraping/summarization stack; 256K‑token context or equivalent memory; content licensing/compliance; robust verifiers tuned to domain.
  • Instruction‑following content generator with strict constraint tracking
    • Sector: education, customer support, marketing, public sector communications
    • What it does: Produces outputs adhering to length, format, language, keyword, and structure constraints with automatic validation (IFBench‑style capabilities).
    • Tools/products/workflows: CMS integrations that prevalidate responses; “policy‑compliant drafting assistant” for templated communications.
    • Assumptions/dependencies: Validator coverage for constraints; clear templates/specs; monitoring for edge cases where constraints conflict.
  • Scientific reasoning and literature‑aware STEM problem solving
    • Sector: academia, education, industrial R&D (materials, chemistry, physics)
    • What it does: Solves math/physics problems with verifiable multi‑step derivations; invokes code, search, scholar tools for numeric/symbolic computation and literature grounding.
    • Tools/products/workflows: Courseware tutor producing stepwise solutions; lab assistant drafting methods sections with cited prior art; simulation‑assisted problem solvers.
    • Assumptions/dependencies: Reliable execution sandbox for code; access to scholarly APIs; human review for high‑stakes scientific claims.
  • Machine Learning Engineering (AutoML‑style) agent for competition‑like tasks
    • Sector: software, MLOps, data science platforms
    • What it does: Iteratively authors, patches, executes, and evaluates ML pipelines in a verifiable tree (write_full_code, patch_code, execute_code, analyze, update_answer).
    • Tools/products/workflows: “MLE harness” plugin for internal Kaggle‑like experiments; continuous model‑tuning assistant that documents decisions and results via KAG.
    • Assumptions/dependencies: Sandbox with dataset access and evaluator; graded tasks or internal metrics; governance for code execution permissions and resource usage.
  • Tool‑orchestration copilot for enterprise workflows
    • Sector: software, operations, BI/analytics, internal tooling
    • What it does: Calls chained tools based on schema compatibility and state dependencies; tracks state updates and verifier outcomes to reach goal completion.
    • Tools/products/workflows: Low‑code agent that sequences ETL, analytics queries, report generation, and notifications; “graph‑composed” workflows ensuring each step is grounded in prior state.
    • Assumptions/dependencies: Stable tool schemas and state models; tool sandboxing; domain verifiers for success criteria.
  • Long‑context document QA and policy adherence checks
    • Sector: legal ops (non‑advisory), compliance, procurement, HR
    • What it does: Answers multi‑hop questions over long documents; applies in‑context rules and rejects distractors; produces evidence chains.
    • Tools/products/workflows: “Contract navigator” for clause retrieval and checklist validation; “policy adherence checker” for procedural documents.
    • Assumptions/dependencies: High‑quality document parsing and entity graphs; agreement on validators; legal review in high‑stakes settings.
  • Cost‑efficient deployment for long‑horizon tasks with smaller models
    • Sector: AI platforms, SaaS
    • What it does: Delivers near trillion‑parameter performance on long‑horizon tasks with a 35B MoE model via horizon‑scaling, reducing inference cost/latency compared to 1T models.
    • Tools/products/workflows: Model hosting with long‑context support; runtime policy enforcing tool limits and verifier feedback loops.
    • Assumptions/dependencies: Hardware with sufficient memory for 35B MoE and long context; routing to teacher‑distilled student; latency budgets compatible with multi‑turn tool calls.
  • Agent training and audit infrastructure for internal AI teams
    • Sector: AI engineering, applied research
    • What it does: Uses KAGs to create verifiable, evolvable training data; applies multi‑teacher on‑policy distillation with salient vocabulary alignment to merge domain experts into one agent.
    • Tools/products/workflows: “Trajectory ledger” for tools/evidence/verification; OPD‑SVA training pipelines to consolidate departmental agents.
    • Assumptions/dependencies: Availability of domain teachers; data engineering for KAG construction; MLOps capacity for OPD.
  • Education: step‑by‑step STEM tutoring with verifiable reasoning
    • Sector: education technology
    • What it does: Provides guided solutions with intermediate checks; blends no‑tool reasoning and code‑augmented computation.
    • Tools/products/workflows: Tutor chat with step validation; homework assistance that cites sources and shows derivations.
    • Assumptions/dependencies: Academic integrity policies; restricted tool modes for exams; accuracy audits.
  • Journalism and knowledge management with provenance
    • Sector: media, knowledge operations
    • What it does: Curates stories/reports with linked evidence trails; flags unverified claims via verifiers; maintains a KAG for editorial review.
    • Tools/products/workflows: Newsroom research assistant; editorial dashboards showing action‑observation chains.
    • Assumptions/dependencies: Editorial standards; caching strategies to reduce API costs; bias and source‑credibility checks.

Long‑Term Applications

These applications are promising but require additional research, domain‑specific tooling, scaling, or regulatory alignment before wide deployment.

  • Clinical literature synthesis and decision support with audit trails
    • Sector: healthcare
    • What it could do: Multi‑step evidence‑backed clinical Q&A and guidelines synthesis with verifiable citations and tool‑assisted computations (dosage calculators, risk models).
    • Tools/products/workflows: “Evidence copilot” integrated with clinical knowledge bases; KAG‑logged consultations for audit.
    • Assumptions/dependencies: Regulatory clearance, medical device compliance; updated, licensed clinical sources; strict human‑in‑the‑loop and safety verifiers.
  • Automated regulatory and policy analysis with traceability
    • Sector: government, legal/compliance, energy/environment policy
    • What it could do: Traverse long statutes/policies, produce impact analyses with evidence chains, and check compliance constraints across documents.
    • Tools/products/workflows: Policy analysis workbench; rule‑graph validators; cross‑document constraint trackers.
    • Assumptions/dependencies: Domain‑specific verifiers for legal/policy correctness; access to up‑to‑date corpora; expert oversight.
  • Scientific discovery loops and closed‑loop experimentation
    • Sector: academia, pharmaceuticals, materials, energy
    • What it could do: Hypothesis generation → literature grounding → simulation/analysis → planning next experiments, with KAGs linking each step.
    • Tools/products/workflows: Lab OS integration (ELNs, LIMS), simulation tools; verifier modules for reproducibility and methodology checks.
    • Assumptions/dependencies: High‑fidelity simulators; lab equipment integration; robust safety and ethical governance.
  • Production AutoML and model lifecycle management agents
    • Sector: software, MLOps, finance, e‑commerce
    • What it could do: Extend the MLE harness to managed AutoML: feature engineering, model selection, deployment, monitoring, and rollbacks, all with verifiable trajectories.
    • Tools/products/workflows: “AutoMLE+Ops” agent controlling CI/CD for models with policy gates and verifiers for fairness, drift, and performance.
    • Assumptions/dependencies: Enterprise MLOps integration; risk controls; domain‑specific evaluators; change‑management policies.
  • Robotics and autonomous operations with long‑horizon planning
    • Sector: robotics, manufacturing, logistics
    • What it could do: Map the tool‑calling KAG to robotic actions and sensor states, enabling verifiable long‑horizon plans with failure recovery.
    • Tools/products/workflows: Planner that composes skills via state‑transition graphs; safety verifiers for constraints and recovery policies.
    • Assumptions/dependencies: Real‑time perception/control integration; rigorous safety certification; simulation‑to‑real transfer.
  • Financial analysis copilots with compliance and auditability
    • Sector: finance
    • What it could do: Long‑form multi‑document analysis (filings, transcripts), scenario modeling via code tools, and compliance checks with audit trails.
    • Tools/products/workflows: Research workstation with data vendor APIs; compliance verifiers; portfolio risk analysis notebooks linked to KAG events.
    • Assumptions/dependencies: Licensed market data; strict compliance and controls; model risk management frameworks.
  • Enterprise tool‑graph orchestration layer (“Agentic OS”)
    • Sector: enterprise software, operations, CRM/ERP
    • What it could do: Standardize tool schemas and dependency graphs across departments; route tasks through verifiable multi‑step agent workflows.
    • Tools/products/workflows: Central “agent bus” with schema registry, state store, and verifier catalog; per‑domain teacher models unified via OPD‑SVA.
    • Assumptions/dependencies: Organization‑wide API standardization; security and access controls; monitoring and rollback frameworks.
  • KAG‑driven AI governance and audit products
    • Sector: cross‑industry governance, risk, and compliance (GRC)
    • What it could do: Provide regulators and internal auditors with reproducible, step‑level logs of agent decisions, evidence, and verifier outcomes for any automated process.
    • Tools/products/workflows: “Trajectory auditor” dashboards, evidence replay, risk scoring, and anomaly detection over agent logs.
    • Assumptions/dependencies: Standardized event schemas; data retention policies; privacy controls.
  • Education at scale: personalized mastery learning with verifiers
    • Sector: education
    • What it could do: Adaptive curricula where each reasoning step is validated; automatic remediation plans; code‑assisted experimentation.
    • Tools/products/workflows: LMS integration with per‑student KAGs; assessment verifiers; intervention recommendations.
    • Assumptions/dependencies: Alignment with curricula/assessments; guardrails against over‑assistance; accessibility and fairness considerations.
  • Safety‑critical multi‑agent simulation and planning
    • Sector: transportation, energy grid operations, emergency response
    • What it could do: Plan and evaluate multi‑step interventions in simulated environments with verifiable outcomes and counterfactuals.
    • Tools/products/workflows: Simulator integrations; domain verifiers for constraints (e.g., grid stability, evacuation protocols).
    • Assumptions/dependencies: High‑fidelity simulators; human oversight; regulatory approvals.
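Several of the applications above depend on step-level logs that link evidence, actions, observations, and verifier outcomes. The following is a minimal sketch of how such a log might be stored and replayed for an audit. The `KagEvent` record, its field names, and the `evidence_chain` helper are hypothetical illustrations, not the paper's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KagEvent:
    # Hypothetical record layout; the paper does not specify a schema.
    event_id: str
    kind: str            # "evidence", "action", "observation", or "verifier"
    summary: str
    parents: tuple = ()  # ids of the events this one depends on


def evidence_chain(events, target_id):
    """Return target_id and all of its ancestors, parents before children."""
    by_id = {e.event_id: e for e in events}
    ordered, visited = [], set()

    def visit(event_id):
        if event_id in visited:
            return
        visited.add(event_id)
        for parent in by_id[event_id].parents:
            visit(parent)
        # Post-order append: every parent is emitted before its child.
        ordered.append(by_id[event_id])

    visit(target_id)
    return ordered


log = [
    KagEvent("e1", "evidence", "retrieved filing excerpt"),
    KagEvent("a1", "action", "ran scenario script", parents=("e1",)),
    KagEvent("o1", "observation", "script output", parents=("a1",)),
    KagEvent("v1", "verifier", "compliance check passed", parents=("o1", "e1")),
    KagEvent("x1", "evidence", "unrelated document"),
]
chain_ids = [e.event_id for e in evidence_chain(log, "v1")]
```

Replaying only the ancestors of a verifier outcome keeps an audit focused on the steps that actually contributed to that decision; unrelated events such as `x1` are left out.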

Cross‑cutting assumptions and dependencies

  • Long‑context and compute: Many workflows rely on very long context windows (up to ~256K tokens) and a 35B MoE model; deployment needs GPUs/TPUs with sufficient memory and optimized serving.
  • Tooling and verifiers: Effectiveness hinges on stable tool schemas, secure sandboxes for code execution, and robust verifier design per domain.
  • Data governance and compliance: Web and document ingestion must respect licensing and privacy; high‑stakes domains require human‑in‑the‑loop and formal risk controls.
  • Domain adaptation: Best performance requires domain‑specific teachers and careful OPD‑SVA routing; generalization depends on KAG quality and data diversity.
  • Monitoring and safety: Multi‑turn agents can drift or loop; policies for tool limits, repetition penalties, and failure recovery should be enforced in production.
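As an illustration of the monitoring point above, a production loop might check every proposed tool call against a call budget and a repetition limit before executing it. This is a sketch under stated assumptions: the `LoopGuard` class, its default limits, and the status strings are hypothetical and do not come from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LoopGuard:
    # Hypothetical limits; real values would be tuned per deployment.
    max_tool_calls: int = 50
    max_repeats: int = 3
    calls: int = 0
    seen: Counter = field(default_factory=Counter)

    def check(self, tool_name, arguments):
        """Return 'ok', 'budget_exceeded', or 'repetition' for a proposed call."""
        self.calls += 1
        if self.calls > self.max_tool_calls:
            return "budget_exceeded"
        # Identical tool name and arguments count as the same call.
        key = (tool_name, repr(sorted(arguments.items())))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return "repetition"
        return "ok"


guard = LoopGuard(max_tool_calls=5, max_repeats=2)
statuses = [guard.check("search", {"q": "grid stability"}) for _ in range(3)]
```

A caller could respond to `repetition` by forcing a recovery step (for example, summarizing state and replanning) and to `budget_exceeded` by ending the episode.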

Glossary

  • Agent horizon: The temporal extent over which an agent plans, acts, observes, and adapts across many steps. "reaches trillion-parameter-level performance by scaling the agent horizon."
  • Agentic harness: A tooling framework that manages executable attempts as a tree of solution nodes for iterative agent development. "Trajectories are generated in an agentic harness that grows a tree of executable solution nodes,"
  • Agentic model: A model designed to operate as an autonomous agent with planning, tool use, and iterative decision-making. "we introduce Agents-A1, a 35B MoE agentic model designed to address the key challenges mentioned above."
  • Agentic trajectories: Recorded sequences of decisions, tool calls, observations, and verifications generated by an agent during long-horizon tasks. "producing agentic trajectories with an average length of 45K tokens."
  • Context compaction: Summarizing earlier steps into a smaller digest to manage very long contexts during multi-turn processes. "context compaction summarizes earlier steps into a digest."
  • Coverage (student-side coverage): The fraction of the student’s probability mass that falls within the teacher-selected token support during alignment. "We therefore monitor the student-side coverage"
  • Cross-step credit assignment: Assigning learning signal to intermediate decisions across a trajectory based on later outcomes. "enabling cross-step credit assignment and reproducible long-horizon supervision."
  • Dependency graph: A graph encoding executable dependencies among tools, states, and resources used to synthesize valid multi-step tasks. "Task synthesis is formulated as constrained graph search over this dependency graph"
  • Domain-aware aggregation: Combining supervision signals while accounting for per-domain balance to avoid dominance by frequent or high-loss domains. "routed teacher guidance, salient vocabulary alignment, and domain-aware aggregation,"
  • Domain-normalized objective: A loss that averages within each active domain and then across domains to balance multi-domain training. "we aggregate SVA losses with a domain-normalized objective, averaging within each active domain and then across active domains,"
  • Domain-routed on-policy distillation (OPD): Distilling from domain-specific teachers while supervising only on student-generated rollouts, routed by domain. "we propose a domain-routed on-policy distillation (OPD) with salient vocabulary alignment"
  • Graph-compositional task synthesis: Constructing tasks by composing connected subgraphs of tool and state dependencies to ensure executability and grounding. "graph-compositional task synthesis, and solvability assessment."
  • GRPO (Group Relative Policy Optimization): A reinforcement learning policy optimization algorithm used to fine-tune agent behavior with reward signals. "We adopt GRPO (Shao et al., 2024) as our RL algorithm."
  • Hard domain routing: Supervising each training sample with only its corresponding domain-specific teacher rather than mixing teachers. "We use hard domain routing, where each sample is supervised only by the teacher trained for its domain,"
  • Knowledge-action graph (KAG): A structured representation linking evidence, actions, observations, and verifier outcomes to capture process-level supervision. "organized into a knowledge-action graph (KAG) that records evidence, actions, observations, and verifier outcomes."
  • Knowledge-action infrastructure: A system that connects external knowledge, tools, actions, and verification signals to produce verifiable long-horizon supervision. "we construct a knowledge-action infrastructure that converts heterogeneous corpora into compositional, verifiable, and self-extending supervision,"
  • LLM judge: An LLM used as an evaluator to assess the correctness or quality of model outputs. "We employ an LLM judge to evaluate whether the model's final answer is correct."
  • Mixture-of-Experts (MoE): A model architecture that routes inputs to different expert subnetworks to improve efficiency and specialization. "We introduce Agents-A1, a 35B Mixture-of-Experts Agentic Model"
  • Process-level supervision: Training supervision that targets intermediate reasoning and actions, not just final answers. "providing process-level supervision beyond final answers,"
  • Proposer–solver–verifier game: A self-play data-generation loop where tasks are proposed, solved, and verified to expand high-quality supervision. "we expand $\mathcal{G}_d$ through a proposer–solver–verifier game:"
  • Reverse KL (truncated reverse KL): A divergence objective comparing student to teacher distributions, restricted to a subset of salient tokens. "The per-sample SVA objective is the truncated reverse KL over this salient support,"
  • Rollout student: The student model that generates on-policy outputs used for distillation or training supervision. "a frozen rollout student samples $y_i\sim\pi_{\theta_s}(\cdot\mid x_i)$,"
  • Routed teacher: The specific domain teacher assigned to supervise a sample based on its domain label. "the routed teacher $\theta_{t,i}\triangleq\theta_t^{d_i}$."
  • Salient Vocabulary Alignment (SVA): Aligning student and teacher probabilities over a compact, teacher-selected set of high-probability tokens. "with salient vocabulary alignment (SVA)."
  • Sample packing: Concatenating multiple short training examples into a single long sequence to improve throughput and reduce padding. "we adopt a sample packing strategy that concatenates multiple short examples into a single training sequence"
  • Schema-grounded calls: Tool invocations constrained and validated by formal tool schemas during interaction. "consists of schema-grounded calls and clarification actions;"
  • Self-play: Automated iterative data generation where the model (or models) interact in roles to create and verify training trajectories. "A tool-augmented self-play loop expands the KAG into domain-specific sub-KAGs"
  • Sub-KAGs: Domain-specific subgraphs derived from the overall knowledge-action graph for focused training and tasks. "sub-KAGs for coding, agentic reasoning, instruction following, MLE, and scientific reasoning."
  • Tool Sandbox: A controlled environment that exposes selected tools and maintains evolving state for safe, reproducible tool interactions. "Trajectory generation is performed in a Tool Sandbox,"
  • Tool-augmented reasoning: Reasoning that leverages external tools (e.g., search, code, scholar) to obtain evidence or compute results. "we construct both no-tool and tool-augmented reasoning trajectories"
  • Top-k valid tokens: The set of highest-probability tokens under the teacher used as the alignment support in SVA. "be the set of top-$k$ valid tokens under the routed teacher distribution."
  • Verifier-guided graph search: Exploring candidate trajectories with verifier signals to select valid, grounded, and successful runs. "This process is treated as verifier-guided graph search over the trajectory space."
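Several of the distillation terms above (top-k valid tokens, truncated reverse KL, student-side coverage, domain-normalized objective) fit together as shown in the minimal sketch below. It handles a single token position with plain probability lists. All function names are hypothetical, and leaving the distributions unrenormalized over the support is an assumption; the paper's exact formulation may differ.

```python
import math
from collections import defaultdict


def top_k_support(teacher_probs, k):
    """Indices of the k highest-probability tokens under the teacher."""
    order = sorted(range(len(teacher_probs)),
                   key=lambda v: teacher_probs[v], reverse=True)
    return order[:k]


def coverage(student_probs, support):
    """Student probability mass that falls inside the teacher-selected support."""
    return sum(student_probs[v] for v in support)


def truncated_reverse_kl(student_probs, teacher_probs, support, eps=1e-12):
    """Reverse KL summed only over the support (no renormalization: an assumption)."""
    return sum(
        student_probs[v]
        * (math.log(student_probs[v] + eps) - math.log(teacher_probs[v] + eps))
        for v in support
    )


def domain_normalized_loss(sample_losses, sample_domains):
    """Mean within each active domain, then mean across active domains."""
    by_domain = defaultdict(list)
    for loss, domain in zip(sample_losses, sample_domains):
        by_domain[domain].append(loss)
    domain_means = [sum(v) / len(v) for v in by_domain.values()]
    return sum(domain_means) / len(domain_means)


teacher = [0.5, 0.3, 0.15, 0.05]   # routed teacher distribution (toy values)
student = [0.4, 0.4, 0.1, 0.1]     # student distribution at the same position
support = top_k_support(teacher, k=2)
cov = coverage(student, support)
loss = truncated_reverse_kl(student, teacher, support)
balanced = domain_normalized_loss([1.0, 3.0, 10.0], ["code", "code", "sci"])
```

In the toy aggregation, the two `code` samples average to 2.0 and the single `sci` sample contributes 10.0, so the domain-normalized value is 6.0, whereas a plain mean over samples would be about 4.67. This is the sense in which the objective keeps frequent domains from dominating.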
