Inference-Time Agent Selection

Updated 9 February 2026
  • Inference-time agent selection is a dynamic framework that chooses and adapts agentic policies during inference rather than relying on static, pre-trained models.
  • It utilizes meta-strategy induction, dynamic memory integration, and bandit algorithms to select the best ensemble of agents based on real-time feedback and online experience.
  • This approach enhances accuracy, reduces computational cost, and improves adaptability across diverse AI tasks and problem domains.

Inference-time agent selection refers to a class of methodologies and mechanisms that select, compose, or adapt agentic policies or entire computational strategies—such as LLM calls, tool chains, or hybrid workflows—dynamically at inference time rather than statically at train time. This enables AI systems to optimize for correctness, efficiency, and adaptability in response to novel inputs and the accumulation of online experience. Unlike model fine-tuning or offline retraining, inference-time selection operates over ensembles of agents, strategies, or algorithms and leverages dynamic evaluation, meta-reasoning, or reinforcement signals, often with persistent memory and feedback integration. State-of-the-art approaches span domains from LLM meta-strategy learning to graph-based tool routing.

1. Formalization and Canonical Architectures

The core abstraction is a meta-strategy or selection policy

$$\pi: X \times M \to \Delta(\mathcal{S})$$

where $X$ is the task/query space, $M$ is system memory (e.g., history, experience buffer), and $\mathcal{S}$ is the space of candidate agents or strategies. For a given $x \in X$ and experience $M$, the policy returns a distribution (or a deterministic selection) over $\mathcal{S}$. Key architectures employing this formalism include (a minimal interface sketch follows the list below):

  • EGuR (Experience-Guided Reasoner) frames agent selection as meta-strategy induction, where an LLM-based Guide generates a small batch of candidate strategies conditioned on $(x, M)$, and a Consolidator updates $M$ with execution outcomes, enabling online synthesis of hybrid procedures involving prompts, sampling parameters, tool configurations, and control logic (Stein et al., 14 Nov 2025).
  • MetaOrch formalizes agent selection in multi-agent systems as a neural classification problem with a supervised orchestrator network that maps contextualized task-agent representations to selection scores, leveraging soft (“fuzzy”) evaluation signals during training (Agrawal et al., 3 May 2025).
  • PlanGEN employs a modular selection agent built around a modified UCB bandit rule, blending LLM-guided priors and empirical rewards for pipeline-level algorithm selection (e.g., Best-of-N, Tree-of-Thought) in planning and reasoning domains (Parmar et al., 22 Feb 2025).
  • DyLAN derives “Agent Importance Scores” by propagating peer ratings through the agent network in an unsupervised backward aggregation, enabling unsupervised agent selection based on trial-based contribution metrics (Liu et al., 2023).
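
The shared abstraction above can be made concrete as a small interface. The following Python sketch is illustrative only: the names (`SelectionPolicy`, `Strategy`, `score`) are hypothetical and not drawn from any of the cited systems; it simply shows a policy that maps a query plus memory to a distribution over candidate strategies, as in the formalism $\pi: X \times M \to \Delta(\mathcal{S})$.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Strategy:
    """A candidate agent/strategy: prompt, sampling params, tools, control logic."""
    name: str
    run: Callable[[str], str]  # executes the strategy on a query


@dataclass
class SelectionPolicy:
    """Hypothetical pi(x, M): maps (query, memory) to a distribution over strategies."""
    candidates: List[Strategy]
    score: Callable[[str, dict, Strategy], float]  # higher = more suitable
    temperature: float = 1.0

    def distribution(self, query: str, memory: dict) -> Dict[str, float]:
        # Softmax over suitability scores yields Delta(S), a distribution over strategies.
        logits = {s.name: self.score(query, memory, s) / self.temperature
                  for s in self.candidates}
        z = sum(math.exp(v) for v in logits.values())
        return {k: math.exp(v) / z for k, v in logits.items()}

    def select(self, query: str, memory: dict) -> Strategy:
        # Sample a strategy; a deterministic variant would take the argmax instead.
        dist = self.distribution(query, memory)
        names, probs = zip(*dist.items())
        chosen = random.choices(names, weights=probs, k=1)[0]
        return next(s for s in self.candidates if s.name == chosen)
```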

2. Algorithms and Selection Criteria

Inference-time agent selection employs a spectrum of algorithms tailored to the granularity and setting:

  • Population-based/batch selection (EGuR, TUMIX): A meta-policy generates $k$ candidate strategies for the current task, executes them in parallel, and applies an explicit reward- or feedback-based selection rule. In EGuR, Group Relative Policy Optimization (GRPO)-style credit is assigned using correctness/cost feedback:

$$r_i = \sum_{j \ne i} \left[\mathbb{1}\{f_i > f_j\} - \mathbb{1}\{f_i < f_j\}\right] - \lambda\,(c_i - \bar{c})$$

The Guide module is conditioned on retrieved prior strategies and notes from memory, enabling adaptation over the entire strategy space (Stein et al., 14 Nov 2025).

  • Fuzzy and listwise strategies (MetaOrch): Candidate agents’ output quality is evaluated along completeness, relevance, and confidence axes, yielding a soft label $y_\text{soft}$ which supervises a neural selector. At inference, selection is by highest predicted suitability (Agrawal et al., 3 May 2025).
  • Bandit and UCB-based adaptive selection (PlanGEN): The selection agent scores each candidate algorithm $a$ with a modified UCB rule:

$$\mathrm{UCB}(a) = \frac{R(a)}{N(a)} + \sqrt{\frac{2 \ln (T+1)}{N(a)}} + \lambda_{\rm prior}\,\mathrm{Prior}(a) + \alpha_{\rm div}\,\frac{1}{N(a)+1} + \alpha_{\rm rec}\, S_{\mathrm{recovery}}(a)$$

balancing exploitation, exploration, diversity, recovery after bad runs, and LLM-estimated suitability (Parmar et al., 22 Feb 2025). A minimal sketch of this rule, together with the group-relative credit above, follows the list.

  • Graph-based and structured retrieval (Agent-as-a-Graph, AutoTool): Agent or tool selection is realized as node retrieval in a bipartite knowledge graph indexed with embeddings, optionally augmented by type-aware weighted reciprocal rank fusion and explicit tool-ownership edge traversals for agent disambiguation (Nizar et al., 22 Nov 2025). In tool-focused settings, AutoTool leverages empirically observed tool usage inertia, traversing a learned tool/parameter transition graph to minimize LLM query overhead (Jia et al., 18 Nov 2025).
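
The two selection criteria above can be written directly from their formulas. The sketch below is a hedged illustration: variable names (`fitness`, `cost`, `recovery`) and the default coefficients are assumptions, not values from the cited papers. The first function computes the group-relative credit $r_i$ used in EGuR-style batch selection; the second computes the modified UCB score used for PlanGEN-style algorithm selection.

```python
import math
from typing import List


def group_relative_credit(fitness: List[float], cost: List[float],
                          lam: float = 0.1) -> List[float]:
    """r_i = sum_{j != i} [1{f_i > f_j} - 1{f_i < f_j}] - lambda * (c_i - c_bar)."""
    c_bar = sum(cost) / len(cost)
    credits = []
    for i, f_i in enumerate(fitness):
        wins = sum(1 for j, f_j in enumerate(fitness) if j != i and f_i > f_j)
        losses = sum(1 for j, f_j in enumerate(fitness) if j != i and f_i < f_j)
        credits.append(wins - losses - lam * (cost[i] - c_bar))
    return credits


def modified_ucb(reward_sum: float, n_pulls: int, t: int, prior: float,
                 recovery: float, lam_prior: float = 1.0,
                 alpha_div: float = 0.5, alpha_rec: float = 0.5) -> float:
    """UCB(a) with LLM prior, diversity bonus, and recovery term (illustrative coefficients)."""
    if n_pulls == 0:
        return float("inf")                # force at least one trial per algorithm
    exploit = reward_sum / n_pulls         # R(a) / N(a)
    explore = math.sqrt(2.0 * math.log(t + 1) / n_pulls)
    diversity = alpha_div / (n_pulls + 1)
    return exploit + explore + lam_prior * prior + diversity + alpha_rec * recovery
```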

3. Dynamic Memory, Feedback Integration, and Online Learning

Inference-time selection frameworks typically maintain non-ephemeral state across problem instances, accumulating and abstracting execution traces, feedback, and heuristic notes. Representative implementations include:

  • EGuR’s memory $M$, comprising a strategy library $L = \{(\phi(x'), s')\}$ mapping problem signatures (e.g., embeddings) to effective strategies, and general notes $N$ of operational heuristics. The experience memory is updated post hoc by the Consolidator, which synthesizes abstracted rules for future Guide conditioning (Stein et al., 14 Nov 2025).
  • Persistent state in evolutionary search captures the growing set of proposals, coverage statistics, robustness signals, and rewards. The controller conditions agent selection and generation on the full historical state rather than only recent context, avoiding the Markovian limitation observed in naive ablations (Lalan et al., 8 Oct 2025).
  • Verifier-guided iterative evaluation (IAD): Systems like IAD dynamically interleave proposal generation, reward evaluation, and feedback-based prompt construction, explicitly conditioning further iterations on both the best and worst previously observed responses, enforcing non-decreasing solution quality (Chakraborty et al., 2 Apr 2025).

Memory curation usually involves abstraction and selective retention, discarding stale or infrequently retrieved entries, and (in EGuR) caching effective strategies for future retrieval.
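
As a concrete illustration of this kind of memory curation, the sketch below implements an embedding-keyed strategy library with retrieval by cosine similarity and pruning of stale, rarely retrieved entries. It is a minimal sketch under assumed names (`StrategyLibrary`, `signature`), not the EGuR data structure itself.

```python
import time
from dataclasses import dataclass, field
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb + 1e-12)


@dataclass
class Entry:
    signature: List[float]   # embedding phi(x') of the problem the strategy solved
    strategy: dict           # prompt, sampling params, tool config, control logic
    hits: int = 0
    last_used: float = field(default_factory=time.time)


class StrategyLibrary:
    """Hypothetical L = {(phi(x'), s')}: retrieve by similarity, prune stale entries."""

    def __init__(self, max_entries: int = 1000):
        self.entries: List[Entry] = []
        self.max_entries = max_entries

    def add(self, signature: List[float], strategy: dict) -> None:
        self.entries.append(Entry(signature, strategy))
        if len(self.entries) > self.max_entries:
            self.prune()

    def retrieve(self, query_sig: List[float], k: int = 3) -> List[dict]:
        # Return the k stored strategies whose problem signatures best match the query.
        scored = sorted(self.entries, key=lambda e: cosine(query_sig, e.signature),
                        reverse=True)[:k]
        for e in scored:
            e.hits += 1
            e.last_used = time.time()
        return [e.strategy for e in scored]

    def prune(self) -> None:
        # Keep the most frequently and most recently retrieved entries; drop the rest.
        self.entries.sort(key=lambda e: (e.hits, e.last_used), reverse=True)
        self.entries = self.entries[: self.max_entries]
```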

4. Impact of Agent Diversity, Ensemble Strategies, and Early Stopping

Several empirical studies converge on the importance of maintaining high agent diversity and dynamic ensemble selection:

  • TUMIX demonstrates superior coverage and accuracy scaling with a mixture of 15–30 agents, each employing distinct tool/prompt/logic combinations. LLM-driven auto-design further amplifies diversity, with random subset selection and majority-voting delivering up to 3.55% accuracy gains over best single-agent/ensemble baselines (Chen et al., 30 Sep 2025).
  • Majority-vote aggregation, halting by LLM-as-Judge, and ensemble pruning emerge as recurring mechanisms: TUMIX leverages a judge agent for dynamic early stopping, reducing inference cost to 49% of fixed-round baselines without loss of performance. Ablations confirm that higher agent modality diversity (e.g., text+code+search) yields additive gains, and that scaling the number of unique agent types saturates beyond 12–15 (Chen et al., 30 Sep 2025). A minimal sketch of this vote-and-judge loop follows the list.
  • Dynamic selection versus static ensembles: Static heuristic ensembles (e.g., single prompt per tool) cannot adapt or improve from experience, whereas selection policies that leverage memory, online scoring, and dynamic feedback achieve both higher mean performance and steeper improvement curves as experience accumulates (EGuR, TUMIX) (Stein et al., 14 Nov 2025, Chen et al., 30 Sep 2025).
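
The vote-and-judge loop referenced above can be sketched as follows. This is a simplified, hedged illustration rather than the TUMIX implementation: the judge interface, the round limit, and the fact that agents are simply re-run each round (rather than refined on prior answers) are assumptions. Each round runs a diverse agent pool, majority-votes the accumulated answers, and asks a judge (e.g., an LLM-as-Judge call) whether the consensus is strong enough to stop early.

```python
from collections import Counter
from typing import Callable, List, Tuple


def vote_with_judge(
    agents: List[Callable[[str], str]],        # diverse tool/prompt/logic combinations
    judge: Callable[[str, List[str]], bool],   # returns True if answers are conclusive
    query: str,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run agent rounds, majority-vote the answers, and let a judge halt early."""
    history: List[str] = []
    consensus = ""
    for round_idx in range(1, max_rounds + 1):
        answers = [agent(query) for agent in agents]   # could run in parallel
        history.extend(answers)
        consensus, _ = Counter(history).most_common(1)[0]
        # Judge-based early stopping: halt as soon as the judge deems answers conclusive.
        if judge(query, history):
            return consensus, round_idx
    return consensus, max_rounds
```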

5. Empirical Performance and Limitations

Key quantitative results, along with system limitations, are as follows:

| System | Main Task / Domain | Accuracy Improvement | Cost Reduction | Key Limitations |
| --- | --- | --- | --- | --- |
| EGuR | 3-SAT, AIME, BBH | 14% rel. ↑ (3-SAT 96% vs 77%) | up to 111× | Needs oracle verifier; zero-shot LLM prompts |
| MetaOrch | Simulated MAS domains | 86.3% selection accuracy | – | Needs fuzzy evaluation; specialist–generalist ambiguity |
| TUMIX | Reasoning, hybrid QA | up to +3.55% (ensemble) | 51% cost saved | Diminishing returns >15 agents |
| AutoTool | ScienceWorld, AlfWorld | 0.394→0.531 (AlfWorld PR) | 10–40% token-in, 30–65% out | Cold start; open-ended tasks less effective |
| PlanGEN SelectionAgent | Calendar, PlanBench, QA | +3–5 EM points | – | No end-to-end training; UCB hyperparameters by grid search |
| DyLAN | MMLU, HumanEval | up to +25.0% (subject-specific), +9.7% codegen | 36.6% API calls vs debate | Team optimization is coarse; agent pool static |

All systems report accuracy increases alongside resource efficiency or scalability improvements when dynamic, feedback-driven, or meta-learned agent selection is employed at inference (Stein et al., 14 Nov 2025, Agrawal et al., 3 May 2025, Chen et al., 30 Sep 2025, Jia et al., 18 Nov 2025, Parmar et al., 22 Feb 2025, Liu et al., 2023). Limitations include reliance on ground-truth verifiers, zero-shot prompt engineering in new domains, cold-start memory issues, and the lack of fully end-to-end trainable selectors in some modular designs.

6. Extensions and Future Research Directions

Promising directions, as identified across recent work, include:

  • Generalization to multi-agent/multi-model scenarios: Extending inference-time selection to arbitrary orchestration over both models and agents, as in MoMA’s mixture-of-experts context-aware router (Guo et al., 9 Sep 2025).
  • Fine-grained feedback and RL-driven optimization: Replacing hard binary feedback with scalar LLM critics and incorporating policy gradient or actor-critic signals in the selection policy update loop (Stein et al., 14 Nov 2025, Agrawal et al., 3 May 2025).
  • Structural indexing and faster retrieval: Large-scale experience memories may be clustered, vector-indexed, and sharded for low-latency agent recall (Nizar et al., 22 Nov 2025).
  • Adaptive stateful search: Persistent inference-time state in evolutionary search controllers (i.e., non-Markovian, archive-aware) is critical for deep coverage and robustness (Lalan et al., 8 Oct 2025).
  • Plug-and-play extensibility: Modular frameworks accommodate rapid agent registration/removal, dynamic routing strategies, and robust masking to prevent invalid invocation (Guo et al., 9 Sep 2025, Nizar et al., 22 Nov 2025).
  • Online-to-offline transfer: Pretraining Guide modules on synthetic corpora before online adaptation, or continuous team re-optimization (Stein et al., 14 Nov 2025, Liu et al., 2023).

7. Comparative Perspective and Best Practices

Across the literature, inference-time agent selection subsumes and extends static prompt engineering, “text steering,” ensemble majority-vote, and offline meta-learning:

  • Static heuristics adjust only input prompts and cannot adapt control flow, sampling, or tool configuration (Stein et al., 14 Nov 2025).
  • Memory-based text steering enables limited prompt augmentation but cannot revise agent logic (Stein et al., 14 Nov 2025).
  • Graph-based structural selection (Agent-as-a-Graph, AutoTool) achieves fine-grained tool/agent matching by explicitly encoding agent–tool relationships and observed transition patterns, outperforming vector-only or stateless retrieval approaches (Nizar et al., 22 Nov 2025, Jia et al., 18 Nov 2025); a minimal retrieval sketch follows this list.
  • Bandit and UCB-based selectors provide a principled balance of exploration and exploitation, integrating LLM priors and empirical runtime feedback (Parmar et al., 22 Feb 2025).
  • Unsupervised backward aggregation and peer rating (DyLAN) offer a scalable, label-free mechanism for quantifying agent contributions in multi-agent architectures (Liu et al., 2023).
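
To make the graph-based selection above concrete, the sketch below retrieves agents from a small bipartite agent–tool graph: tools are matched by embedding similarity, and ownership edges are traversed to score the agents that expose those tools. The function names, similarity interface, and reciprocal-rank fusion constant are assumptions for illustration, not the Agent-as-a-Graph or AutoTool implementations.

```python
from typing import Callable, Dict, List, Tuple


def retrieve_agent(
    query_emb: List[float],
    tool_embs: Dict[str, List[float]],                       # tool name -> embedding
    owns: Dict[str, List[str]],                              # agent name -> tools it exposes
    similarity: Callable[[List[float], List[float]], float], # e.g., cosine similarity
    top_k_tools: int = 5,
) -> List[Tuple[str, float]]:
    """Rank agents by traversing ownership edges from the best-matching tool nodes."""
    # 1. Match the query against the tool nodes of the bipartite graph.
    tool_scores = sorted(
        ((name, similarity(query_emb, emb)) for name, emb in tool_embs.items()),
        key=lambda x: x[1], reverse=True,
    )[:top_k_tools]

    # 2. Reciprocal-rank style fusion: each matched tool votes for the agents that own it.
    agent_scores: Dict[str, float] = {}
    for rank, (tool, _) in enumerate(tool_scores, start=1):
        for agent, tools in owns.items():
            if tool in tools:
                agent_scores[agent] = agent_scores.get(agent, 0.0) + 1.0 / (60 + rank)

    return sorted(agent_scores.items(), key=lambda x: x[1], reverse=True)
```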

Practitioners are advised to combine persistent memory, meta-strategy induction, dynamic feedback integration, and fine-grained retrieval or bandit algorithms to maximize both adaptivity and efficiency in deployment, with careful attention to verifier accuracy and domain-specific feedback design (Stein et al., 14 Nov 2025, Chakraborty et al., 2 Apr 2025).
