Inference-Time Agent Selection
- Inference-time agent selection is a dynamic framework that chooses and adapts agentic policies during inference rather than relying on static, pre-trained models.
- It utilizes meta-strategy induction, dynamic memory integration, and bandit algorithms to select the best ensemble of agents based on real-time feedback and online experience.
- This approach enhances accuracy, reduces computational cost, and improves adaptability across diverse AI tasks and problem domains.
Inference-time agent selection refers to a class of methodologies and mechanisms that select, compose, or adapt agentic policies or entire computational strategies—such as LLM calls, tool chains, or hybrid workflows—dynamically at inference time rather than statically at training time. This enables AI systems to optimize for correctness, efficiency, and adaptability in response to novel inputs and the accumulation of online experience. Unlike model fine-tuning or offline retraining, inference-time selection operates over ensembles of agents, strategies, or algorithms and leverages dynamic evaluation, meta-reasoning, or reinforcement, often with persistent memory and feedback integration. State-of-the-art approaches span domains from LLM meta-strategy learning to graph-based tool routing.
1. Formalization and Canonical Architectures
The core abstraction is a meta-strategy or selection policy

$$\pi : \mathcal{X} \times \mathcal{M} \to \Delta(\mathcal{A}),$$

where $\mathcal{X}$ is the task/query space, $\mathcal{M}$ is system memory (e.g., history, experience buffer), and $\mathcal{A}$ is the space of candidate agents or strategies. For a given $x \in \mathcal{X}$ and experience $m \in \mathcal{M}$, the policy returns a distribution (or deterministic selection) over $\mathcal{A}$; a minimal interface sketch follows the list of architectures below. Key architectures employing this formalism include:
- EGuR (Experience-Guided Reasoner) frames agent selection as meta-strategy induction, where an LLM-based Guide generates a small batch of candidate strategies conditioned on $(x, m)$, and a Consolidator updates $m$ with execution outcomes, enabling online synthesis of hybrid procedures involving prompts, sampling parameters, tool configurations, and control logic (Stein et al., 14 Nov 2025).
- MetaOrch formalizes agent selection in multi-agent systems as a neural classification problem with a supervised orchestrator network that maps contextualized task-agent representations to selection scores, leveraging soft (“fuzzy”) evaluation signals during training (Agrawal et al., 3 May 2025).
- PlanGEN employs a modular selection agent built around a modified UCB bandit rule, blending LLM-guided priors and empirical rewards for pipeline-level algorithm selection (e.g., Best-of-N, Tree-of-Thought) in planning and reasoning domains (Parmar et al., 22 Feb 2025).
- DyLAN derives "Agent Importance Scores" by propagating peer ratings backward through the agent network, enabling unsupervised agent selection based on trial-based contribution metrics (Liu et al., 2023).
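To make the abstraction concrete, the following is a minimal Python sketch of the $\pi : \mathcal{X} \times \mathcal{M} \to \Delta(\mathcal{A})$ interface. All names (`SelectionPolicy`, `Memory`, `Agent`) and the uniform placeholder scoring are illustrative assumptions, not APIs from any cited system:

```python
# Minimal sketch of the selection-policy abstraction pi: X x M -> Delta(A).
# All names and the uniform placeholder scoring are illustrative, not from
# any cited system.
from dataclasses import dataclass, field
from typing import Callable

# A candidate "agent" is any callable strategy: task -> answer.
Agent = Callable[[str], str]

@dataclass
class Memory:
    """System memory M: raw execution traces plus abstracted notes."""
    traces: list = field(default_factory=list)
    notes: list = field(default_factory=list)

class SelectionPolicy:
    """pi : X x M -> distribution over the candidate pool A."""

    def __init__(self, agents: list[Agent]):
        self.agents = agents

    def distribution(self, task: str, memory: Memory) -> list[float]:
        # Placeholder: uniform scores. Real systems condition on task
        # features and retrieved experience (e.g., EGuR's Guide or
        # MetaOrch's supervised orchestrator network).
        scores = [1.0 for _ in self.agents]
        total = sum(scores)
        return [s / total for s in scores]

    def select(self, task: str, memory: Memory) -> Agent:
        # Deterministic variant: argmax over the distribution.
        probs = self.distribution(task, memory)
        return self.agents[max(range(len(probs)), key=probs.__getitem__)]
```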
2. Algorithms and Selection Criteria
Inference-time agent selection employs a spectrum of algorithms tailored to the granularity and setting:
- Population-based/batch selection (EGuR, TUMIX): A meta-policy generates candidate strategies for the current task, executes them in parallel, and applies an explicit reward- or feedback-based selection rule. In EGuR, group-relative credit in the style of Group Relative Policy Optimization is assigned using correctness/cost feedback, taking the standard group-normalized form

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_N)}{\operatorname{std}(r_1, \dots, r_N)},$$

where $r_i$ combines correctness and cost feedback for candidate $i$ in a batch of $N$. The Guide module is conditioned on retrieved prior strategies and notes from memory, enabling adaptation over the entire strategy space (Stein et al., 14 Nov 2025); a minimal credit-assignment sketch follows this list.
- Fuzzy and listwise strategies (MetaOrch): Candidate agents’ output quality is evaluated along completeness, relevance, and confidence axes, yielding a soft label which supervises a neural selector. At inference, selection is by highest predicted suitability (Agrawal et al., 3 May 2025).
- Bandit and UCB-based adaptive selection (PlanGEN): The agent's action follows a modified UCB rule, schematically

$$a^* = \arg\max_{a \in \mathcal{A}} \left[ \bar{r}_a + c \sqrt{\frac{\ln T}{n_a}} + \beta\, s_{\mathrm{LLM}}(a) \right],$$

where $\bar{r}_a$ is the empirical mean reward of strategy $a$, $n_a$ its selection count, $T$ the total number of selections, and $s_{\mathrm{LLM}}(a)$ an LLM-estimated suitability prior; further modifications balance exploitation, exploration, diversity, and recovery after bad runs (Parmar et al., 22 Feb 2025). A selector sketch in this style also follows the list.
- Graph-based and structured retrieval (Agent-as-a-Graph, AutoTool): Agent or tool selection is realized as node retrieval in a bipartite knowledge graph indexed with embeddings, optionally augmented by type-aware weighted reciprocal rank fusion and explicit tool-ownership edge traversals for agent disambiguation (Nizar et al., 22 Nov 2025). In tool-focused settings, AutoTool leverages empirically observed tool usage inertia, traversing a learned tool/parameter transition graph to minimize LLM query overhead (Jia et al., 18 Nov 2025).
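As referenced above, the following is a hedged sketch of group-relative credit assignment over a batch of executed candidate strategies. The correctness/cost weighting in `reward` is an assumption; EGuR's exact rule may differ:

```python
# Hedged sketch of group-relative credit over a batch of executed
# strategies. The correctness/cost weighting in `reward` is an assumption;
# EGuR's exact rule may differ.
import statistics

def reward(correct: bool, cost: float, cost_weight: float = 0.01) -> float:
    """Toy reward combining correctness with an (assumed) cost penalty."""
    return (1.0 if correct else 0.0) - cost_weight * cost

def group_relative_credit(rewards: list[float]) -> list[float]:
    """A_i = (r_i - mean(r)) / std(r), normalized within the batch."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard zero variance
    return [(r - mu) / sigma for r in rewards]

# Example: three candidate strategies executed in parallel on one task.
outcomes = [(True, 12.0), (False, 3.0), (True, 40.0)]  # (correct, cost)
credits = group_relative_credit([reward(c, k) for c, k in outcomes])
```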
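And a minimal sketch of a modified UCB selector in the spirit of PlanGEN's selection agent. The LLM-prior term and its weight `beta` are assumptions; the paper's full rule adds further diversity and recovery terms:

```python
# Hedged sketch of a modified UCB selector in the spirit of PlanGEN's
# selection agent. The LLM-prior term and its weight `beta` are
# assumptions; the paper's full rule adds diversity/recovery terms.
import math

class UCBSelector:
    def __init__(self, algorithms: list[str], c: float = 1.4, beta: float = 0.5):
        self.algorithms = algorithms
        self.c = c        # exploration weight
        self.beta = beta  # weight on the LLM-estimated suitability prior
        self.counts = {a: 0 for a in algorithms}
        self.means = {a: 0.0 for a in algorithms}

    def select(self, llm_prior: dict[str, float]) -> str:
        t = sum(self.counts.values()) + 1

        def score(a: str) -> float:
            if self.counts[a] == 0:
                return float("inf")  # force each algorithm to be tried once
            explore = self.c * math.sqrt(math.log(t) / self.counts[a])
            return self.means[a] + explore + self.beta * llm_prior.get(a, 0.0)

        return max(self.algorithms, key=score)

    def update(self, a: str, reward: float) -> None:
        # Incremental mean update after observing the run's reward.
        self.counts[a] += 1
        self.means[a] += (reward - self.means[a]) / self.counts[a]
```

Forcing untried algorithms first via the infinite bonus is a standard UCB convention; hyperparameters $c$ and $\beta$ would be tuned per domain (PlanGEN reports grid search).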
3. Dynamic Memory, Feedback Integration, and Online Learning
Inference-time selection frameworks typically maintain non-ephemeral state across problem instances, accumulating and abstracting execution traces, feedback, and heuristic notes. Representative implementations include:
- EGuR's memory comprises a strategy library mapping problem signatures (e.g., embeddings) to effective strategies, plus general notes of operational heuristics. The experience memory is updated post-hoc by the Consolidator, which synthesizes abstracted rules for future Guide conditioning (Stein et al., 14 Nov 2025); a minimal library sketch follows below.
- Persistent state in evolutionary search captures the growing set of proposals, coverage statistics, robustness signals, and rewards. The controller conditions agent selection and generation on the full historical state rather than only the recent context, avoiding the Markovian limitation exposed by naive ablations (Lalan et al., 8 Oct 2025).
- Verifier-guided iterative evaluation (IAD): Systems like IAD dynamically interleave proposal generation, reward evaluation, and feedback-based prompt construction, explicitly conditioning further iterations on both the best and worst previously observed responses, enforcing non-decreasing solution quality (Chakraborty et al., 2 Apr 2025).
Memory curation usually involves abstraction and selective retention, discarding stale or infrequently retrieved entries, and (in EGuR) caching effective strategies for future retrieval.
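A minimal sketch of such a strategy library follows, assuming embedding-keyed entries retrieved by cosine similarity and evicted by retrieval frequency; the data layout and eviction rule are illustrative assumptions, not EGuR's exact implementation:

```python
# Hedged sketch of an EGuR-style strategy library: entries keyed by task
# embeddings, retrieved by cosine similarity, evicted by retrieval
# frequency. Data layout and eviction rule are illustrative assumptions.
import math
from dataclasses import dataclass

@dataclass
class Entry:
    embedding: list[float]
    strategy: str          # serialized strategy: prompt, params, tools, ...
    hits: int = 0

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class StrategyLibrary:
    def __init__(self, max_entries: int = 1000):
        self.entries: list[Entry] = []
        self.max_entries = max_entries

    def add(self, task_embedding: list[float], strategy: str) -> None:
        if len(self.entries) >= self.max_entries:
            # Selective retention: drop the least-retrieved entry.
            self.entries.remove(min(self.entries, key=lambda e: e.hits))
        self.entries.append(Entry(task_embedding, strategy))

    def retrieve(self, query_embedding: list[float], k: int = 3) -> list[str]:
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query_embedding, e.embedding),
                        reverse=True)[:k]
        for e in ranked:
            e.hits += 1  # track retrieval frequency for curation
        return [e.strategy for e in ranked]
```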
4. Impact of Agent Diversity, Ensemble Strategies, and Early Stopping
Several empirical studies converge on the importance of maintaining high agent diversity and dynamic ensemble selection:
- TUMIX demonstrates superior coverage and accuracy scaling with a mixture of 15–30 agents, each employing distinct tool/prompt/logic combinations. LLM-driven auto-design further amplifies diversity, with random subset selection and majority-voting delivering up to 3.55% accuracy gains over best single-agent/ensemble baselines (Chen et al., 30 Sep 2025).
- Majority-vote aggregation, halting by LLM-as-Judge, and ensemble pruning emerge as recurring mechanisms: TUMIX leverages a judge agent for dynamic early stopping, reducing inference cost to 49% of fixed-round baselines without loss of performance (see the aggregation sketch after this list). Ablations confirm that higher agent modality diversity (e.g., text+code+search) yields additive gains, and that gains from scaling the number of unique agent types saturate beyond 12–15 (Chen et al., 30 Sep 2025).
- Dynamic selection versus static ensembles: Static heuristic ensembles (e.g., single prompt per tool) cannot adapt or improve from experience, whereas selection policies that leverage memory, online scoring, and dynamic feedback achieve both higher mean performance and steeper improvement curves as experience accumulates (EGuR, TUMIX) (Stein et al., 14 Nov 2025, Chen et al., 30 Sep 2025).
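A hedged sketch of the voting-plus-early-stopping pattern described above; `judge_is_confident` stands in for an LLM-as-Judge call, and the round structure is a simplification of TUMIX rather than its exact protocol:

```python
# Hedged sketch of majority-vote aggregation with judge-based early
# stopping. `judge_is_confident` stands in for an LLM-as-Judge call; the
# round structure is a simplification of TUMIX, not its exact protocol.
from collections import Counter
from typing import Callable

def run_rounds(
    agents: list[Callable[[str], str]],
    task: str,
    judge_is_confident: Callable[[str, list[str]], bool],
    max_rounds: int = 3,
) -> str:
    answers: list[str] = []
    for _ in range(max_rounds):
        # In practice each round's agents run in parallel and may see
        # prior answers; here they are called sequentially for brevity.
        answers.extend(agent(task) for agent in agents)
        consensus, _ = Counter(answers).most_common(1)[0]
        # Dynamic early stopping: accept the consensus once the judge is
        # confident, saving the cost of remaining rounds.
        if judge_is_confident(consensus, answers):
            return consensus
    return Counter(answers).most_common(1)[0][0]
```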
5. Empirical Performance and Limitations
Key quantitative results, along with system limitations, are as follows:
| System | Main Task / Domain | Accuracy Improvement | Cost Reduction | Key Limitations |
|---|---|---|---|---|
| EGuR | 3-SAT, AIME, BBH | 14% rel. ↑ (3-SAT 96% vs 77%) | up to 111× | Needs oracle verifier; zero-shot LLM prompts |
| MetaOrch | Simulated MAS domains | 86.3% selection accuracy | — | Needs fuzzy evaluation; specialist–generalist ambiguity |
| TUMIX | Reasoning, hybrid QA | up to +3.55% (ensemble) | 51% cost saved | Diminishing returns >15 agents |
| AutoTool | ScienceWorld, AlfWorld | 0.394→0.531 (AlfWorld PR) | 10–40% input / 30–65% output tokens | Cold start; less effective on open-ended tasks |
| PlanGEN SelectionAgent | Calendar, PlanBench, QA | +3–5 EM points | — | No end-to-end training; UCB hyperparameters by grid search |
| DyLAN | MMLU, HumanEval | up to +25.0% (subject-specific), +9.7% codegen | 36.6% API calls vs debate | Coarse team optimization; static agent pool |
All systems report accuracy increases alongside resource efficiency or scalability improvements when dynamic, feedback-driven, or meta-learned agent selection is employed at inference (Stein et al., 14 Nov 2025, Agrawal et al., 3 May 2025, Chen et al., 30 Sep 2025, Jia et al., 18 Nov 2025, Parmar et al., 22 Feb 2025, Liu et al., 2023). Limitations include reliance on ground-truth verifiers, zero-shot prompt engineering in new domains, cold-start memory issues, and the lack of fully end-to-end trainable selectors in some modular designs.
6. Extensions and Future Research Directions
Promising directions, as identified across recent work, include:
- Generalization to multi-agent/multi-model scenarios: Extending inference-time selection to arbitrary orchestration over both models and agents, as in MoMA’s mixture-of-experts context-aware router (Guo et al., 9 Sep 2025).
- Fine-grained feedback and RL-driven optimization: Replacing hard binary feedback with scalar LLM critics and incorporating policy gradient or actor-critic signals in the selection policy update loop (Stein et al., 14 Nov 2025, Agrawal et al., 3 May 2025).
- Structural indexing and faster retrieval: Large-scale experience memories may be clustered, vector-indexed, and sharded for low-latency agent recall (Nizar et al., 22 Nov 2025).
- Adaptive stateful search: Persistent inference-time state in evolutionary search controllers (i.e., non-Markovian, archive-aware) is critical for deep coverage and robustness (Lalan et al., 8 Oct 2025).
- Plug-and-play extensibility: Modular frameworks accommodate rapid agent registration/removal, dynamic routing strategies, and robust masking to prevent invalid invocation (Guo et al., 9 Sep 2025, Nizar et al., 22 Nov 2025).
- Online-to-offline transfer: Pretraining Guide modules on synthetic corpora before online adaptation, or continuous team re-optimization (Stein et al., 14 Nov 2025, Liu et al., 2023).
7. Comparative Perspective and Best Practices
Across the literature, inference-time agent selection subsumes and extends static prompt engineering, “text steering,” ensemble majority-vote, and offline meta-learning:
- Static heuristics vary only the input prompts and cannot adapt control flow, sampling, or tool configuration (Stein et al., 14 Nov 2025).
- Memory-based text steering enables limited prompt augmentation but cannot revise agent logic (Stein et al., 14 Nov 2025).
- Graph-based structural selection (Agent-as-a-Graph, AutoTool) achieves fine-grained tool/agent matching by explicit encoding of agent–tool relationships and observed transition patterns, outperforming vector-only or stateless retrieval approaches (Nizar et al., 22 Nov 2025, Jia et al., 18 Nov 2025).
- Bandit and UCB-based selectors provide a principled balance of exploration and exploitation, integrating LLM priors and empirical runtime feedback (Parmar et al., 22 Feb 2025).
- Unsupervised backward aggregation and peer rating (DyLAN) offer a scalable, label-free mechanism for quantifying agent contributions in multi-agent architectures (Liu et al., 2023).
Practitioners are advised to combine persistent memory, meta-strategy induction, dynamic feedback integration, and fine-grained retrieval or bandit algorithms to maximize both adaptivity and efficiency in deployment, with careful attention to verifier accuracy and domain-specific feedback design (Stein et al., 14 Nov 2025, Chakraborty et al., 2 Apr 2025).