Bilevel Optimization of Agent Skills via Monte Carlo Tree Search
Published 17 Apr 2026 in cs.AI | (2604.15709v1)
Abstract: Agent skills are structured collections of instructions, tools, and supporting resources that help LLM agents perform particular classes of tasks. Empirical evidence shows that the design of skills can materially affect agent task performance, yet systematically optimizing skills remains challenging. Since a skill comprises instructions, tools, and supporting resources in a structured way, optimizing it requires jointly determining both the structure of these components and the content each component contains. This gives rise to a complex decision space with strong interdependence across structure and components. We therefore represent these two coupled decisions as skill structure and component content, and formulate skill optimization as a bilevel optimization problem. We propose a bilevel optimization framework in which an outer loop employs Monte Carlo Tree Search to determine the skill structure, while an inner loop refines the component content within the structure selected by the outer loop. In both loops, we employ LLMs to assist the optimization procedure. We evaluate the proposed framework on an open-source Operations Research Question Answering dataset, and the experimental results suggest that the bilevel optimization framework improves the performance of the agents with the optimized skill.
The paper introduces a bilevel optimization framework that separates discrete structural decisions, searched via MCTS, from content refinement within each candidate structure.
It employs LLM-guided techniques to propose and evaluate modifications, yielding statistically significant improvements on the ORQA benchmark.
The approach demonstrates robust skill engineering by effectively managing LLM stochasticity and optimizing high-dimensional agent configurations.
Bilevel Optimization of Agent Skills via Monte Carlo Tree Search: Technical Summary
Problem Formulation and Motivation
The paper "Bilevel Optimization of Agent Skills via Monte Carlo Tree Search" (2604.15709) introduces a systematic approach to optimizing LLM agent skills—structured bundles of instructions, scripts, metadata, and supporting resources that define agent behavior for particular task classes. The heterogeneous, interdependent, and combinatorial nature of skill design precludes straightforward optimization: structural decisions about included components interact tightly with the natural language and code that form their content, and the decision space is large and highly nonconvex.
The authors formalize skill optimization as a bilevel optimization problem, where the outer loop searches over discrete skill structures (directory layout, component presence/order), and the inner loop optimizes the content instantiations (instructions, scripts, references, assets) within any fixed structure. Both loops must address unstructured, discrete variables under practical constraints (spec compliance, token budgets). This hierarchical separation enables more efficient and targeted search by decomposing high-level architectural design from low-level content engineering.
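Concretely, writing θ for the discrete structure, ϕ for the content compatible with it, and R for downstream task reward (the symbol R is shorthand for this summary, not notation confirmed by the paper), the bilevel problem can be sketched as:

```latex
% Bilevel skill optimization (sketch; R denotes downstream task reward,
% a notational assumption for this summary).
\begin{aligned}
  \theta^{\ast} &= \operatorname*{arg\,max}_{\theta \in \Theta}\;
    R\bigl(\theta,\, \phi^{\ast}(\theta)\bigr)
    &&\text{(outer loop: structure search)}\\
  \text{s.t.}\quad \phi^{\ast}(\theta) &= \operatorname*{arg\,max}_{\phi \in \Phi(\theta)}\;
    R(\theta, \phi)
    &&\text{(inner loop: content refinement)}
\end{aligned}
```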
Figure 1: Bilevel Optimization Architecture for Agent Skill Modules.
Methodological Framework
Bilevel Optimization Procedure
Skill Representation: Skills are formalized as tuples S=(θ,ϕ), with θ encoding discrete structural decisions and ϕ representing all content compatible with θ (e.g., instruction Markdown, routing metadata, scripts).
Outer Loop (Structure Search): Employs Monte Carlo Tree Search (MCTS) to sequentially explore structure configurations. Each tree node encodes a candidate θ, with actions in the space of structural edits (add, remove, reorganize component types, etc.).
Inner Loop (Content Refinement): For a structure θ′, initial content ϕ₀(θ′) is computed via a "bridge" operation that repurposes reusable information from parent nodes. Refinement then occurs within a bounded budget, using family-specific routines matched to the edit type (e.g., instruction rewriting, script adjustment, section triage).
Evaluation: Each structure-content pair (θ′,ϕ′) is evaluated via downstream task execution (LLM agent on target benchmarks), with results propagated up the MCTS tree (reward, diagnostic metrics).
This separation provides explicit attribution of evaluation feedback to structural revisions and allows conservative, pessimistic content selection that mitigates LLM stochasticity and noisy evaluations.
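A schematic sketch of the loop described above is given below. The helper callables (select_child, propose_edit, bridge, refine_content, evaluate) are illustrative stand-ins for the paper's LLM-guided operations, not its actual interface:

```python
def bilevel_optimize(seed_theta, seed_phi,
                     select_child, propose_edit, bridge,
                     refine_content, evaluate,
                     iterations=50, inner_budget=3):
    """Sketch of the bilevel loop: MCTS over structures (outer loop),
    bounded content refinement within each structure (inner loop).
    All helper callables are caller-supplied assumptions."""
    root = {"theta": seed_theta, "phi": seed_phi,
            "visits": 0, "value": 0.0, "children": []}

    for _ in range(iterations):
        # Selection: descend the tree until reaching a leaf.
        node, path = root, [root]
        while node["children"]:
            node = select_child(node)
            path.append(node)

        # Expansion: an LLM proposes a structural edit (add/remove/reorganize).
        theta_new = propose_edit(node["theta"])
        # Bridge: repurpose reusable content from the parent node.
        phi0 = bridge(node["phi"], theta_new)
        # Inner loop: bounded, family-specific content refinement.
        phi_new = refine_content(theta_new, phi0, budget=inner_budget)

        child = {"theta": theta_new, "phi": phi_new,
                 "visits": 0, "value": 0.0, "children": []}
        node["children"].append(child)
        path.append(child)

        # Evaluation: run the agent on the search split, then backpropagate.
        reward = evaluate(theta_new, phi_new)
        for n in path:
            n["visits"] += 1
            n["value"] += reward

    best = max(root["children"],
               key=lambda c: c["value"] / max(c["visits"], 1))
    return best["theta"], best["phi"]
```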
Figure 2: Illustration of the MCTS search tree for Configuration B in the ORQA experiment. Nodes are labeled by the structure action and the resulting reward on the search split. The winning path and selected node are highlighted in blue, while weaker alternatives are shown in a faded style.
LLM-Guided Operations
Both the structural search and content refinement procedures leverage LLMs (openai/gpt-5.4 for orchestration, openai/gpt-5.2-codex for runtime evaluation) as meta-reasoners. MCTS expansion and action proposals involve multi-stage LLM analysis, combining current structural state, search history, evaluation diagnostics, and domain-specific skill priors to guide edits. Similarly, content refinement routines use LLM generations to propose, compare, and select among content modifications.
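An illustrative shape for one such proposal step is shown below; the prompt fields and the call_llm parameter are assumptions for this summary, not the paper's interface:

```python
import json

def propose_structure_edits(structure, history, diagnostics, priors, call_llm):
    """Ask an LLM meta-reasoner for candidate structural edits, conditioning on
    the current structure, search history, evaluation diagnostics, and skill
    priors. `call_llm` is a caller-supplied text-in/text-out binding."""
    prompt = (
        "You are optimizing an agent skill package.\n"
        f"Current structure:\n{json.dumps(structure, indent=2)}\n"
        f"Recent edits and rewards:\n{json.dumps(history, indent=2)}\n"
        f"Evaluation diagnostics:\n{json.dumps(diagnostics, indent=2)}\n"
        f"Prior guidance:\n{json.dumps(priors, indent=2)}\n"
        "Propose up to 3 structural edits (add, remove, or reorganize "
        "components) as a JSON list of {\"action\": ..., \"rationale\": ...}."
    )
    return json.loads(call_llm(prompt))
```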
Search Space Prioritization
An initial "comprehension" phase parses a seed skill, extracts its hierarchical structure, and builds a skill profile P, which specifies intended function, success criteria, promising revision axes, and a search prior Θ(P). This enables focused outer-loop search, reducing the otherwise intractable combinatorial burden.
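A hypothetical shape for the resulting profile is sketched below; the field names are illustrative, not the paper's schema:

```python
from dataclasses import dataclass, field

@dataclass
class SkillProfile:
    """Output of the comprehension phase over a seed skill (field names
    are assumptions for this summary)."""
    intended_function: str          # what the skill is meant to do
    success_criteria: list[str]     # how downstream success is judged
    revision_axes: list[str]        # promising directions for edits
    # Maps candidate structural actions (e.g., "merge reference file into
    # SKILL.md") to prior weights, focusing the outer-loop search.
    structure_prior: dict[str, float] = field(default_factory=dict)
```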
Empirical Evaluation: Optimization on Operations Research QA
The framework is validated on the Operations Research Question Answering (ORQA) benchmark, which requires classifying and reasoning over complex optimization problems framed as multiple-choice questions. The seed skill is a two-file package created by an LLM-based skill generator, comprising a SKILL.md (instructions, frontmatter) and a separate reference file.
Experimental Protocol
The ORQA dataset is split into search, confirm, and test cohorts to enable optimization, model selection, and unbiased evaluation.
Two MCTS configurations are compared: a conservative configuration (few iterations, deterministic UCB1 selection) and an exploratory configuration (more iterations, mixed-probability selection with entropy regularization), each with its own convergence settings and admissible actions; the two selection rules are contrasted in the sketch after this list.
The experiment preserves fixed LLMs, agent settings, and token budgets per pipeline stage.
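To make the two selection regimes concrete, here is an illustrative contrast between deterministic UCB1 and a softmax-style probabilistic rule whose temperature keeps the selection distribution high-entropy; the paper does not spell out its exact formulas, so both are assumptions:

```python
import math
import random

def ucb1_select(children, parent_visits, c=1.4):
    """Conservative configuration: deterministic UCB1 pick."""
    def score(child):
        if child["visits"] == 0:
            return float("inf")  # always try unvisited children first
        mean = child["value"] / child["visits"]
        return mean + c * math.sqrt(math.log(parent_visits) / child["visits"])
    return max(children, key=score)

def softmax_select(children, temperature=0.5):
    """Exploratory configuration: sample children in proportion to a softmax
    of their mean rewards; higher temperature means higher entropy."""
    means = [ch["value"] / max(ch["visits"], 1) for ch in children]
    weights = [math.exp(m / temperature) for m in means]
    return random.choices(children, weights=weights, k=1)[0]
```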
Quantitative Results
Local Maximum: Both configurations achieve a peak reward of θ0 on the search split.
Confirm Split: Configuration B (exploratory) yields a higher confirmation mean score (θ1 versus θ2).
Held-Out Test Improvement: The baseline (seed) skill scores θ3 on test split; the optimized skill achieves θ4, for an absolute improvement of θ5.
This result reflects a statistically significant yet moderate improvement over the already strong baseline, demonstrating the practical value of structured skill optimization.
Qualitative Analysis
Structural modifications by the outer loop are synergistic with inner-loop content adjustment:
The winning skill consolidates essential reference guidance (e.g., question-type triage) from a separate file into SKILL.md, reducing fragmentation and improving agent access to crucial instructions.
New sections (explicit triage checklist) and frontmatter clarifications reduce ambiguity and support better error-checking.
Refined content imposes stricter output formatting and more explicit elimination heuristics, reducing agent confusion and improving answer reliability.
Figure 2, which visualizes the MCTS search tree, shows that several branching strategies were considered and pruned, with the search converging on the structure-content configuration that maximized empirical downstream task performance.
Theoretical and Practical Implications
The methodology has several implications:
Generalizability: The bilevel decomposition applies to arbitrary agent skill definitions, orthogonal to the specific content language or task substrate.
Adaptability: The coupling of LLM-guided proposal and evaluation with MCTS enables dynamic exploration/exploitation scheduling, modulated for reward noise and search depth.
Skill Engineering as Discrete Optimization: The explicit handling of structure-content interdependence enables direct optimization of agent-level artifacts, which extends prior art in code-based workflow tuning and LLM tool selection.
Robustness: Conservative (pessimistic) return criteria for content refinement help control LLM stochasticity and evaluation noise—a significant challenge in artifact-level optimization.
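For instance, a pessimistic acceptance rule might adopt a new content revision only when its lower confidence bound beats the incumbent's. The sketch below assumes independent reward samples per candidate; the paper's exact criterion may differ:

```python
import math

def lower_confidence_bound(rewards, z=1.0):
    """Mean minus z standard errors: a pessimistic estimate of true reward."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / max(n - 1, 1)
    return mean - z * math.sqrt(var / n)

def accept_revision(incumbent_rewards, candidate_rewards, z=1.0):
    """Adopt the candidate only when its pessimistic estimate exceeds the
    incumbent's, guarding against lucky draws under LLM stochasticity."""
    return (lower_confidence_bound(candidate_rewards, z)
            > lower_confidence_bound(incumbent_rewards, z))
```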
Directions for Future Research
Potential future extensions include:
Integrating richer simulation optimization methods for reward estimation and uncertainty quantification, beyond basic lower confidence bounds.
Scaling to multi-agent, multi-skill settings, where dependencies and interoperability between skills must be optimized jointly.
Incorporating adversarial or safety-oriented constraints, especially in settings where security and malicious skill injection are concerns.
Leveraging active learning and meta-optimization over search priors, particularly as skill benchmarks (e.g., SkillsBench) are further developed.
Conclusion
This work provides a rigorous framework for LLM agent skill optimization, formulating the problem as a bilevel search over combinatorial structure spaces and high-dimensional content instantiations. By employing MCTS in the outer loop and LLM-guided, family-specific content refinement in the inner loop, the proposed method enables automated synthesis of high-quality, task-specialized skill packages. Empirical results on ORQA validate the efficacy of this approach, achieving measurable improvements over LLM-generated seed skills. The framework sets the stage for more principled, scalable agent skill engineering and has broad applicability across domains where LLM agents interact with structured artifact ecosystems.