Arbor: Autonomous Optimization Framework
- Arbor is a framework for autonomous research that uses Hypothesis Tree Refinement to guide cumulative, globally coordinated optimization.
- It formalizes research as a tuple (M₀, O, E_dev, E_test), iteratively refining artifacts via a coordinator–executor architecture and robust held-out admission.
- Empirical results show that Arbor outperforms baselines in tasks like optimizer design and data synthesis, demonstrating efficiency and scalability.
Arbor is a generalist framework for autonomous research and optimization based on the principle of Hypothesis Tree Refinement (HTR): a persistent, tree-structured memory that connects hypotheses, artifacts, empirical evidence, and distilled semantic insights across long research horizons. Developed in the context of agentic AI for scientific and engineering tasks, Arbor transforms traditional sequential local search into a cumulative, globally coordinated process where learned lessons, both positive and negative, propagate persistently across time and across the entire space of attempted modifications and strategies. The design supports the full automation of open-ended research workflows, from strategy generation and experiment execution to evidence integration and robust selection of improvements, without requiring step-level human supervision (Jin et al., 10 Jun 2026).
1. Formalization of Autonomous Optimization in Arbor
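Arbor formalizes an autonomous research or optimization problem, referred to as Autonomous Optimization (AO), as the tuple
$$(M_0, O, E_{\text{dev}}, E_{\text{test}})$$
where $M_0$ is the initial artifact (e.g., codebase, harness, dataset), $O$ is the objective (e.g., maximize pass rate, minimize loss), $E_{\text{dev}}$ provides development-time feedback for use during the search, and $E_{\text{test}}$ is a fully held-out evaluator invoked only for terminal selection.
The agent’s goal is to generate a set of candidate artifacts $\mathcal{C}$, evaluate them using only development feedback for search, and return
$$M^{*} = \arg\max_{M \in \mathcal{C}} E_{\text{test}}(M)$$
where $\mathcal{C} = \{M_0, M_1, \dots, M_T\}$. Each iteration produces a new artifact via
$$M_{t+1} = \pi(M_t, e_t)$$
with update policy $\pi$ driven by the current artifact $M_t$ and development evidence $e_t$ obtained from $E_{\text{dev}}$.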
This setting generalizes both classical black-box optimization and experiment-driven research. Unlike local search, which typically only maintains the current artifact, Arbor persists a hierarchical state over all attempted directions and their empirical outcomes, enabling structured experimentation and long-horizon information reuse.
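The formalization above can be illustrated with a minimal sketch. It is not Arbor's implementation: the names `AOProblem`, `run_search`, and `propose` are hypothetical, artifacts are reduced to plain floats, and the update policy is a toy function. The only point is that the search loop reads development feedback, while the held-out evaluator is consulted once for the final selection.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class AOProblem:
    """The tuple (M_0, O, E_dev, E_test); field names are illustrative."""
    m0: Any                          # initial artifact M_0
    maximize: bool                   # direction of the objective O
    e_dev: Callable[[Any], float]    # development-time feedback
    e_test: Callable[[Any], float]   # held-out evaluator


def run_search(problem: AOProblem,
               propose: Callable[[Any, float], Any],
               steps: int) -> Tuple[Any, List[Any]]:
    """Search using E_dev only, then select once using E_test."""
    candidates = [problem.m0]
    current = problem.m0
    for _ in range(steps):
        evidence = problem.e_dev(current)     # development evidence e_t
        current = propose(current, evidence)  # M_{t+1} = pi(M_t, e_t)
        candidates.append(current)
    pick = max if problem.maximize else min
    return pick(candidates, key=problem.e_test), candidates


# Toy instance: the artifact is a number and the two evaluators disagree slightly.
problem = AOProblem(
    m0=0.0,
    maximize=True,
    e_dev=lambda x: -(x - 3.0) ** 2,
    e_test=lambda x: -(x - 3.2) ** 2,
)
best, candidates = run_search(problem, propose=lambda x, ev: x + 1.0, steps=5)
```

In this toy run the candidates are 0.0 through 5.0 in unit steps, and the terminal selection returns 3.0, the candidate closest to the held-out optimum.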
2. Hypothesis Tree Refinement (HTR): Structural Memory and Evidence Flow
Central to Arbor is persistent Hypothesis Tree Refinement (HTR), which serves as dynamic memory and strategic backbone throughout the autonomous workflow. Every node $n$ in the tree is a tuple
$$n = (h, s, c)$$
where $h$ is the hypothesis (natural language description of the intended modification), $s$ is a semantic insight distilled from the resulting evidence (success, failure, boundary found, etc.), and $c$ contains context metadata (status, code ref, performance metrics).
The tree evolves via coordinated node expansions (proposing new hypotheses under current frontier nodes), executor runs (implementing each hypothesis in isolation and returning structured evidence tuples), and recursive upward abstraction, in which each parent’s insight is re-distilled from the insights of its children. At every step, new evidence is written to the respective node, reviewed to propagate reusable semantic lessons up the tree, and used to prune or prioritize branches. This enables both breadth-first ("broad mechanism") and depth-first ("boundary probing") exploration and ensures that lessons are retained and reused across distant subareas of the search space.
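A hypothesis tree of this kind might be represented as follows. This is a sketch under stated assumptions: the `Node` fields, `record_evidence`, and especially `summarize` are hypothetical names, and `summarize` is a string-joining placeholder for whatever distillation step the real system uses (presumably model-driven).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    hypothesis: str                                       # natural-language hypothesis
    insight: str = ""                                     # distilled semantic insight
    meta: Dict[str, object] = field(default_factory=dict) # status, code ref, metrics
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> "Node":
        child = Node(hypothesis=hypothesis, parent=self)
        child.meta["status"] = "pending"
        self.children.append(child)
        return child


def summarize(insights: List[str]) -> str:
    """Placeholder for the distillation step (assumption: simple join)."""
    return " | ".join(insights)


def record_evidence(node: Node, insight: str, dev_score: float) -> None:
    """Write evidence to a node, then re-distill every ancestor's insight."""
    node.insight = insight
    node.meta.update(status="done", dev_score=dev_score)
    ancestor = node.parent
    while ancestor is not None:
        ancestor.insight = summarize([c.insight for c in ancestor.children if c.insight])
        ancestor = ancestor.parent


root = Node("improve the optimizer")
warmup = root.add_child("add learning-rate warmup")
big_lr = root.add_child("raise the peak learning rate")
record_evidence(warmup, "warmup helps early stability", 0.7)
record_evidence(big_lr, "higher peak rate diverges", 0.2)
```

After both calls, `root.insight` holds the combined lesson from the successful and the failed child, which is what later expansions under `root` would read.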
3. Coordinator–Executor Architecture and Algorithmic Loop
Arbor separates the system into a long-lived coordinator (managing global research state, selection, and memory) and many short-lived executors (tasked with operationalizing and empirically testing specific hypotheses). The coordinator iteratively proceeds through:
- Observe: Survey current tree, identify the search frontier and collect synthesized lessons.
- Ideate: For each parent in the frontier, generate candidate child hypotheses, potentially using insights from ancestors and siblings.
- Select: Choose pending nodes to dispatch, based on prioritization policies.
- Dispatch: Launch executor instances in parallel, passing each the precise hypothesis, aggregated insights, $E_{\text{dev}}$, and the current best artifact.
- Backpropagate: Integrate new evidence—development score, artifact, insight—into the tree; abstract lessons upward.
- Decide: Evaluate test-gated merge criteria (e.g., if test score improves upon the incumbent, admit the improvement), prune branches contradicted by the latest evidence, and update the search budget.
Executors are strictly constrained: they only read ancestor insights, cannot view sibling solutions, use solely $E_{\text{dev}}$ for development, and never directly admit branches without the coordinator’s gate; this strictly enforces the dev/test separation central to robust AO.
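One round of the loop above, from Ideate through Backpropagate, could be sketched as below. All names (`Hyp`, `coordinator_round`, `ideate`, `execute`) are hypothetical, the selection policy is simply "first `fanout` pending nodes", and threads stand in for executor instances. Executors receive only ancestor insights and the development evaluator, mirroring the constraints described above; the Decide step is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Hyp:
    text: str
    parent: Optional["Hyp"] = field(default=None, repr=False)
    insight: str = ""
    dev_score: Optional[float] = None
    children: List["Hyp"] = field(default_factory=list)


def ancestor_insights(node: Hyp) -> List[str]:
    """Collect insights on the path to the root; siblings are never visited."""
    out, cur = [], node.parent
    while cur is not None:
        if cur.insight:
            out.append(cur.insight)
        cur = cur.parent
    return out


def coordinator_round(frontier, ideate, execute, e_dev, best_artifact, fanout=2):
    # Ideate: propose child hypotheses under each frontier node.
    pending = []
    for parent in frontier:
        for text in ideate(parent):
            child = Hyp(text=text, parent=parent)
            parent.children.append(child)
            pending.append(child)
    # Select: trivial policy (assumption), take the first `fanout` nodes.
    selected = pending[:fanout]

    # Dispatch: run executors in parallel with a restricted view of the tree.
    def run(node):
        return execute(node.text, ancestor_insights(node), e_dev, best_artifact)

    with ThreadPoolExecutor(max_workers=fanout) as pool:
        results = list(pool.map(run, selected))
    # Backpropagate: write development score and insight to each node.
    for node, (_artifact, score, insight) in zip(selected, results):
        node.dev_score, node.insight = score, insight
    return [(node, artifact) for node, (artifact, _s, _i) in zip(selected, results)]


def toy_execute(text, insights, e_dev, best):
    artifact = best + "+" + text
    return artifact, e_dev(text), f"tried {text} with {len(insights)} ancestor insight(s)"


root = Hyp("root", insight="keep changes small")
out = coordinator_round([root], lambda p: [p.text + "/a", p.text + "/b", p.text + "/c"],
                        toy_execute, e_dev=len, best_artifact="M0", fanout=2)
```

Here three children are proposed but only two are dispatched, so the third stays pending with no score for a later round.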
4. Held-Out Admission, Robustness, and Overfitting Prevention
A defining feature of Arbor is its robust admission gate: only changes that demonstrably improve the held-out test metric are merged into the "best" artifact, even if development feedback suggested large local gains. Empirically, this discipline prevents overfitting to development artifacts and ensures genuine generalization. In tasks such as Terminal-Bench 2.0 and BrowseComp, Arbor’s test gains (+7.55% and +22.34% absolute over baseline, respectively) notably exceeded those of Codex and Claude Code, even when the latter achieved higher development scores, a result attributed to Arbor’s explicit dev/test protocol (Jin et al., 10 Jun 2026).
This approach also supports principled pruning: tree branches whose descendant insights indicate no likely path to test improvement are pruned, rate-limiting resource expenditure on unpromising research directions.
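The admission rule can be sketched in a few lines. The function name `admit`, the `min_gain` threshold, and the toy score tables are illustrative assumptions rather than details from the source; the sketch only shows that promotion depends on the held-out score, not on the development score.

```python
def admit(incumbent, candidate, e_test, maximize=True, min_gain=0.0):
    """Return the candidate only if it beats the incumbent on the held-out metric."""
    inc_score = e_test(incumbent)
    cand_score = e_test(candidate)
    gain = cand_score - inc_score if maximize else inc_score - cand_score
    return candidate if gain > min_gain else incumbent


# Toy scores: candidate "A" looks best on dev but regresses on the held-out set.
dev_scores = {"M0": 0.50, "A": 0.90, "B": 0.60}
test_scores = {"M0": 0.50, "A": 0.45, "B": 0.58}

best = "M0"
for candidate in sorted(["A", "B"], key=dev_scores.get, reverse=True):
    best = admit(best, candidate, test_scores.get)
```

Candidate `A` is rejected despite its development score, and `B` is admitted, so `best` ends as `"B"`.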
5. Empirical Results and Task Coverage
Arbor has demonstrated state-of-the-art Autonomous Optimization performance across six open real-world research tasks, including model training (optimizer/architecture design), harness engineering (complex shell and browser agents), and data synthesis (search-agent and math-reasoning datasets). On each, Arbor achieved the best held-out result among all compared agentic frameworks. Notably, its relative held-out gain averaged more than 2.5× above competitors under equivalent resource budgets, as summarized in the following excerpted results:
| Task (Metric) | Baseline | Codex | Claude Code | Arbor |
|---|---|---|---|---|
| Optimizer Design (steps ↓) | 3325 | 3275 | 3287.5 | 3237.5 |
| Terminal-Bench (pass ↑) | 69.81% | 73.59% | 71.70% | 77.36% |
| BrowseComp (acc ↑) | 45.33% | 50.00% | 53.33% | 67.67% |
| SearchAgent DataSynth (gap ↑) | 5.00 | 9.00 | 12.00 | 18.00 |
| MathReason DataSynth (gap ↑) | 1.04 | 6.25 | 8.33 | 20.83 |
Further, Arbor reached 86.36% “Any-Medal” on MLE-Bench Lite (with GPT-5.5), a leading result within the protocol (Jin et al., 10 Jun 2026).
Ablation analyses indicate that both persistent structural memory and upward lesson propagation are necessary: removing only feedback drops performance to 54.54% (Any-Medal), and removing only the tree (i.e., flat search) yields 63.64%. Both hierarchy and semantic abstraction therefore appear critical for robust AO.
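The absolute gains quoted earlier can be checked directly against the table. The short script below recomputes them for the two percentage-valued harness tasks and also computes the ablation drops, under the assumption that the 86.36% Any-Medal figure is the full-system reference for both ablations. The helper name `absolute_gains` is illustrative.

```python
# Held-out scores (%) copied from the table above.
SCORES = {
    "Terminal-Bench": {"Baseline": 69.81, "Codex": 73.59, "Claude Code": 71.70, "Arbor": 77.36},
    "BrowseComp": {"Baseline": 45.33, "Codex": 50.00, "Claude Code": 53.33, "Arbor": 67.67},
}


def absolute_gains(row):
    """Percentage-point gain of each system over the baseline."""
    base = row["Baseline"]
    return {name: round(score - base, 2) for name, score in row.items() if name != "Baseline"}


gains = {task: absolute_gains(row) for task, row in SCORES.items()}

# Ablation on MLE-Bench Lite (Any-Medal, %); the full-system reference is an assumption.
FULL, NO_FEEDBACK, NO_TREE = 86.36, 54.54, 63.64
ablation_drop = {
    "no feedback": round(FULL - NO_FEEDBACK, 2),
    "no tree": round(FULL - NO_TREE, 2),
}

for task, g in gains.items():
    print(task, g)
print(ablation_drop)
```

The Arbor entries come out to 7.55 and 22.34 percentage points, matching the figures cited in the held-out admission discussion; under the stated assumption, the ablation drops are 31.82 and 22.72 points.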
6. Design Principles, Scalability, and Strategic Insights
Key elements underlying Arbor’s effectiveness include:
- Cumulative state: Persistent tree structures accumulate both successful and failed experiments, supporting long-horizon and non-myopic search.
- Semantic lesson propagation: Recursive evidence abstraction ensures that both fine-grained tuning and high-level strategic mechanisms are efficiently reused across many generations.
- Cost-awareness: Token consumption (20–43M tokens across major experiments) is managed so as to prioritize valuable hypothesis expansions and controlled executor fanout, making scaling feasible for large, resource-intensive tasks.
- Robust AO via held-out gates: The GitMergeBranch (test-only promotion) enforces research integrity by guaranteeing that empirically admitted improvements generalize beyond agentic or search-specific artifacts.
Together, these design choices position Arbor as an operational blueprint for generalist autonomous research and optimization, well-suited to diverse scientific and engineering domains requiring auditable, long-horizon, and semantically informed experimentation.
7. Outlook and Implications for Autonomous Research
The Arbor framework articulates a scalable and general protocol for AO: by formalizing tasks as tuples $(M_0, O, E_{\text{dev}}, E_{\text{test}})$, maintaining durable hypothesis trees, separating roles for coordination and execution, and enforcing evidence-based admission, it unifies experiment planning, semantic reasoning, and robust optimization. Empirical results confirm that this approach materially accelerates discovery, increases sample efficiency, and improves generalization versus standard agentic baselines.
A plausible implication is that future autonomous research paradigms, especially those involving complex, open-ended scientific or engineering landscapes, will increasingly rely on persistent structural memories (such as HTR) and evidence-propagation mechanisms to realize tractable, robust, and explainable research workflows (Jin et al., 10 Jun 2026).