
Behavior Best-of-N (bBoN) Framework

Updated 3 October 2025
  • Behavior Best-of-N (bBoN) is a framework that expands the solution space by generating diverse candidate trajectories and employs narrative evaluation to select the optimal outcome.
  • It converts detailed action sequences into interpretable behavior narratives using vision-language models to extract task-relevant facts for robust comparison.
  • The approach achieves state-of-the-art performance on benchmarks like OSWorld by combining broad behavioral exploration with structured, comparative judging.

Behavior Best-of-N (bBoN) is a scaling and decision framework for selecting among multiple generated candidate solutions—trajectories, action sequences, or rollouts—produced by computer-use agents (CUAs) in interactive environments, particularly those involving complex, long-horizon digital tasks. The approach expands the solution space by sampling multiple diverse agent behaviors, then employs structured, narrative-centered evaluation mechanisms to select the candidate that best fulfills the task objective. This strategy dramatically increases robustness and success rates in high-variance environments with complex chains of state transitions. bBoN establishes new state-of-the-art success rates on benchmarks such as OSWorld by combining wide behavioral exploration with principled, interpretable selection, approaching the reliability of advanced human users (Gonzalez-Pumariega et al., 2 Oct 2025).

1. Motivation and Definition

The motivation for Behavior Best-of-N arises from the high error rates, brittleness, and variance seen in contemporary CUAs when challenged with long-horizon, real-world tasks. Traditional approaches that execute a single agent rollout are susceptible to cascading errors and exhibit limited robustness in the face of partial observability, UI complexity, and stochastic event chains. bBoN reframes the agent execution problem as a “behavioral selection” task: rather than relying on any single agent-generated solution, the method generates multiple independent candidate rollouts and applies systematic, narrative-based selection to identify the candidate most likely to satisfy the end-user’s instruction.

Formally, from a pool $C = \bigcup_{m=1}^{M} \{ \tau_m^{(n)} \}$ of candidate trajectories, bBoN seeks to select

$$\hat{\tau} \in \arg\max_{\tau \in C} R(\tau, I)$$

where $R(\tau, I)$ is the environment's reward function for trajectory $\tau$ under instruction $I$. The reward is estimated not directly from the dense transitions but from a compacted summary known as the behavior narrative.
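The selection rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `keyword_reward` is a toy stand-in for the narrative-based reward estimate, and the `Trajectory` type is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    """A candidate rollout; `narrative` is its compacted behavior summary."""
    agent_id: int
    narrative: str

def best_of_n(candidates: List[Trajectory],
              instruction: str,
              reward_estimate: Callable[[str, str], float]) -> Trajectory:
    """Select tau_hat = argmax over C of R(tau, I), where R is estimated
    from the behavior narrative rather than the dense transitions."""
    return max(candidates,
               key=lambda t: reward_estimate(t.narrative, instruction))

# Toy reward: count instruction keywords that appear in the narrative.
# A real system would query an LLM judge instead (see Section 3).
def keyword_reward(narrative: str, instruction: str) -> float:
    return sum(w in narrative.lower() for w in instruction.lower().split())

pool = [Trajectory(0, "Opened the file manager but closed it without renaming."),
        Trajectory(1, "Opened the file manager, renamed report.txt to final.txt.")]
best = best_of_n(pool, "rename report.txt to final.txt", keyword_reward)
```

In practice the reward estimate is not a keyword count but the comparative judge described below; the point here is only the argmax-over-candidates structure.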

2. Behavior Narrative Generation

A key innovation in bBoN is the transformation of dense, stepwise agent trajectories into behavior narratives—succinct and interpretable summaries of the behavioral effect of each action. For a sequence $\tau = (s_0, a_0, s_1, \ldots, a_{t-1}, s_t)$, the process involves:

  • Using a vision-language foundation model to process each triple $(s_i, a_i, s_{i+1})$, where $s_i$ are screenshots and $a_i$ are agent actions.
  • Extracting task-relevant “facts” $\phi_i$ denoting observable, meaningful state changes induced by actions—e.g., file opened, window closed, menu navigated.
  • Aggregating the initial state, the sequence of $\phi_i$, and final state into the behavior narrative $\tilde{\tau} = (s_0, \phi_0, \phi_1, \ldots, \phi_{t-1}, s_t)$.

Implementation details include targeted visual augmentations (e.g., marking click sites, zoomed crops of post-action states) to structure input to the vision-LLM for precise fact extraction. This compact representation removes irrelevant noise (such as unchanged regions in a GUI) and preserves only salient state transitions, enabling scalable and interpretable downstream evaluation.
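The narrative-construction step can be sketched as follows. This is a schematic under stated assumptions: screenshots are represented as opaque strings, and `mock_extract_fact` stands in for the vision-language model call (which, per the text above, would receive annotated screenshots with click markers and zoomed crops).

```python
from typing import Callable, List, Tuple

# A (s_i, a_i, s_{i+1}) triple; screenshots stand in as opaque strings here.
Step = Tuple[str, str, str]

def build_narrative(trajectory: List[Step],
                    extract_fact: Callable[[str, str, str], str]) -> List[str]:
    """Map each (state, action, next_state) triple to a fact phi_i and
    aggregate (s_0, phi_0, ..., phi_{t-1}, s_t) into a behavior narrative."""
    s0 = trajectory[0][0]
    st = trajectory[-1][2]
    facts = [extract_fact(s, a, s_next) for s, a, s_next in trajectory]
    return [f"initial state: {s0}", *facts, f"final state: {st}"]

# Hypothetical stand-in for the vision-language model.
def mock_extract_fact(s: str, a: str, s_next: str) -> str:
    return f"after '{a}', the screen changed from '{s}' to '{s_next}'"

steps = [("desktop", "double-click Files", "file manager open"),
         ("file manager open", "right-click report.txt", "context menu open")]
narrative = build_narrative(steps, mock_extract_fact)
```

The compact list of facts, bracketed by initial and final states, is what the downstream judge consumes instead of the raw screenshot sequence.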

3. Candidate Selection via Comparative Judging

Selection among candidate narratives is performed using a comparative evaluation mechanism designed to prioritize trajectories that best demonstrate task completion and robustness:

  • All candidate behavior narratives for a given instruction are presented together to an automated judge (typically an LLM) in a multiple-choice question (MCQ) format.
  • The judge is prompted to compare the set of narratives holistically, citing evidence from the action–effect “facts” when deciding which candidate most faithfully and completely achieves the task.

This comparative, context-sensitive selection avoids the pitfalls of isolated, independent scoring—enabling the judge to focus on relative strengths and partial completions across candidates, and to synthesize evidence when individual rollouts succeed on complementary subtasks.

This contrasts with baseline approaches that attempt to rank or score agent rollouts in isolation—an approach found to be less reliable, especially as the task complexity or the number of candidate solutions increases.
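The MCQ-style presentation described above can be sketched as a prompt builder. The prompt wording here is illustrative, not the paper's exact template; a real system would send the resulting string to an LLM judge.

```python
from string import ascii_uppercase
from typing import List

def build_mcq_prompt(instruction: str, narratives: List[str]) -> str:
    """Present all candidate behavior narratives together as one
    multiple-choice question, asking the judge to cite action-effect
    facts as evidence when picking the best candidate."""
    options = "\n\n".join(
        f"({ascii_uppercase[i]}) {n}" for i, n in enumerate(narratives))
    return (
        f"Task instruction: {instruction}\n\n"
        f"Candidate behavior narratives:\n\n{options}\n\n"
        "Compare the candidates holistically. Citing specific facts from "
        "the narratives as evidence, answer with the letter of the "
        "candidate that most faithfully and completely achieves the task.")

prompt = build_mcq_prompt(
    "rename report.txt to final.txt",
    ["Opened Files app; no rename performed.",
     "Opened Files app; renamed report.txt to final.txt."])
```

Presenting all narratives in one context is what lets the judge reason about relative strengths, rather than scoring each rollout in isolation.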

4. Benchmark Performance and Empirical Results

bBoN demonstrates significantly improved performance across multiple complex interactive computing environments:

  • OSWorld (Ubuntu interface): Achieves a new state-of-the-art 100-step task success rate of 69.9%, exceeding the previous best (59.9%) and approaching human capability (72%).
  • WindowsAgentArena: Delivers a 6.4% absolute improvement over the Agent S3 baseline in a zero-shot generalization setting.
  • AndroidWorld: Yields a 3.5% improvement over prior screenshot-only methods.

These gains are attributed to the method’s effective combination of broad sample diversity (“wide scaling”) and narrative-based comparative evaluation, which reliably surfaces partial or full successes that may be missed by any single-model agent. The performance generally improves monotonically with the number of candidate rollouts, underscoring the effectiveness of wide exploration coupled with structured aggregation.

5. Design Choices and Ablation Analysis

Empirical ablations isolated several key contributors to bBoN’s effectiveness:

  • Behavior narrative vs. naive representations: Ablations show that converting trajectories to compact narratives (as opposed to naive captioning or screenshot-only approaches) yields several percentage points of success rate gain.
  • Selection mechanism—MCQ vs. independent ranking: Comparative MCQ-style selection outperforms independent ranking/scoring, with superior scalability (as the number of candidates grows) and accuracy in complex, multi-step tasks.
  • Citation of narrative evidence: Prompting the judge to explicitly refer to narrative “facts” during evaluation confers a further (albeit smaller) benefit, attributed to increased interpretability and error analysis.
  • Agent framework improvements: Integration of an improved agent base (e.g., Agent S3, which brings better state representation and search primitives) coupled with bBoN further boosts robustness.

6. Generalization and Scalability

bBoN generalizes effectively across operating systems (Linux, Windows, Android) and task domains, reflecting its independence from specific UI conventions or platform idiosyncrasies:

  • The modular separation of trajectory generation, narrative extraction, and comparative selection enables portability.
  • The approach’s “wide scaling” element—increasing candidate pool size instead of model size—yields gains even with relatively modest base agents.
  • Structured candidate selection ensures that increased proposal diversity translates into tangible performance improvement, rather than simply amplifying variance.

7. Implications, Limitations, and Future Directions

The bBoN framework illustrates that, for high-stakes, high-variance, long-horizon agentic tasks, broad behavioral exploration combined with structured, narrative-driven candidate selection dramatically increases robustness and reliability. The reliance on fact-based summarization and comparative judging yields interpretable selection mechanisms, with potential downstream uses in explanation, debugging, and semi-automated human-in-the-loop deployment.

A plausible implication is that similar narrative-based, comparative aggregation approaches may extend to other high-dimensional behavioral settings, including multi-agent scenarios, simulation environments, or continuous control domains.

Some limitations include the cost of high compute during candidate generation, the fidelity of the narrative extraction model (which must accurately capture action–effect semantics), and the risk of judge errors when narratives are ambiguous or incomplete. Future research directions include:

  • Application to real-world desktop environments with shared resources and side effects.
  • Enhancements to the narrative extraction pipeline (e.g., better visual-language alignment).
  • Augmentation of the judge with richer reasoning capabilities.
  • Dynamic allocation of sample budgets or adaptive candidate selection strategies to further trade off compute and task performance.

bBoN provides a scalable, practical, and interpretable blueprint for robust decision making in complex computer-use agent settings, with demonstrated generalization and near-human performance on challenging digital tasks (Gonzalez-Pumariega et al., 2 Oct 2025).

