OctoBench: Dual Benchmarking Frameworks
- OctoBench denotes two distinct frameworks: one evaluating scaffold-aware agentic coding and another benchmarking optical turbulence predictions with standard error metrics.
- The coding framework uses Docker environments and a checklist-based protocol to measure multi-turn scaffold compliance, revealing a gap between average compliance and perfect adherence.
- The optical turbulence framework standardizes model evaluation using datasets, regression tasks, and metrics like RMSE and MAE to ensure reproducible, comparative analyses.
OctoBench denotes two large-scale benchmarking frameworks: one for the evaluation of scaffold-aware instruction following in repository-grounded agentic coding, and another for the principled comparison of models in optical turbulence strength prediction. Despite a shared name, these frameworks serve distinct research objectives. The following entry delineates both systems, referencing their technical architectures, benchmark protocols, and implications for methodological advancement in their respective domains (Ding et al., 15 Jan 2026, Jellen et al., 2024).
1. Benchmarking Scaffold-Aware Agentic Coding: System Design and Purpose
The OctoBench framework for repository-grounded agentic coding targets the measurement of instruction-following capabilities for LLM-based software agents operating within persistent scaffolds. This evaluation addresses a known deficit in current benchmarks, which focus primarily on single-turn, explicit instructions or outcome correctness, thus missing process-level errors involving scaffold or policy compliance. OctoBench operationalizes heterogeneous and long-lived constraints found in practical agentic coding environments, such as repository policy files, tool schemas, and contextual reminders, requiring agents to maintain compliance across multi-turn editing workflows (Ding et al., 15 Jan 2026).
2. Scaffold Types, Environment Construction, and Task Taxonomy
OctoBench comprises 34 self-contained Docker environments and 217 coding tasks. Task construction begins with 72 handcrafted “seed” instances, which are then augmented by model-guided synthesis and human verification. The three supported scaffold paradigms are:
| Scaffold Type | Framework/Agent | Notable Mechanisms |
|---|---|---|
| Claude Code | Anthropic terminal agent | Injects CLAUDE.md (≥2.0.69) |
| Kilo | VS Code (extension) | AGENTS.md convention (0.10.2) |
| Droid | Factory.ai end-to-end agent | MCP support, AGENTS.md ingestion (0.42.2) |
Tasks are anchored to six instruction-source categories: Skill.md, CLAUDE.md, AGENTS.md, system prompt, user query, and memory. Each task targets unambiguous, verifiable constraints and is designed to stress concurrent, competing, and hierarchical rules that agents must obey.
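A task of this shape can be pictured as a small structured record tying constraints to their instruction sources. The sketch below is illustrative only; every field name and value is an assumption, not the benchmark's published schema:

```python
import json

# Hypothetical OctoBench-style task specification (field names assumed):
# a task anchored to several instruction sources, each contributing a
# verifiable constraint the agent must obey across the whole trajectory.
task = {
    "task_id": "seed-007",
    "scaffold": "claude-code",
    "instruction_sources": ["CLAUDE.md", "system prompt", "user query"],
    "constraints": [
        {"source": "CLAUDE.md", "rule": "run the linter before every commit"},
        {"source": "user query", "rule": "edit only files under src/"},
    ],
}
print(json.dumps(task, indent=2))
```

The point of the structure is that competing rules arrive from different sources, so a checker can attribute each failure to the source that was violated.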
3. Checklist-Based Scoring, Observation, and Evaluation Protocol
The benchmark employs 7,098 binary checklist items, averaging 32.7 per instance (median 34), each tagged with one of six check types: compliance (format/style/policy, 79.2%), implementation (12.5%), understanding (4.3%), testing (2.2%), modification (1.3%), and configuration (0.5%). Items are grouped by seven source categories and are dynamically filtered to include only those activated by the observed agent trajectory.
All agent–scaffold interactions are executed in isolated Docker containers under the supervision of a lightweight proxy, which intercepts all LLM calls (messages and tool invocations), producing raw logs. These logs are normalized to a unified JSON schema of { "meta", "tools", "messages" } and paired with the task-specific checklist. Scoring is performed by an LLM-as-judge protocol utilizing consensus among three models (GPT-5.1, Claude-Sonnet-4.5, Gemini-3-Pro), which independently determine binary outcomes for each check.
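The three-judge consensus can be sketched as a strict majority vote per checklist item. The judge names mirror the models above, but the verdict data and vote rule here are illustrative assumptions:

```python
def consensus(verdicts_by_judge):
    """Strict-majority vote over per-check binary verdicts from multiple judges.

    verdicts_by_judge: dict judge_name -> dict check_id -> bool
    Returns dict check_id -> bool (True iff more than half the judges pass it).
    """
    check_ids = set().union(*(v.keys() for v in verdicts_by_judge.values()))
    n_judges = len(verdicts_by_judge)
    result = {}
    for cid in sorted(check_ids):
        votes = sum(bool(v.get(cid, False)) for v in verdicts_by_judge.values())
        result[cid] = votes * 2 > n_judges
    return result

# Hypothetical verdicts from the three judge models on four checks
verdicts = {
    "gpt-5.1":           {"c1": True,  "c2": False, "c3": True, "c4": True},
    "claude-sonnet-4.5": {"c1": True,  "c2": True,  "c3": True, "c4": False},
    "gemini-3-pro":      {"c1": False, "c2": False, "c3": True, "c4": True},
}
print(consensus(verdicts))
```

With three judges, a strict majority means two-of-three, so a single dissenting judge cannot flip a check's outcome.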
Two principal metrics disentangle total compliance from partial credit:
- Instance Success Rate (ISR): $\mathrm{ISR} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big[\,c_{ij}=1\ \text{for all}\ j\in C_i\,\big]$
- Checklist Success Rate (CSR): $\mathrm{CSR} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|C_i|}\sum_{j\in C_i} c_{ij}$

where $c_{ij}\in\{0,1\}$ is the judged outcome of check $j$ on instance $i$ and $C_i$ is the set of activated checks for instance $i$. ISR quantifies the rate of perfect trajectory-level compliance, while CSR measures average per-check adherence across tasks (Ding et al., 15 Jan 2026).
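Given the binary check outcomes, both metrics reduce to a few lines. A minimal sketch over hypothetical trajectories:

```python
def isr_csr(instances):
    """Compute Instance Success Rate and Checklist Success Rate.

    instances: list of lists of binary check outcomes (1 = passed),
    one inner list per task instance.
    """
    n = len(instances)
    # ISR: fraction of instances where every activated check passed
    isr = sum(all(checks) for checks in instances) / n
    # CSR: mean per-instance fraction of passed checks
    csr = sum(sum(checks) / len(checks) for checks in instances) / n
    return isr, csr

# Three hypothetical trajectories: only the first passes every check
outcomes = [[1, 1, 1], [1, 0, 1, 1], [0, 1]]
isr, csr = isr_csr(outcomes)
print(isr, round(csr, 3))  # ISR = 1/3, CSR = (1 + 0.75 + 0.5) / 3 = 0.75
```

The example makes the "scissors gap" concrete: a trajectory can pass most checks yet still count as a failed instance, so CSR can sit far above ISR.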
4. Experimental Findings and Systematic Variance
Eight LLMs, representing both open- and closed-source paradigms (Claude-Opus-4.5, Claude-Sonnet-4.5, Gemini-3-Pro, MiniMax-M2, MiniMax-M2.1, Kimi-K2-Thinking, Doubao-Seed-1.8, ChatGLM-4.6), were benchmarked under a consistent temperature (T=1.0) and scaffold-default decoding settings.
Empirical results demonstrate high CSR (79.8–85.6%), indicating that agents typically pass most individual checks, but a markedly lower ISR (9.7–28.1%), signifying that agents rarely achieve perfect multi-constraint compliance. Notable observations include:
- Category-level differences: Memory and simple system reminders yield the highest CSR/ISR; Skill.md and complex tool-schema constraints exhibit the lowest (Skill ISR as low as 12%).
- Cross-scaffold robustness: Performance varies substantially across scaffold types. For Claude-Opus-4.5:
| Scaffold | ISR | CSR |
|---|---|---|
| Claude Code | 28.4% | 84.4% |
| Kilo | 20.0% | 89.3% |
| Droid | 40.2% | 94.6% |
- Conflict resolution: On 32 controlled instances with conflicting sources, models exhibit source-prioritization biases: some prioritize system-level constraints, others defer to user queries.
- Iterative feedback: Providing explicit feedback (rewritten failed checks as hard constraints) significantly improves ISR, exemplified by ChatGLM-4.6 rising from 21.4% to 38.2% (Δ=+16.8%) and a CSR increase of +2.7%.
These results reveal a persistent “scissors gap” between average compliance (CSR) and perfect adherence (ISR), highlighting current model deficiencies in integrating and hierarchizing heterogeneous, persistent instruction sets (Ding et al., 15 Jan 2026).
5. Implications, Limitations, and Prospects for Scaffold-Aware Agents
The OctoBench coding benchmark establishes that prevailing agent architectures and training regimes are typically ill-equipped to optimize joint compliance across persistent and concurrent rule sets. A plausible implication is that explicit instruction hierarchy modeling and memory-persistence mechanisms represent open research frontiers. Evaluation solely by end-result correctness is insufficient; process-level compliance metrics and trajectory-level auditing must be foregrounded in both training and evaluation protocols.
Ongoing directions include broadening scaffold coverage (incorporating further agentic frameworks, enterprise policy sources), advancing deterministic checking to mitigate judge ambiguity, and fostering reproducible experimentation via public release of environments, specifications, and scoring tools (Ding et al., 15 Jan 2026).
6. Optical Turbulence Modeling: The otbench (“OctoBench”) Framework
A separate system, the otbench package (colloquially referenced as “OctoBench”), provides a rigorous, extensible infrastructure for benchmarking optical turbulence models using standardized tasks, metrics, and real-world data sets (Jellen et al., 2024). Its design reflects similar benchmarking principles—dataset diversity, plug-in model support, and reproducible evaluation—but targets regression and forecasting of the refractive-index structure parameter $C_n^2$.
Key features include:
- Modular architecture with abstract task and model classes.
- Support for Mauna Loa and USNA datasets (NetCDF4 format), with regression and forecasting task instances.
- Standard metrics: RMSE, MAE, MAPE, $R^2$, bias, and linear correlation, each provided with explicit mathematical formulations.
- Baseline models: persistence, climatology, linear forecast, macro-meteorological parametric equations, LightGBM-based GBRT, and PyTorch RNNs.
- Extension hooks for new datasets, tasks, and custom metrics.
- Utilities for time-series and scatter plot visualization, plus tabular reporting.
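The listed metrics admit explicit formulations. A minimal plain-Python sketch (not otbench's actual API; the input values below are hypothetical):

```python
import math

def turbulence_metrics(y_true, y_pred):
    """Standard regression metrics over paired observations and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)          # root-mean-square error
    mae = sum(abs(e) for e in errors) / n                     # mean absolute error
    mape = 100.0 * sum(abs(e / t) for t, e in zip(y_true, errors)) / n
    bias = sum(errors) / n                                    # mean signed error
    mean_t = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot                                # variance explained
    return {"rmse": rmse, "mae": mae, "mape": mape, "bias": bias, "r2": r2}

m = turbulence_metrics([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
print(m)
```

Reporting RMSE, MAE, bias, and $R^2$ together is useful precisely because they disagree: bias can be zero while RMSE is large, as in the example above.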
Practitioners instantiate tasks, fit models, predict, and evaluate via a uniform workflow. The system’s modularity permits extension through subclassing and metric registry updates. Results interpretation emphasizes comparison of error metrics and variance explained, supporting fair benchmarking in optical turbulence prediction (Jellen et al., 2024).
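The persistence and climatology baselines named above reduce to one-liners. A sketch assuming a plain list of scalar $C_n^2$ samples (values hypothetical, not drawn from the otbench datasets):

```python
def persistence_forecast(series):
    """Persistence baseline: forecast the next value as the last observed one."""
    return series[-1]

def climatology_forecast(history):
    """Climatology baseline: forecast as the long-run mean of past observations."""
    return sum(history) / len(history)

# Hypothetical Cn^2 samples (units m^(-2/3))
cn2 = [1.2e-15, 9.8e-16, 1.1e-15, 1.3e-15]
print(persistence_forecast(cn2))   # last observation
print(climatology_forecast(cn2))   # historical mean
```

Trivial as they are, these baselines anchor the comparison: a learned model that cannot beat persistence on short horizons, or climatology on long ones, adds no predictive value.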
7. Conclusions
OctoBench designates two specialized benchmarking suites: one advancing the science of scaffold-aware agentic coding through granular rule-compliance diagnostics, and the other standardizing optical turbulence model evaluation via extensible, reproducible task definition and metric reporting. Both frameworks exemplify the trend toward comprehensive, multi-metric, and reproducible evaluation as foundational to progress in complex, instruction- or data-driven research domains (Ding et al., 15 Jan 2026, Jellen et al., 2024).