OT-Agent: Open Framework for Agentic Data
- OT-Agent is an open framework that constructs training datasets for agentic language models using a six-stage data curation pipeline.
- It rigorously benchmarks diverse task sourcing, mixing, filtering, and teacher model strategies, validated by over 100 ablation studies.
- The framework enhances model generalization across agentic tasks by ensuring balanced diversity through synthetic augmentation and multi-turn filtering.
OpenThoughts-Agent (OT-Agent) is an open framework for constructing training datasets tailored to agentic LLMs—models designed for interactive, multi-turn agentic tasks. Unlike previous open efforts (e.g., SWE-Smith, SERA, Nemotron-Terminal) that focus on narrow agentic benchmarks, OT-Agent presents a comprehensive data curation pipeline, rigorously validated through over 100 ablation studies, to produce fine-tuning corpora that support generalization across diverse agentic benchmarks. The OT-Agent project publicly releases its curation pipeline, processed datasets, experimental data, and models to facilitate open research in agentic model training (Raoof et al., 23 Jun 2026).
1. Data Curation Pipeline
The OT-Agent supervised fine-tuning (SFT) data curation pipeline consists of six stages, each systematically ablated to assess its impact on downstream agent performance. The evaluation protocol fine-tunes Qwen3-8B on 10,000 examples for seven epochs (learning rate , batch size 96, 32K context) and reports the average z-score across SWE-Bench-Verified-100, OT-TBLite-100, and Terminal-Bench 2.0.
1. Task Sourcing:
Ninety-five distinct task generation strategies, including synthetic issue patches (e.g., SWE-Smith), StackExchange human Q&A (SuperUser, Tezos), and coding-contest recasts, are benchmarked. Each source strategy $s$ is scored on benchmark $b$ by the z-score
$$z_{s,b} = \frac{a_{s,b} - \mu_b}{\sigma_b},$$
where $a_{s,b}$ denotes raw accuracy, with $\mu_b$ and $\sigma_b$ as the mean and standard deviation of accuracy on benchmark $b$, respectively. The final rank is obtained via the mean z-score across the three benchmarks. Top strategies include swe-smith (+1.92), stackexchange-superuser (+1.51), stackexchange-tezos (+1.45), and issue-tasks (+1.37).
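The ranking rule described here (a per-benchmark z-score, then the mean across benchmarks) can be sketched in a few lines of Python. This is a minimal illustration rather than the project's code: the function name `rank_sources`, the nested-dict input layout, and the choice of the population standard deviation are assumptions, and the accuracies in the usage lines are made-up placeholders, not reported results.

```python
from statistics import mean, pstdev


def rank_sources(acc):
    """Rank sources by mean per-benchmark z-score.

    acc maps benchmark name -> {source name -> raw accuracy}.
    Returns [(source, mean_z), ...] sorted from best to worst.
    """
    z_scores = {}
    for bench, scores in acc.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population std across sources
        for source, a in scores.items():
            # Guard against a benchmark where every source scores the same.
            z = (a - mu) / sigma if sigma > 0 else 0.0
            z_scores.setdefault(source, []).append(z)
    ranked = [(source, mean(zs)) for source, zs in z_scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Placeholder accuracies for three hypothetical sources on two benchmarks.
toy_acc = {
    "bench_1": {"source_a": 0.10, "source_b": 0.20, "source_c": 0.30},
    "bench_2": {"source_a": 0.05, "source_b": 0.15, "source_c": 0.20},
}
ranking = rank_sources(toy_acc)
```

Because each benchmark is standardized before averaging, a benchmark with a wide raw-accuracy range does not dominate the final rank.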
2. Mixing Tasks:
Top sources are mixed by sampling $10,000/N$ tasks per source, with the number of sources $N$ ablated across several values and two sampling schemes. The Top-4 mix yields superior normalized scores (+0.49 over Top-1).
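A minimal sketch of the equal-share mixing step follows, assuming tasks are held in per-source lists. The function name `mix_top_sources`, its parameters, and the fallback to sampling with replacement for small pools are illustrative assumptions; the sampling schemes actually compared in the ablations are not specified in this text.

```python
import random


def mix_top_sources(tasks_by_source, ranked_sources, n_sources,
                    total=10_000, seed=0):
    """Draw total // n_sources tasks from each of the top n_sources."""
    rng = random.Random(seed)
    per_source = total // n_sources
    mixed = []
    for source in ranked_sources[:n_sources]:
        pool = tasks_by_source[source]
        if len(pool) >= per_source:
            # Enough unique tasks: sample without replacement.
            mixed.extend(rng.sample(pool, per_source))
        else:
            # Assumption: small pools are upsampled with replacement.
            mixed.extend(rng.choices(pool, k=per_source))
    rng.shuffle(mixed)
    return mixed
```

With `n_sources=4` and the default budget, each source contributes 2,500 tasks.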
3. Task Augmentation:
Augmentation strategies—harden, constrain, mixed augmentation, and "trace hints" LLM-driven rewrites—were not found to significantly surpass the non-augmented baseline.
4. Task Filtering:
LLM-based filtering (e.g., selecting tasks with longest GPT-5 solutions, shortest solutions, AskLLM difficulty, embedding-diversity, or random) shows that "longest GPT-5 response" improves mean performance by approximately 3 percentage points.
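The length-based filters can be illustrated as a simple top-k selection. The sketch below assumes each task record already carries the token length of a reference solution under a hypothetical `response_tokens` key; the record layout and the function name are assumptions, not the released pipeline's schema.

```python
def select_by_response_length(tasks, k, longest=True):
    """Keep the k tasks with the longest (or shortest) reference solutions.

    tasks is a list of dicts with a 'response_tokens' integer field.
    """
    ranked = sorted(tasks, key=lambda t: t["response_tokens"], reverse=longest)
    return ranked[:k]


# Hypothetical task records for illustration only.
toy_tasks = [
    {"task_id": "t1", "response_tokens": 120},
    {"task_id": "t2", "response_tokens": 950},
    {"task_id": "t3", "response_tokens": 430},
]
kept_longest = select_by_response_length(toy_tasks, k=2)
kept_shortest = select_by_response_length(toy_tasks, k=2, longest=False)
```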
5. Teacher Model:
Rollout trajectories are generated using various teacher models in the Terminus-2 harness (GPT-5.3-Codex, Kimi K2.5, GLM-4.6, GLM-4.7-AWQ). Despite GPT-5.3-Codex performing best on individual evals, GLM-4.7-AWQ yields higher performance on Terminal-Bench 2.0 (+5 pp), leading to its selection as the final teacher.
6. Rollout Filtering:
Heuristic rollout filters are compared: removing traces below a turn-count threshold, traces with timeouts, or traces with subagent executions. Keeping only traces that meet the minimum-turn threshold yields a +1.25 normalized average improvement, attributed to multi-turn supervision and confirmed by compute-controlled ablations.
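The heuristic rollout filters can be sketched as predicates over trace metadata. The `Trace` fields and the `filter_traces` function are hypothetical names introduced for illustration, and the turn threshold is left as a required argument because its value is an assumption here.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    task_id: str
    num_turns: int
    timed_out: bool = False
    used_subagent: bool = False


def filter_traces(traces, min_turns, drop_timeouts=False, drop_subagent=False):
    """Keep traces with at least min_turns turns, optionally dropping
    timed-out traces and traces that spawned a subagent."""
    kept = []
    for trace in traces:
        if trace.num_turns < min_turns:
            continue
        if drop_timeouts and trace.timed_out:
            continue
        if drop_subagent and trace.used_subagent:
            continue
        kept.append(trace)
    return kept


# Hypothetical traces for illustration only.
toy_traces = [
    Trace("a", num_turns=2),
    Trace("b", num_turns=12, timed_out=True),
    Trace("c", num_turns=8),
]
multi_turn_only = filter_traces(toy_traces, min_turns=5)
```

Keeping each criterion behind its own flag mirrors how the filters are ablated one at a time.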
2. Construction of the 100K-Example OT-Agent-v2 Training Set
Following the pipeline, the 100,000-example training set (OT-Agent-v2) is composed by sampling equally from swe-smith, issue-tasks, stackexchange-superuser, and stackexchange-tezos (after synthetic augmentation). The smallest source (Tezos, 997 problems) underwent instruction rewrites, expanding it to approximately 21,000 surface forms, which ensures balanced task diversity and overcomes the bottleneck of limited prompt variation.
| Source Name | No. of Examples | Augmentation Applied |
|---|---|---|
| swe-smith | 25,000 | None |
| issue-tasks | 25,000 | None |
| stackexchange-superuser | 25,000 | None |
| stackexchange-tezos | 25,000 | Instruction rewrites (11×) |
All trajectories are generated by GLM-4.7-AWQ and filtered with the minimum-turn rollout filter. This recipe maximizes task-description diversity, because the primary scaling constraint is surface-form diversity rather than raw data quantity.
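The arithmetic behind the equal split makes the role of augmentation concrete. The sketch below uses only figures stated in this section (100,000 examples, four sources, 997 Tezos problems, roughly 21,000 rewritten surface forms); the helper names are illustrative assumptions.

```python
def per_source_quota(total_examples, sources):
    """Equal share of the training set for each source."""
    return {source: total_examples // len(sources) for source in sources}


def reuse_factor(quota, unique_prompts):
    """Average number of examples drawn per unique prompt."""
    return quota / unique_prompts


sources = ["swe-smith", "issue-tasks",
           "stackexchange-superuser", "stackexchange-tezos"]
quota = per_source_quota(100_000, sources)

# Without rewrites, each of the 997 Tezos prompts would be reused ~25 times.
reuse_raw = reuse_factor(quota["stackexchange-tezos"], 997)
# With ~21,000 surface forms, reuse drops to roughly 1.2 per prompt.
reuse_augmented = reuse_factor(quota["stackexchange-tezos"], 21_000)
```

Under these numbers, the rewrites are what let the smallest source fill its 25,000-example share without heavy repetition of identical prompts.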
3. Empirical Setup and Ablation Highlights
All empirical analyses are implemented via a Llama-Factory fork enabling ALST long-sequence support, distributed training using DeepSpeed ZeRO-3, and deployment across 24 H100 nodes. Benchmark environments incorporate Harbor and the Terminus-2 harness within Daytona sandboxes, and all evaluations are conducted over three stochastic re-runs.
Held-in benchmarks include SWE-Bench-Verified-100, OT-TBLite-100, and Terminal-Bench 2.0; held-out (OOD) benchmarks include Aider Polyglot, BFCL-Parity, MedAgentBench, GAIA-127, and FinanceAgent-Terminal. The ablation studies validate that task source choice has the dominant effect on performance, as reflected in the percentage-point spread between ranked sources. Non-monotonic effects are observed in teacher selection, and multi-turn episode filtering is pivotal for learning longer-horizon agentic behaviors. Synthetic augmentation allows the dataset to scale past 31.6K examples, the plateau reached when merely repeating the top sources.
4. Fine-Tuning Procedures and Hyperparameters
Qwen3-32B serves as the base model for all large-scale experiments, using the "thinking" chat template. Optimization employs AdamW (weight decay 0.04) with a cosine-annealed learning rate (10% warmup), a context window of 32,768 tokens, and a global batch size of 96. All model weights are in BF16, and training uses DeepSpeed ZeRO-3 on 24 nodes, each with 4 GH200 accelerators.
Scaling hyperparameters (the max grad norm is also set per data scale):
- 100K rows: 5 epochs (approx. 5 hours)
- 31.6K rows: 5 epochs (approx. 3 hours)
- 10K/3.16K rows: 7 epochs (0.5–1.5 hours)
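The cosine schedule with 10% warmup mentioned in this section can be written as a small standalone function. This is a generic sketch, not the training code: linear warmup, a floor of zero, and the name `lr_at_step` are assumptions, and the peak learning rate is left as a parameter because its value is not given here.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup over the first warmup_frac of training (assumption),
    followed by cosine annealing from peak_lr down to min_lr.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Shape of the schedule for a hypothetical 1,000-step run with peak_lr=1.0.
schedule = [lr_at_step(s, 1000, peak_lr=1.0) for s in range(1000)]
```

The schedule rises to the peak at the end of warmup and then decays smoothly, so shorter runs (fewer rows or epochs) simply compress the same shape.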
5. Performance Evaluation and Benchmark Comparisons
OT-Agent (Qwen3-32B, 100K) achieves a mean accuracy of 44.8% over seven agentic benchmarks, exceeding the Nemotron-Terminal-32B (100K) baseline at 40.9% by 3.9 percentage points. On SWE-Bench-Verified-500, OT-Agent attains 54.0% versus 41.9% for Nemotron-Terminal. On Terminal-Bench 2.0, OT-Agent reaches 26.2% versus 25.1% for Nemotron-Terminal.
Performance scaling curves indicate that OT-Agent outperforms Nemotron-Terminal and SERA at every data size evaluated. OT-Agent's scaling is monotonic and non-saturating up to 100K, whereas Nemotron-Terminal-32B saturates at a lower mean accuracy.
Comparative aggregate results:
| Model + Data (32B) | Approx. Dataset Size | Avg Accuracy (%) |
|---|---|---|
| OT-Agent Qwen3-32B | 100K | 44.8 |
| Nemotron-Terminal-32B | 100K | 40.9 |
| SWE-Smith Qwen3-32B | ≈264K | 34.7 |
| SERA | ≤47K | 28.1 |
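As a quick check of the margins implied by the table, the gaps can be computed directly from the listed averages. The dictionary below copies the table values; the helper name `gaps_vs` is an illustrative assumption.

```python
avg_accuracy = {
    "OT-Agent Qwen3-32B": 44.8,
    "Nemotron-Terminal-32B": 40.9,
    "SWE-Smith Qwen3-32B": 34.7,
    "SERA": 28.1,
}


def gaps_vs(reference, table):
    """Percentage-point gap between the reference row and every other row."""
    ref = table[reference]
    return {name: round(ref - acc, 1)
            for name, acc in table.items() if name != reference}


gaps = gaps_vs("OT-Agent Qwen3-32B", avg_accuracy)
```

The 3.9-point gap to Nemotron-Terminal-32B is obtained at a matched dataset size of 100K, while the other rows differ in dataset size, so their gaps are not size-controlled comparisons.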
OT-Agent also exhibits better generalization on OOD benchmarks, with scaling advantages confirmed at both 8B and 32B parameter regimes.
6. Comparative Analysis with Prior Open Agentic SFT Efforts
Preceding initiatives such as SWE-Smith, SERA, and Nemotron-Terminal have primarily targeted isolated benchmarks and have not systematically addressed generalization across diverse agentic tasks. OT-Agent’s six-stage pipeline—particularly its emphasis on varied task sourcing, comprehensive filtering, and episode trajectory engineering—enables superior benchmark performance and scaling generalization. Notably, the normalization and diversity of data sources, coupled with synthetic augmentation, help bypass plateau effects observed in previous datasets.
A plausible implication is that agentic SFT generalization is most effectively enhanced by maximizing prompt, task, and execution-trace diversity, rather than simply increasing dataset size or relying on a single benchmark’s data distribution.
7. Limitations and Prospective Research Directions
OT-Agent's methodology currently imposes several constraints. RL augmentation, such as “pymethods2test,” has only been tested at 8B scale; scalability to larger models like 32B remains unexplored. All experiments employ the Qwen3 architecture as the base; transferability to architectures with divergent pretraining remains unexamined. Current datasets cap at 100K agentic trajectories; large-scale extension to millions of data points is untested. Future work is anticipated to involve porting the pipeline to Qwen3.5, investigating base model-data interactions, and upscaling RL methods to substantial model sizes.
All datasets, pipelines, experimental logs, and model checkpoints are released publicly at https://openthoughts.ai, enabling independent evaluation and reproducibility (Raoof et al., 23 Jun 2026).