
OpenThoughts-Agent: Data Recipes for Agentic Models

Published 23 Jun 2026 in cs.AI | (2606.24855v1)

Abstract: Agentic LLMs dramatically expand the applications of AI yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at openthoughts.ai to support future open research on agentic model training.

Summary

  • The paper presents a reproducible agentic data curation pipeline that significantly improves model performance by optimizing task sourcing, mixing, and trajectory filtering.
  • It demonstrates that careful teacher model selection and LLM-driven difficulty filtering are critical for achieving robust supervised fine-tuning outcomes.
  • The integration of RL post-training with synthetic task augmentation further boosts agent performance, highlighting the complementary roles of SFT and RL.

Motivation and Problem Statement

Agentic LLMs have rapidly evolved beyond static QA, enabling multi-step problem-solving and tool-integrated operations in real-world environments. However, best practices for curating training data that induce robust agentic behaviors remain poorly characterized in the literature. Prior open efforts tend to specialize for a single benchmark, failing to generalize across the heterogeneous task distribution encountered by practical agents. The OpenThoughts-Agent (OT-Agent) project directly addresses this gap by introducing a fully open, reproducible data curation pipeline for both SFT and RL post-training of agentic models, accompanied by extensive controlled ablation experiments.

Pipeline Ablation and Empirical Insights

OT-Agent systematically ablates six core stages in the SFT data pipeline: task sourcing, task mixing, description augmentation, task filtering, teacher selection, and agentic trajectory filtering. More than 100 dataset variants are evaluated for downstream impact using fine-tuning protocols on Qwen3-8B and Qwen3-32B, measured against multiple agentic benchmarks (including SWE-Bench Verified, Terminal-Bench 2.0, Aider Polyglot, BFCL-Parity, GAIA-127, MedAgentBench, and FinanceAgent-Terminal).

Key empirical results are:

  • Task sourcing strategy produces the largest variation: up to a 30pp accuracy difference on SWE-Bench Verified and 10pp on Terminal-Bench 2.0. Synthetic issue-resolution tasks (SWE-Smith, Issue Tasks) and human-generated infrastructure questions (StackExchange SuperUser/Tezos) perform best.
  • Task mixing mitigates over-specialization: including top-4 to top-8 task sources yields optimal balance across benchmarks, outperforming single-source datasets.
  • LLM-driven difficulty-based filtering (selecting tasks requiring longer GPT-5 responses) reliably lifts benchmark performance (≈+3pp) relative to random subsampling.
  • Task description augmentation (via constraint/hardening or synthetic rewriting) yields negligible improvement, suggesting that high-quality source diversity trumps superficial augmentation.
  • Teacher quality is not monotonic in model capability: GLM-4.7-AWQ outperforms more advanced models (like GPT-5.3-Codex and Kimi K2.5) as a teacher, indicating that teacher-student alignment and harness compatibility are critical.
  • Agentic trajectory filtering based on minimum turns (≥5) increases downstream accuracy, even under fixed token budgets, confirming that multi-turn supervision drives capability gains.
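
The two filters in the list above (response-length task selection and minimum-turn trajectory filtering) can be sketched as follows. This is a minimal illustration, not the released pipeline: the `Task` and `Trajectory` records, their field names, and the greedy token-budget cap are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    # Hypothetical field: token length of a reference LLM's answer to the task,
    # used here as the difficulty proxy.
    reference_response_tokens: int


@dataclass
class Trajectory:
    task_id: str
    num_turns: int
    num_tokens: int


def select_by_response_length(tasks, k):
    """Keep the k tasks whose reference-LLM responses were longest."""
    ranked = sorted(tasks, key=lambda t: t.reference_response_tokens, reverse=True)
    return ranked[:k]


def select_random(tasks, k, seed=0):
    """Random-subsampling baseline that the length filter is compared against."""
    return random.Random(seed).sample(tasks, k)


def filter_min_turns(trajectories, min_turns=5, token_budget=None):
    """Keep trajectories with >= min_turns turns, stopping once the budget is hit."""
    kept, used = [], 0
    for traj in trajectories:
        if traj.num_turns < min_turns:
            continue
        if token_budget is not None and used + traj.num_tokens > token_budget:
            break
        kept.append(traj)
        used += traj.num_tokens
    return kept


# Toy data, invented for illustration only.
tasks = [Task(f"t{i}", n) for i, n in enumerate([120, 900, 450, 60, 700])]
hard_tasks = select_by_response_length(tasks, k=2)
trajs = [
    Trajectory("t1", num_turns=8, num_tokens=4000),
    Trajectory("t2", num_turns=3, num_tokens=1500),
    Trajectory("t4", num_turns=6, num_tokens=3000),
]
kept = filter_min_turns(trajs, min_turns=5, token_budget=5000)
print([t.task_id for t in hard_tasks], [t.task_id for t in kept])
```

Matching the token budget between the filtered and unfiltered datasets is what lets a comparison attribute any gain to trace depth rather than to extra training tokens.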

Scaling Laws and Data Composition

The data pipeline scales favorably. At 32B, performance continues to rise up to 100K datapoints, but it plateaus if the dataset is grown only by upsampling rollouts per task. Synthetic task augmentation (especially for sources with few unique tasks, e.g., Tezos) instead enables monotonic scaling by bypassing the diversity bottleneck. Adding sources beyond the top 4 does not improve performance, confirming diminishing returns from over-broad source aggregation.
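
One way to make the distinction between the two scaling routes concrete is to track unique tasks separately from row count. The sketch below is illustrative only; the `composition` helper and the toy task IDs are assumptions, not part of the released tooling.

```python
from collections import Counter


def composition(task_ids):
    """Summarize a dataset given the task ID attached to each training row."""
    counts = Counter(task_ids)
    return {
        "rows": len(task_ids),
        "unique_tasks": len(counts),
        "mean_rollouts_per_task": len(task_ids) / len(counts),
    }


# Two toy datasets with the same number of rows:
# one grown by repeating rollouts on 2 tasks, one grown by adding new tasks.
upsampled = ["t1"] * 4 + ["t2"] * 4
diverse = [f"t{i}" for i in range(8)]

print(composition(upsampled))
print(composition(diverse))
```

Both datasets have eight rows, but only the second adds task diversity, which is the quantity the paragraph above identifies as the bottleneck.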

OpenThoughts-Agent-v2 (100K traces from top 4 sources with synthetic augmentation and deep trajectory filtering) yields Qwen3-32B agents that achieve:

  • 54.0% SWE-Bench Verified
  • 26.2% Terminal-Bench 2.0
  • 44.8% average across seven core benchmarks

These results exceed Nemotron-Terminal-32B (40.9% average), establishing the strongest open-data recipe for Qwen3-32B or earlier models. Scaling at 8B follows similar trends: OT-Agent improves monotonically up to 100K rows, outpacing Nemotron-Terminal-Corpus at every scale.
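
The gap to the baseline is quoted in percentage points (pp), which is not the same as a relative improvement. A quick check of the arithmetic, using only the two averages quoted above:

```python
ot_agent_avg = 44.8   # OT-Agent Qwen3-32B average across seven benchmarks (%)
nemotron_avg = 40.9   # Nemotron-Terminal-32B average (%)

gain_pp = ot_agent_avg - nemotron_avg            # absolute gap in percentage points
relative_gain_pct = 100 * gain_pp / nemotron_avg  # gap relative to the baseline

print(f"absolute gain: {gain_pp:.1f} pp")
print(f"relative gain: {relative_gain_pct:.1f}%")
```

This prints an absolute gain of 3.9 pp and a relative gain of about 9.5%.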

RL Data Curation and Behavioral Effects

The RL post-training pipeline undergoes analogous data-source ablations, using the RLOO algorithm on Qwen3-8B with 24×A100 infrastructure. The strongest RL source is pymethods2test, a competitive programming dataset recast as Python contracts with concise, moderate-difficulty tasks. RL from this source induces substantial behavioral expansion: more tool calls, longer traces, more reasoning, and more self-correction, without reward hacking. This is confirmed through pairwise LLM judge analysis and time-binned behavioral metrics.
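
For readers unfamiliar with RLOO (REINFORCE Leave-One-Out), its core step is the baseline: each sampled rollout is scored against the mean reward of the other rollouts for the same task. The sketch below shows only that advantage computation with binary pass/fail rewards. The group size of four and the function name are assumptions for illustration, and the paper's actual training setup may differ in details.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k rollouts of the same task.

    advantage_i = r_i - mean(r_j for j != i)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two rollouts per task")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


# Binary rewards (1 = tests passed, 0 = failed) for four rollouts of one task.
mixed = rloo_advantages([1, 0, 0, 1])
all_pass = rloo_advantages([1, 1, 1, 1])
print(mixed)
print(all_pass)
```

When every rollout for a task receives the same reward, all advantages are zero and the task contributes no gradient signal, which is one reason moderate-difficulty tasks tend to be useful for this kind of training.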

In contrast, RL runs on heterogeneous tool-use datasets (e.g., LLM-verifier-freelancer) show compaction (fewer tool calls, shorter traces), suggesting that the reward signal a data source provides shapes how much the policy explores. Combining moderately strong SFT with RL (SFT+RL) on OT-Agent-ColdSFT yields maximum average accuracy at 8B, outperforming RL-only and pure SFT models by up to 18pp.

Implications, Limitations, and Future Directions

OT-Agent demonstrates that open, reproducible agentic data curation pipelines can deliver state-of-the-art model performance over a broad suite of benchmarks. The findings provide strong evidence that:

  • Task source diversity and multi-turn trajectory filtering are the main levers for improving agentic behavior.
  • Teacher model selection must be tuned to the harness/environment, and advanced models are not universally better as teachers.
  • RL post-training complements SFT; the reward landscape of RL data sources dictates exploration versus compaction in emergent agentic behaviors.

From a practical perspective, these insights enable construction of robust, generalist agents for complex tasks spanning coding, terminal operations, financial analysis, and healthcare. The theoretical implication is that scaling laws for agentic models depend heavily on data composition and supervision depth, not just data volume or base-model size.

Limitations include lack of ablation on base model pretraining, compute-constrained RL experiments at the 8B scale, and untested extrapolation to multi-million trajectory regimes. The open release of datasets, code, and models at openthoughts.ai sets the stage for further open research.

Future directions include:

  • Translating these recipes to more advanced base models (e.g., Qwen3.5) and larger parameter scales.
  • Joint optimization of SFT and RL data pipelines for richer agentic capabilities.
  • Automated task sourcing and trajectory filtering using meta-learning and active evaluation.
  • Deeper exploration of the harness-teacher interplay and reward shaping for long-horizon tasks.

Conclusion

OpenThoughts-Agent establishes a reproducible, quantitatively substantiated pipeline for agentic SFT and RL data curation. Through extensive ablation, synthetic scaling, and reward analysis, the project achieves best-in-class accuracy for open-data 32B models across multiple agentic benchmarks and demonstrates that RL can further amplify agentic policy capabilities at smaller scales. The released artifacts will facilitate community-driven progress in agentic AI, with future research poised to extend these recipes to new architectures, datasets, and domains (2606.24855).

Explain it Like I'm 14

Explaining “OpenThoughts-Agent: Data Recipes for Agentic Models”

What this paper is about

This paper is about building better “agentic” AI models—AIs that don’t just chat, but can actually use a computer: open a terminal, run commands, edit files, fix code, and complete complicated multi-step tasks. The authors focus on a simple question with a big impact: what kind of training data helps these computer-using AIs learn best? They design and test a complete “data recipe” (pipeline) for training such agents and share everything openly so others can learn and build on it.

The big questions they asked

The researchers set out to answer, in plain terms:

  • Which kinds of tasks should we collect for training an AI agent that needs to work across many different jobs (not just one test)?
  • How should we mix and filter those tasks to get the best results?
  • Which “teacher” AI should generate step-by-step examples the student model will learn from?
  • Do longer “how-to” examples help more than shorter ones?
  • When we scale up the dataset, what actually improves performance: adding more of the same, or adding more variety?
  • How does reinforcement learning (learning from trial and error with rewards) fit together with supervised fine-tuning (learning from worked examples)?

How they did it (in everyday terms)

Think of training an AI agent like training a team player for a series of tournaments:

  • Supervised Fine-Tuning (SFT) is like studying solved practice problems with full solutions. The AI reads the “play-by-play” of how a teacher model solved different tasks.
  • Reinforcement Learning (RL) is like scrimmage: the AI tries things on its own and gets points for success or failure, then adjusts.

They built a six-step “data kitchen” (pipeline) for SFT:

  1. Sourcing tasks: Collecting lots of different computer tasks (like fixing software bugs, using the terminal, or answering tech questions).
  2. Mixing tasks: Deciding how many tasks to take from each source so the training set isn’t too narrow.
  3. Augmenting tasks: Trying to rewrite or “harden” task descriptions to see if that helps (it mostly didn’t).
  4. Filtering tasks: Using signals from a strong model to keep tasks that seem more meaningful or challenging.
  5. Choosing a teacher: Picking an AI to produce step-by-step solutions (trajectories) for each task.
  6. Filtering rollouts: Keeping the solutions that show more back-and-forth steps (more “turns”), which contain richer guidance.
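
Step 2 (mixing) can be sketched as picking the best-scoring sources and drawing tasks from each. This is a rough illustration under stated assumptions: the equal-share split, the source names, and the scores are invented for the example, and the text above does not specify the exact mixing proportions.

```python
import random


def mix_top_n_sources(tasks_by_source, source_scores, n, total, seed=0):
    """Pick the n best-scoring sources and sample an equal share of tasks from each.

    Equal shares are an assumption of this sketch, not a documented choice.
    """
    top = sorted(source_scores, key=source_scores.get, reverse=True)[:n]
    per_source = total // n
    rng = random.Random(seed)
    mixed = []
    for name in top:
        pool = tasks_by_source[name]
        take = min(per_source, len(pool))  # small sources contribute all they have
        mixed.extend((name, task) for task in rng.sample(pool, take))
    return top, mixed


# Hypothetical sources and ablation scores, for illustration only.
tasks_by_source = {
    "source_a": list(range(10)),
    "source_b": list(range(3)),
    "source_c": list(range(10)),
}
source_scores = {"source_a": 0.30, "source_b": 0.25, "source_c": 0.10}

top, mixed = mix_top_n_sources(tasks_by_source, source_scores, n=2, total=8)
print(top, len(mixed))
```

Here `source_b` has only three tasks, so the mix falls short of the requested total. That shortfall is the kind of gap that synthetic augmentation is meant to fill for small sources.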

They ran over 100 careful “ablation” experiments—changing one ingredient at a time—to see what really matters. Then they scaled up and trained models on datasets of 10K, 31.6K, and 100K examples. Finally, they tested the models on seven different “benchmarks” (standardized tests) that cover coding, terminal use, finance, healthcare, and general tasks.
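
The three dataset sizes look like evenly spaced steps on a logarithmic scale, each about 3.16 times the previous one. That reading is an inference from the numbers rather than something stated above, but it is easy to check:

```python
# Half-decade steps: 10^4, 10^4.5, 10^5
sizes = [round(10 ** exponent) for exponent in (4.0, 4.5, 5.0)]
print(sizes)  # [10000, 31623, 100000]
```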

Key terms explained:

  • Benchmark: A set of test tasks used to measure how good a model is—like different tournaments.
  • Trajectory: The full, step-by-step reasoning and actions an AI uses to solve a task—like a play-by-play record.
  • Turns: The back-and-forth steps the AI takes while working—more turns usually means more detailed thinking and tool use.
  • Teacher model: A stronger AI that provides example solutions for the student model to learn from—like a coach demonstrating drills.
  • Sandbox/harness: A safe, controlled computer environment where the agent runs commands without causing harm—like a practice field.

What they found

Here are the most important results, written simply:

  • Choosing the right task sources matters a lot
    • Some sources (like software bug-fixing tasks and real tech Q&A from places like StackExchange) helped much more than others.
    • Mixing a few top sources (about 4 to 8) worked better than just using the single best source, because it keeps the training balanced.
  • Strongest model isn’t always the best teacher
    • Surprisingly, the teacher that produced the most helpful step-by-step examples wasn’t the overall top-scoring model on benchmarks.
    • A model called GLM-4.7 (quantized) gave better training examples than a newer, stronger model in this setup.
  • Longer solution traces help
    • Keeping training examples with more turns (richer, multi-step guidance) improved results across benchmarks.
  • Task rewriting didn’t beat the original
    • Fancy ways to rewrite or “harden” task descriptions didn’t reliably help; the plain original tasks worked best.
  • Smart filtering helps
    • Using another model’s “this looks hard/long” signal to pick tasks improved results by about 3 percentage points over random selection.
  • Scaling the dataset the right way matters
    • Just adding more solutions to the same tasks plateaued.
    • Adding more task variety—especially via careful synthetic augmentation when sources were small—kept improving performance.
  • Supervised + Reinforcement learning can stack
    • At the 8B model size, doing SFT first and then RL (with a well-chosen RL dataset) beat both SFT-only and RL-only approaches.

How much better did it get?

  • Their 32B model (fine-tuned on 100K examples) reached an average accuracy of 44.8% across seven benchmarks, beating a strong open-data baseline (40.9%).
  • It scored 54.0% on SWE-Bench Verified (fixing real bugs) and 26.2% on Terminal-Bench 2.0 (general terminal use), outperforming earlier open-data models at the same size.
  • At the 8B size, their two-stage SFT+RL pipeline also improved average performance over other 8B baselines.

Why this is important:

  • It shows that the “recipe” for data—what tasks you pick, how you filter them, who “teaches,” and how long the examples are—can be more important than simply training bigger or longer.
  • It provides an open, reproducible way to build broadly capable agents rather than ones that only do well on a single test.

Why this matters

If you want AI agents that can handle real-world computer work—debugging code, configuring systems, writing scripts, and researching—then you need the right training data. This paper gives a tested, open-source recipe for curating that data and shows it actually scales and generalizes across different kinds of tasks. It helps the whole research community move faster because:

  • The datasets, pipeline, and models are publicly released.
  • The experiments reveal which choices truly change performance.
  • It encourages building agents that work well across many challenges, not just one.

A few caveats

  • The RL part was tested only on 8B models due to compute limits; it’s not yet confirmed how well it scales to larger models like 32B.
  • All SFT runs used the Qwen3 family as the base model, so results may differ with other base models.
  • The largest dataset here is 100K trajectories; future work will need to test much larger scales.

Takeaway

To train better computer-using AI agents, it’s not just about more data—it’s about better-chosen, better-mixed, and better-filtered data, with the right kind of teacher examples and multi-step traces. This paper delivers an open, tested approach that improves agents across many benchmarks, and it shares the tools so others can keep improving on it. You can find the released data, pipeline, and models at openthoughts.ai.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of the paper’s unresolved issues and open problems to guide future research:

  • Scaling beyond 100K trajectories
    • Do the observed data curation trends (e.g., source selection, trace-length filters, augmentation) hold at million-scale SFT datasets, and what are the compute–data trade-offs at that regime?
  • Base model dependence
    • How do the data recipes transfer to other base families (e.g., Qwen3.5, Llama, Mistral, or proprietary bases), and what interactions exist between pretraining corpus, base capabilities, and SFT/RL data design?
  • RL at larger scales
    • Does the 8B RL recipe (RLOO, binary test-based rewards, async training) transfer to 32B+ models, and how do stability, sample efficiency, and performance scale with parameter count?
  • RL algorithm and reward design
    • How do alternatives to RLOO (e.g., PPO, GRPO, DPO-like objectives for agents, off-policy methods) and richer rewards (partial credit, dense tool-usage shaping, runtime/latency costs) affect performance and generalization?
  • SFT–RL interaction and catastrophic forgetting
    • RL improved averages but degraded some core scores (e.g., SWE-Bench vs SFT-only). How can we design multi-objective SFT+RL curricula or constraints to preserve strengths while gaining RL benefits?
  • Teacher selection mechanics
    • Why can a stronger model be a worse teacher (e.g., GPT-5.3-Codex underperforming GLM-4.7-AWQ)? What teacher properties (verbosity, exploration style, tool usage patterns) predict better distillation outcomes?
  • Teacher mixture and ensemble distillation
    • Would mixing multiple teachers (or switching teachers by domain/task) improve dataset quality versus a single-teacher pipeline?
  • Verification of rollout correctness
    • The SFT pipeline filters by length/turns/timeouts but does not report filtering by task success. How do success-validated traces versus “all traces” impact downstream performance and overfitting to failure modes?
  • Trace structure beyond length
    • Which fine-grained structural features of trajectories (e.g., planning steps, tool–think interleaving, error-recovery patterns) matter most, independent of token count?
  • Chain-of-thought/thinking token handling
    • When and how should “thinking” or intermediate reasoning be retained, masked, or transformed in agentic traces to optimize SFT learning and avoid exposing private CoT while preserving utility?
  • Task augmentation efficacy
    • LLM-driven augmentation strategies showed no reliable gains in ablations. Which augmentation families (e.g., programmatic mutation, adversarial/hard-negative synthesis, environment parameterization) actually improve robustness?
  • Diversity vs. specialization at scale
    • At 100K scale, adding sources beyond Top-4 gave negligible or negative returns. Is this due to redundancy, domain misalignment, or mixing strategy? How can we measure and enforce “useful” diversity?
  • Difficulty and selection signals
    • Response-length (from a closed model) was an effective selection signal. Are there open, reproducible alternatives (e.g., open-model perplexity, solver confidence, verifier-based difficulty) that correlate better with true task hardness?
  • Closed-model dependence and reproducibility
    • The pipeline uses proprietary models (e.g., GPT-5 variants) for filtering and teacher ablations. What open-model surrogates can reproduce these gains, and how sensitive are results to specific proprietary APIs?
  • Harness and environment sensitivity
    • Results are largely reported in the terminus-2 harness and Daytona sandboxes. How portable are gains across harnesses (e.g., SWE-agent, Harbor variations), OS distributions, package managers, shell variants, and resource constraints?
  • Evaluation breadth and OOD gaps
    • Despite strong averages, performance lagged on MedAgentBench relative to some baselines. What domain gaps in the SFT/RL data cause this, and how can healthcare/biomed tool-use be incorporated without harming core SWE performance?
  • Generalization to non-terminal agents
    • How do data recipes extend to web agents, API-based tool ecosystems, GUI agents, or embodied/robotic settings where environment dynamics and observability differ from terminal workflows?
  • Deduplication and contamination controls
    • The paper does not detail rigorous dedup/collision procedures across training sources and test tasks. What is the residual overlap with benchmarks (e.g., SWE-Bench variants), and how does dedup impact results?
  • Long-context training ablations
    • The setup uses long-sequence training (ALST) but does not ablate context length, windowing, or retrieval strategies. How do these choices affect agent performance and trace learning efficiency?
  • Token/latency efficiency
    • Longer traces help, but increase training and inference costs. What data or loss-level techniques (e.g., segment prioritization, contrastive trace ranking, compression) preserve benefits while reducing token budgets and latency?
  • RL data source mixing and curricula
    • RL source ablations varied single sources but not mixtures or staged curricula. Can staged RL regimes that blend single-function code tasks with multi-file/tool-use tasks improve both ID and OOD without trade-offs?
  • Stability and run-to-run variance
    • With n=3 evaluation runs and ~2-point variance noted for RL, what strategies (e.g., seeds, ensembling, bootstrapping) or statistical protocols are needed to robustly compare data recipes?
  • Safety and misuse risk in agent data
    • The dataset includes security/system tasks but does not analyze safety-specific failure modes, exploitability, or harmful emergent behaviors. How can we curate “safe-by-design” agent traces and benchmarks?
  • Licensing and legal constraints
    • The paper does not analyze data licensing constraints (e.g., StackExchange content, synthetic issues from public repos) and their implications for redistribution and downstream use.
  • Multi-agent and subagent traces
    • Subagent traces were filtered out, but the potential benefits of multi-agent decomposition (planner–executor patterns, reviewer loops) remain unexplored for training more capable agents.
  • Curriculum and difficulty pacing in SFT
    • Task mixing used Top-N selections but did not explore curricula (easy-to-hard or capability-targeted pacing). Can curricula improve data efficiency and OOD robustness?
  • Error taxonomy and qualitative diagnostics
    • The paper reports aggregate accuracies but provides limited analysis of failure types (e.g., tool mis-use vs reasoning vs environment setup). What error taxonomies and diagnostics can best inform next-round data curation?

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that can be put into practice now using the released datasets, models, and pipeline components at openthoughts.ai.

  • Software engineering copilots for terminals and repos
    • Sectors: Software, DevOps, IT, MLOps
    • What to deploy: Use OpenThinker-Agent-32B as an internal code/terminal assistant for bug fixing (SWE-Bench-style), repository maintenance, dependency updates, CI/CD triage, environment setup, and shell automation. Integrate with sandboxed environments (e.g., Daytona) and the Terminus-2 harness for safe tool use.
    • Why it’s supported by the paper: The 32B SFT model achieves 54.0% on SWE-Bench Verified and 26.2% on Terminal-Bench 2.0, outperforming prior open-data baselines at ≤32B.
    • Assumptions/dependencies: Access to GPU inference, sandboxing infra (Daytona-like), Qwen3-32B licensing, internal security review for agent execution.
  • Rapid, domain-specific agent dataset curation using the six-stage SFT pipeline
    • Sectors: Software, Healthcare IT, Finance IT, Education (CS labs), Enterprise IT
    • What to deploy: Replicate the six-step recipe (source tasks → mix → augment descriptions → filter tasks → select teacher → filter rollouts) to build targeted SFT datasets for your domain (e.g., EHR scripting tasks, finance terminal workflows). Apply the paper’s strongest choices: mix Top-4 sources, filter by LLM response-length signal, and keep traces with ≥5 turns.
    • Why it’s supported by the paper: >100 controlled ablations identify high-impact levers (task source choice, multi-source mixing, LLM difficulty filter, min-turns filtering).
    • Assumptions/dependencies: Access to a capable “difficulty signal” LLM for filtering (can be substituted), compute for rollout generation, data rights for any external sources (e.g., StackExchange).
  • Cost-sensitive agent improvement via RL on small models
    • Sectors: Startups, Edge/On-prem, Cost-constrained orgs
    • What to deploy: Improve an 8B model using the released RL dataset and recipe (RLOO, binary pass/fail rewards) with pymethods2test as the source. Start from a modest SFT checkpoint (“cold SFT”), then do RL to boost terminal/code performance at lower inference cost.
    • Why it’s supported by the paper: The SFT+RL 8B pipeline surpasses SFT-only 8B baselines on average across 7 benchmarks; data-source ablation shows pymethods2test gives best ID/OOD balance.
    • Assumptions/dependencies: A100-class GPUs (or equivalent), reproducible RL environment, alignment with the paper’s reward shaping and harness.
  • Evidence-based teacher selection for distillation
    • Sectors: AI labs, Enterprise ML teams
    • What to deploy: Choose a teacher by empirical downstream results rather than frontier capability. The paper finds GLM-4.7-AWQ outperforms stronger general models (e.g., GPT-5.3-Codex) as a teacher for these agent tasks.
    • Why it’s supported by the paper: Teacher ablations show “strongest model ≠ best teacher” for agent rollouts.
    • Assumptions/dependencies: Access to multiple candidate teachers, controlled evaluations with Terminus-2 to select the best teacher for your task distribution.
  • Synthetic task augmentation to break data plateaus
    • Sectors: AI labs, Internal tooling teams
    • What to deploy: When unique task descriptions are scarce, apply instruction rewriting to expand surface forms and maintain diversity. Use the response-length signal as an upsampling weight rather than a hard filter.
    • Why it’s supported by the paper: Upsampling rollouts per task plateaus, but synthetic augmentation continues to improve performance from 31.6K to 100K scale.
    • Assumptions/dependencies: Access to an LLM for rewriting; balanced augmentation to avoid overfitting to superficial patterns.
  • Data QA heuristics that correlate with agent quality
    • Sectors: AI labs, DataOps for LLMs
    • What to deploy: Integrate two simple gates into your data pipeline: (1) select tasks that elicit longer responses from a difficulty-signal LLM, and (2) retain agent rollouts with ≥5 turns.
    • Why it’s supported by the paper: +~3 pp average from response-length filtering and consistent gains from min-turn filtering at matched token budgets.
    • Assumptions/dependencies: Access to any capable LLM for difficulty estimation; careful token-budget matching to isolate quality vs. compute.
  • Reproducible agent evaluation inside sandboxes
    • Sectors: Software, Cybersecurity, Public sector IT
    • What to deploy: Standardize on Terminus-2 and Daytona (or equivalents) for isolation and reproducibility. Use OT-TBLite (100 tasks) as a fast proxy for Terminal-Bench 2.0 to iterate quickly before running the full suite.
    • Why it’s supported by the paper: The entire experimental stack runs in isolated sandboxes with consistent harnesses; OT-TBLite is validated as a fast proxy.
    • Assumptions/dependencies: Sandbox orchestration, budget for repeated eval runs (n=3), CI integration.
  • Academic teaching modules and replicable lab assignments
    • Sectors: Academia, Workforce training
    • What to deploy: Course units that reconstruct the six-stage pipeline; student projects on task sourcing, teacher selection, and filtering; lab use of OT-Agent datasets/models for reproducibility.
    • Why it’s supported by the paper: Fully open release of data, pipeline, and models, plus controlled ablations suited for instruction.
    • Assumptions/dependencies: Institutional compute or cloud credits; instructor familiarity with harnesses and sandboxing.
  • Procurement and audit checklists for agentic systems
    • Sectors: Policy, Public procurement, Regulated industries
    • What to deploy: Require vendors to disclose task sources, teacher choice rationale, filtering rules (e.g., min-turn trace filter), sandboxing practices, and benchmark results under standard harnesses.
    • Why it’s supported by the paper: Demonstrates that training data recipes materially alter capability; auditability is necessary for reliability and safety.
    • Assumptions/dependencies: Organizational policy alignment; availability of independent evaluation infrastructure.
  • Power-user terminal assistants for everyday automation
    • Sectors: Daily life, SME IT
    • What to deploy: Use released models for safe shell tasks in local containers: environment setup scripts, log parsing, one-off utilities, and repository patch suggestions, with human-in-the-loop review.
    • Why it’s supported by the paper: Strong terminal/coding performance in sandboxed environments and generalization to OOD (e.g., FinanceAgent-Terminal).
    • Assumptions/dependencies: Local containerization, user proficiency to supervise and approve agent actions.

Long-Term Applications

These use cases are viable directions contingent on further research, scaling, or ecosystem development.

  • Enterprise “Agent DataOps” platforms
    • Sectors: Software, Platform providers
    • What to build: Managed systems that implement the six-stage pipeline end-to-end (task sourcing, mixing, augmentation, rollout generation, multi-turn filtering, teacher orchestration) with dashboards, token/compute accounting, and continual evaluation.
    • Dependencies: Robust data rights, scalable sandboxing, automated teacher/routing selection, and multimillion-trajectory pipelines beyond the paper’s 100K scale.
  • Cross-domain agent training beyond code/terminal
    • Sectors: Robotics, IoT, Scientific computing, Healthcare ops
    • What to build: Domain-specific harnesses (robotics simulators, lab-automation shells, EHR-safe tool layers) to extend the data-recipe approach outside software engineering.
    • Dependencies: High-fidelity, safe environments; domain-specific reward verifiers; privacy/compliance adapters (HIPAA, GDPR).
  • RL at larger scales and with richer rewards
    • Sectors: AI labs, Model providers
    • What to build: SFT+RL recipes for 32B+ models, richer reward models (coverage, latency, safety), and online curriculum learning that switches data sources as policies improve.
    • Dependencies: Significant compute; stability research for long-horizon RL; automated reward verification infra; confirmation that 8B findings transfer to larger scales.
  • Standardized certification and safety governance for agentic AI
    • Sectors: Policy, Regulators, Critical infrastructure
    • What to build: Certification programs requiring reproducible data recipes, sandbox-first execution, and benchmark disclosures (Core and OOD). Develop incident reporting and rollback standards for agent malfunctions.
    • Dependencies: Consensus benchmarks/harnesses, third-party auditors, legal frameworks for agent execution and accountability.
  • Privacy-preserving and federated agent data curation
    • Sectors: Healthcare, Finance, Enterprise IT
    • What to build: Federated pipelines that apply the paper’s data recipes on sensitive, non-exportable corpora, with secure rollout generation and difficulty estimation on-prem.
    • Dependencies: Differential privacy or secure enclaves, local teacher availability, federated evaluation procedures.
  • Automated teacher orchestration and data-source routing
    • Sectors: AI tooling
    • What to build: Systems that learn which teacher/model produces the most useful rollouts for a given task family and dynamically route tasks; adaptive Top-N mixing that optimizes generalization vs. specialization online.
    • Dependencies: Real-time meta-evaluation, cost/latency trade-off managers, and robust uncertainty estimates.
  • AgentOps for CI/CD and production reliability
    • Sectors: Software, DevOps
    • What to build: “Agent-in-the-loop” CI systems that use the dataset and recipes to train agents to propose patches, explain failures, and follow runbooks with guardrails (policy checks, code owners, sandbox constraints).
    • Dependencies: Strong policy engines, secure secret handling, human approval flows, and rollback strategies.
  • Domain-aligned educational copilots and auto-tutors
    • Sectors: Education
    • What to build: Auto-tutors that present escalating terminal/coding tasks (curriculum from task-source mixing), give stepwise hints, and log multi-turn traces for formative assessment.
    • Dependencies: Age/skill-appropriate safety constraints, plagiarism mitigation, and instructor tooling for oversight.
  • Finance and healthcare research assistants with tight controls
    • Sectors: Finance, Healthcare research
    • What to build: Sandboxed analyst terminals (for filings, market data) and lab-terminal assistants (pipeline orchestration, data parsing) trained with domain-specific task sources and the paper’s filters.
    • Dependencies: Data-licensing, compliant logging/auditing, domain reward verifiers, alignment with risk and safety policies.

Notes on assumptions and dependencies (common across many applications)

  • Compute and infrastructure: Training and RL steps require substantial GPU resources; deployment benefits from containerized sandboxing (Daytona-like) and standardized harnesses (Terminus-2).
  • Model/base dependencies: Results reported for Qwen3 family; porting to other bases (e.g., Qwen3.5) needs validation.
  • Data rights and licensing: Task sources (e.g., StackExchange) and synthetic augmentations must respect licensing and terms of use.
  • Difficulty-signal LLM: Response-length filtering assumes access to an LLM whose output length serves as a proxy for task difficulty (the paper uses gpt-5-nano); a local substitute can be used if its signal is validated first.
  • Safety and oversight: Agentic systems are dual-use; production deployments should enforce sandbox isolation, least-privilege access, human-in-the-loop approvals, and comprehensive logging.
  • External validity: RL results were validated at 8B scale; generalization to larger models or non-terminal domains requires further research.
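
The difficulty-signal note above can be illustrated with a minimal sketch. It assumes each task already has a token count for a small LLM's response, and it assumes sampling weights directly proportional to that count. The paper's exact weighting rule is not given in this summary, so the function names `upsampling_weights` and `sample_tasks`, the proportional rule, and the example token counts are all hypothetical.

```python
import random


def upsampling_weights(response_lengths):
    """Map task id -> sampling weight proportional to response length.

    The proportional rule is an assumption for illustration only.
    """
    total = sum(response_lengths.values())
    if total <= 0:
        raise ValueError("response lengths must sum to a positive value")
    return {task: length / total for task, length in response_lengths.items()}


def sample_tasks(weights, k, seed=0):
    """Draw k task ids with replacement according to the given weights."""
    rng = random.Random(seed)
    tasks = list(weights)
    return rng.choices(tasks, weights=[weights[t] for t in tasks], k=k)


# Hypothetical token counts from the difficulty-signal LLM.
lengths = {"task_a": 200, "task_b": 600, "task_c": 1200}
weights = upsampling_weights(lengths)
batch = sample_tasks(weights, k=10)
```

Tasks that elicit longer responses are drawn more often, which is the sense in which the signal acts as an upsampling weight. A local substitute LLM would only change how `lengths` is produced.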

Glossary

  • Ablation experiments: Controlled studies that remove or vary components of a system to measure their impact on performance. "We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline"
  • Agent harness: A standardized evaluation interface that runs agents against tasks and measures outcomes. "using the terminus-2 [28] agent harness"
  • Agentic LLMs: Models designed to act autonomously, using tools and performing multi-step tasks over extended interactions. "Agentic LLMs have dramatically expanded the applications of AI."
  • Agentic rollouts: Full recorded trajectories of an agent’s actions and tool calls while solving a task, used for training or evaluation. "We generate agentic rollouts using GLM-4.7-AWQ and filter out traces with fewer than 5 turns."
  • ALST long-sequence training: A training technique enabling efficient optimization on very long token sequences. "extending it to support ALST long-sequence training [50, 4]."
  • AWQ: Activation-aware Weight Quantization; a post-training method that stores model weights at low precision (typically 4-bit) to cut memory use while preserving accuracy. "trajectories are generated by GLM-4.7-AWQ [41] acting as the teacher"
  • Compute-controlled comparisons: Evaluations where training or inference compute is matched to isolate data or method effects. "outperforming alternative open datasets at every training set size in compute-controlled comparisons."
  • Context length: The maximum number of tokens the model can process in a single sequence. "32,768 context length."
  • Cosine schedule: A learning-rate schedule that follows a cosine curve, typically decaying over training. "learning rate 4e-5 with cosine schedule"
  • Daytona sandboxes: Isolated execution environments used to safely run agent-generated code and tasks. "inside Daytona sandboxes"
  • Execution traces: Detailed logs of an agent’s step-by-step actions and tool interactions during task solving. "Filtering training data to retain the execution traces with more model turns improves the resulting training sets."
  • Full-parameter SFT: Supervised fine-tuning that updates all model parameters rather than a subset or adapters. "we finetune Qwen3-8B [45] on each dataset using full-parameter SFT"
  • Long-horizon: Tasks requiring coherent reasoning and action across many steps or turns. "reason coherently over a long horizon."
  • Out-of-distribution (OOD): Data or benchmarks that differ significantly from the training distribution, used to test generalization. "our pipeline experiments exclude a held-out set of out-of-distribution benchmarks"
  • Post-training: Additional training (e.g., SFT or RL) applied to a pretrained base model to specialize capabilities. "we focus on post-training data for supervised fine-tuning (SFT)"
  • Reinforcement learning (RL): Training paradigm where agents learn policies by receiving rewards from interactions with an environment. "we also study data curation for reinforcement learning (RL)."
  • Response-length signal: A heuristic using the number of tokens an LLM outputs as a proxy for task difficulty or informativeness. "We then use the gpt-5-nano response-length signal from Section 3.4 as upsampling weights"
  • RLOO algorithm: REINFORCE Leave-One-Out; a REINFORCE-style policy-gradient method that, for each sampled rollout, uses the average reward of the other rollouts for the same task as a variance-reducing baseline. "We run async RL using the RLOO algorithm [1]"
  • Scaling laws: Empirical relationships showing how model performance scales with data, parameters, or compute. "The advent of LLMs and scaling laws progressively made clear the large role data curation plays"
  • SkyRL framework: An open framework for training long-horizon agents with reinforcement learning. "Our reinforcement learning framework was an extended version of the popular SkyRL framework"
  • Standard error: The standard deviation of a sample estimate (such as a mean accuracy), used to indicate uncertainty in reported metrics. "with n = 3 stochastic re-runs per task and standard error reported across trials."
  • Stochastic re-runs: Multiple randomized evaluations of the same task to average out variability in agent behavior. "with n = 3 stochastic re-runs per task"
  • Supervised fine-tuning (SFT): Training a model to imitate correct outputs on labeled trajectories or examples. "we focus on post-training data for supervised fine-tuning (SFT)"
  • Teacher model: A higher-performing model that generates trajectories or guidance used to train a student model. "Prior work has shown that the choice of teacher can make a significant difference in downstream evals [16]."
  • Terminus-2 harness: A specific agent evaluation framework used to run and score tasks across benchmarks. "All evaluations run inside isolated Daytona sandboxes [9] using the terminus-2 [28] agent harness"
  • Trajectory: The sequence of states, actions, and outputs produced by an agent while solving a task. "create the best dataset of (task, trajectory) pairs for supervised finetuning"
  • Upsampling: Increasing the number of training examples from certain sources or tasks, often by duplicating or weighting. "Method 1 (upsampling additional rollouts per task description) plateaus from 31.6K to 100K"
  • Z-score: A standardized metric expressing how many standard deviations a value is from the mean, used to normalize results. "We compute the z-score of every candidate strategy's accuracy across the stage's full candidate set"
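
Several of the quantitative terms in the glossary can be made concrete with a short sketch. The functions below are textbook forms, not the paper's code. The function names, the use of the sample standard deviation, a cosine decay to zero with no warmup, and the trace schema (a dict with a `"turns"` list) are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def cosine_lr(step, total_steps, base_lr=4e-5):
    """Cosine schedule decaying from base_lr to 0 (no warmup assumed)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def standard_error(values):
    """Standard error of the mean: sample std divided by sqrt(n)."""
    values = list(values)
    return stdev(values) / math.sqrt(len(values))


def z_scores(accuracies):
    """Standardize each candidate's accuracy against the full candidate set."""
    vals = list(accuracies.values())
    mu, sigma = mean(vals), stdev(vals)
    return {name: (acc - mu) / sigma for name, acc in accuracies.items()}


def rloo_advantages(rewards):
    """Leave-one-out advantage: each reward minus the mean of the others.

    Requires at least two rollouts of the same task.
    """
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def filter_by_turns(traces, min_turns=5):
    """Keep traces with at least min_turns turns (hypothetical trace schema)."""
    return [t for t in traces if len(t["turns"]) >= min_turns]


# Hypothetical inputs for illustration.
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0])
scores = z_scores({"a": 0.3, "b": 0.4, "c": 0.5})
kept = filter_by_turns([{"turns": [0] * 3}, {"turns": [0] * 7}])
```

In `rloo_advantages`, the advantages of a group always sum to zero, so a rollout is only reinforced when it beats its siblings on the same task.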
