OpenThoughts-Agent: Data Recipes for Agentic Models

This presentation explores the OpenThoughts-Agent project, which introduces the first fully open and reproducible data curation pipeline for training agentic language models. Through systematic ablation of over 100 dataset variants across six core pipeline stages, the researchers identify key levers for robust agentic behavior: task source diversity, multi-turn trajectory filtering, and teacher-student alignment. The work demonstrates that careful data composition outweighs superficial augmentation, achieving state-of-the-art performance on seven benchmarks including 54% on SWE-Bench Verified, and shows how reinforcement learning complements supervised fine-tuning to amplify exploration and reasoning in compact models.
Script
Training agents that can navigate real-world tasks requires more than powerful base models. It demands the right training data, filtered and composed with precision. The OpenThoughts-Agent project reveals exactly which data recipes produce the most capable agentic models.
The researchers systematically ablated six stages of their data pipeline across more than 100 dataset variants. Task sourcing emerged as the dominant lever: depending on which repositories and issue types were selected, accuracy on SWE-Bench Verified shifted by up to 30 percentage points.
Diversity beats depth in unexpected ways. Mixing the top 4 to top 8 task sources produced the best generalization, while adding more sources yielded diminishing returns. High-quality synthetic issue-resolution tasks and human-generated infrastructure questions dominated performance, proving that source selection matters far more than superficial augmentation.
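As a rough illustration of the top-k mixing idea, the sketch below ranks task sources by an ablation score and builds a mixture from the best k, giving each an equal share of a fixed task budget. The source names, scores, and the equal-share rule are all hypothetical placeholders, not values or methods taken from the project.

```python
def mix_top_k(source_scores, source_tasks, k, total_tasks):
    """Rank sources by score and draw an equal share of tasks from the top k.

    source_scores: dict of source name -> ablation score (higher is better)
    source_tasks:  dict of source name -> list of task identifiers
    """
    # Keep the k best-scoring sources, best first.
    ranked = sorted(source_scores, key=source_scores.get, reverse=True)[:k]
    # Equal share per source is an assumption; other weightings are possible.
    share = total_tasks // k
    mixture = []
    for name in ranked:
        mixture.extend(source_tasks[name][:share])
    return ranked, mixture


# Placeholder scores and task pools, invented purely for this example.
scores = {"src_a": 0.31, "src_b": 0.12, "src_c": 0.27, "src_d": 0.05}
tasks = {name: [f"{name}-{i}" for i in range(10)] for name in scores}

ranked, mixture = mix_top_k(scores, tasks, k=2, total_tasks=6)
print(ranked)
print(mixture)
```

Sweeping k over a range such as 1 through 16 and retraining on each mixture is one way the diminishing returns described above could be measured.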
Multi-turn supervision unlocks deeper capabilities. Filtering for trajectories with at least 5 turns increased downstream accuracy even under fixed token budgets, confirming that agentic behavior emerges from interactive, multi-step problem-solving patterns rather than single-shot responses.
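The turn filter under a fixed token budget can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Trajectory` record with a list of turns and a token count; the project's actual schema and selection procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical schema: one list entry per agent/environment turn.
    task_id: str
    turns: list
    num_tokens: int


def filter_by_turns(trajectories, min_turns=5):
    """Keep only trajectories with at least `min_turns` turns."""
    return [t for t in trajectories if len(t.turns) >= min_turns]


def take_within_budget(trajectories, token_budget):
    """Greedily select trajectories, in order, without exceeding the budget."""
    selected, used = [], 0
    for t in trajectories:
        if used + t.num_tokens > token_budget:
            continue  # skip anything that would overflow the budget
        selected.append(t)
        used += t.num_tokens
    return selected


# Toy data, invented for this example.
data = [
    Trajectory("a", ["turn"] * 2, 300),
    Trajectory("b", ["turn"] * 6, 900),
    Trajectory("c", ["turn"] * 5, 700),
    Trajectory("d", ["turn"] * 9, 1200),
]

kept = take_within_budget(filter_by_turns(data, min_turns=5), token_budget=2000)
print([t.task_id for t in kept])
```

Holding `token_budget` constant while varying `min_turns` isolates the effect of trajectory length from the effect of simply training on more tokens.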
Reinforcement learning doesn't just improve accuracy; it reshapes agent behavior. When applied to competitive programming tasks recast as Python contracts, RL induced a marked behavioral expansion: more tool calls, longer reasoning traces, and more self-correction. Other data sources instead caused compaction, suggesting that the reward landscape of the training task dictates how agents explore.
By achieving 54% on SWE-Bench Verified and the strongest open-data results for 32-billion-parameter models across seven benchmarks, OpenThoughts-Agent shows that reproducible data pipelines can match proprietary approaches. The researchers have open-sourced everything at openthoughts.ai, where you can explore the datasets, ablation results, and trained models.