LLM-in-Sandbox Elicits General Agentic Intelligence

An overview of a method that grants LLMs access to a general-purpose code sandbox to solve non-code tasks, using a novel RL training strategy to improve tool use and exploration.
Script
What happens when you take a language model out of the chat window and give it full control of a virtual computer? This paper explores how granting Large Language Models access to a fully functional code sandbox allows them to solve complex tasks far beyond simple text generation.
Models typically struggle with tasks that impose strict constraints because they cannot verify their own outputs or manage long-term memory effectively. This research argues that pairing a model with a sandboxed operating system unlocks more general intelligence, letting agents manage files, run code, and install their own tools on the fly.
To bridge the gap between reasoning and action, the proposed sandbox provides three core capabilities: external access, file management, and code execution. Rather than relying on task-specific environments, the system uses a single lightweight Docker container where the model installs whatever domain tools it needs at runtime.
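To make the idea concrete, here is a minimal sketch of what such a sandbox interface might look like. The `Sandbox` class and its methods are hypothetical names, not the paper's actual API, and a temporary directory stands in for the Docker container's filesystem so the sketch runs anywhere.

```python
import pathlib
import subprocess
import tempfile


class Sandbox:
    """Hypothetical sketch of a general-purpose agent sandbox.

    The paper's system runs inside a single lightweight Docker container;
    here a temp directory stands in for the container filesystem.
    """

    def __init__(self):
        self.root = pathlib.Path(tempfile.mkdtemp(prefix="sandbox_"))

    # Capability 1: code execution. Arbitrary shell commands let the model
    # run scripts or fetch domain tools at runtime (e.g. `pip install sympy`).
    # External access (capability 3) would flow through the same channel
    # via networked commands like curl; it is omitted from this sketch.
    def run(self, cmd: str, timeout: int = 60) -> str:
        out = subprocess.run(
            cmd, shell=True, cwd=self.root,
            capture_output=True, text=True, timeout=timeout,
        )
        return out.stdout + out.stderr

    # Capability 2: file management, usable as persistent scratch memory.
    def write_file(self, name: str, text: str) -> None:
        (self.root / name).write_text(text)

    def read_file(self, name: str) -> str:
        return (self.root / name).read_text()


sb = Sandbox()
sb.write_file("notes.txt", "scratch memory for the agent")
print(sb.run("ls"))                # the model can inspect its workspace
print(sb.read_file("notes.txt"))   # and reload state across steps
```

Because all three capabilities reduce to shell access plus a filesystem, one container suffices for every task domain, which is the paper's argument against task-specific environments.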
However, weaker models often fail to use these capabilities, wandering aimlessly instead of invoking the tools. To fix this, the researchers place the task context inside sandbox files rather than in the prompt, forcing the model to actively explore the file system and learn efficient navigation strategies.
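The trick of moving context out of the prompt can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: a long context is written to disk, and the prompt carries only a pointer, so the model must use shell tools to find what it needs.

```python
import pathlib
import tempfile
import textwrap

# Stand-in for the sandbox filesystem (hypothetical setup).
workdir = pathlib.Path(tempfile.mkdtemp(prefix="task_"))

# A long task context that would otherwise bloat the prompt.
context = "\n".join(f"fact {i}: ..." for i in range(1000))
(workdir / "context.txt").write_text(context)

# The prompt now holds only a path, not the text itself, so the model is
# pushed to explore with tools like head, grep, and sed.
prompt = textwrap.dedent(f"""\
    Your task files live under {workdir}.
    Read context.txt with shell tools before answering.
    Question: what does fact 42 say?""")

print(len(prompt))   # stays small no matter how large the context grows
print(len(context))  # the bulk of the tokens stays on disk
```

The prompt length is now independent of context size, which is plausibly the same mechanism behind the token savings reported for long-context tasks.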
The results of this training strategy are significant, with strong models like Qwen3-Coder gaining over 24 points on math benchmarks. Beyond raw scores, the method is highly efficient, reducing token usage by up to 8 times for long contexts by offloading text to files.
This paper demonstrates that a general-purpose computer sandbox can serve as a universal tool for artificial intelligence, turning text generators into capable agents that produce real digital work. For a deeper dive into these methods, explore the full paper at EmergentMind.com.