OpenHands CodeAct Framework

Updated 20 May 2026

OpenHands CodeAct is a code-centric framework that operationalizes LLM agent interaction through direct Python code execution and self-debugging loops.
The architecture integrates modules for state perception, memory management, LLM interfacing, and code execution, facilitating autonomous tool use and iterative reasoning.
Empirical results across diverse benchmarks demonstrate CodeAct’s advantages in flexibility, safety, and performance, making it a robust solution for ML and scientific workflows.

OpenHands CodeAct is a code-centric agentic framework for LLM agents that operationalizes tool use, reasoning, and environment interaction via executable Python actions. Originating from Wang et al. (2024), CodeAct and its integration in the OpenHands agent platform represent a class of systems in which the LLM is not restricted to JSON or template-based tool schemas but instead directly generates, executes, and self-debugs code in an iterative loop. This approach improves expressivity, tool coverage, and compositionality, and it enables autonomous ML and scientific workflows by merging LLM planning with computational subroutines. OpenHands and CodeAct have been empirically validated across domains including ad hoc teamwork, scientific agent benchmarks, complex tool pipelines, and safety-critical scenarios.

1. Core Principles and Agent Architecture

OpenHands CodeAct leverages an interactive, code-driven reasoning loop in which the agent emits tool calls as executable Python code blocks. The canonical architecture comprises four interacting modules: perception (state extraction), a memory manager (structured storage and retrieval of past events), an LLM interface (prompt construction, chain-of-thought plus code synthesis), and a code executor (Python REPL or shell). The agent alternates between natural language planning and direct code actions, using interpreter feedback (outputs, runtime errors) to guide subsequent reasoning and enable autonomous self-debugging (Shi et al., 2023, Wang et al., 2024, Zhang et al., 23 Oct 2025).

At each turn $t$, the interaction proceeds as follows:

Context (history, state) is passed to the LLM, which emits a mix of chain-of-thought and executable code.
The code is executed in a sandbox; results and tracebacks are fed back as new observations.
On error, the agent either self-remediates via reflexive re-prompting or escalates for user input.
The loop halts on either explicit end-of-task emission or after reaching a system-configured budget of steps.

This modular design is formalized as:

$a_t = f_\theta(s_t), \quad s_{t+1} = \operatorname{exec}(a_t, s_t)$

where $a_t$ is the new code action and $s_t$ the current state (full conversational and execution history) (Wang et al., 2024).
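The transition above can be illustrated with a minimal sketch. It assumes a scripted stand-in for the LLM (`scripted_policy`) and uses Python's built-in `exec`, which is not a sandbox; the function names and the `None` stop signal are hypothetical and are not taken from the OpenHands codebase.

```python
import contextlib
import io
import traceback


def execute_action(code: str, namespace: dict) -> str:
    """Run one code action and return its observation (stdout or traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # NOTE: plain exec is not a sandbox
    except Exception:
        # The traceback becomes the observation that drives self-debugging.
        buffer.write(traceback.format_exc())
    return buffer.getvalue()


def run_episode(policy, max_steps: int = 5) -> list:
    """Iterate a_t = policy(s_t), s_{t+1} = exec(a_t, s_t) until policy returns None."""
    history = []      # s_t: list of (action, observation) pairs
    namespace = {}    # interpreter state that persists across turns
    for _ in range(max_steps):
        action = policy(history)
        if action is None:  # hypothetical explicit end-of-task signal
            break
        observation = execute_action(action, namespace)
        history.append((action, observation))
    return history


def scripted_policy(history):
    """Hypothetical stand-in for the LLM: fails once, then repairs the bug."""
    if not history:
        return "total = sum(values)"  # raises NameError: values is undefined
    if "NameError" in history[-1][1]:
        return "values = [1, 2, 3]\ntotal = sum(values)\nprint(total)"
    return None


trace = run_episode(scripted_policy)
```

In this run the first observation is a `NameError` traceback and the second is the printed result of the repaired action, which mirrors the error-feedback path described in the turn structure.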

2. Action Representation, Tool Integration, and Prompting

Unlike text/JSON-only agents, CodeAct unifies environment interaction—tool calls, data transformations, file operations—into executable Python. This native code action space enables:

Arbitrary composition (for-loops, functions, conditional logic)
Immediate, granular feedback for debugging and correction
Direct leverage of the Python and PyPI ecosystem (e.g., ML pipelines, plotting, optimization)

Tools are often registered as Python-callable APIs (e.g., via decorators such as @tool) and made discoverable through the system prompt or an automated tool-retrieval mechanism. In OpenHands, actions span the Python REPL, the Bash shell, web-browser queries, and custom domain APIs (Chen et al., 2024, Le et al., 24 Sep 2025). Prompt structuring instructs the LLM to generate code inside syntactic delimiters (<execute> or <code name="">) and to finish with explicit answer tags when complete.
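As a rough illustration of decorator-based registration and delimiter parsing, the sketch below defines its own `tool` decorator, `TOOLS` registry, and `extract_code` helper. All of these names are hypothetical and do not reproduce the OpenHands or ToolBrain APIs; only the `<execute>` delimiter comes from the text above.

```python
import re

TOOLS = {}  # hypothetical registry: tool name -> callable


def tool(fn):
    """Hypothetical @tool decorator: register fn so it can be exposed to the agent."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def word_count(text: str) -> int:
    """Count whitespace-separated words in text."""
    return len(text.split())


def tool_prompt() -> str:
    """Render registered tools as one line each for inclusion in a system prompt."""
    return "\n".join(f"- {name}: {fn.__doc__}" for name, fn in sorted(TOOLS.items()))


EXECUTE_RE = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


def extract_code(llm_output: str):
    """Return the code of the first <execute> block, or None if there is none."""
    match = EXECUTE_RE.search(llm_output)
    return match.group(1).strip() if match else None


reply = "I will count the words.\n<execute>\nn = word_count('a b c')\n</execute>"
code = extract_code(reply)
namespace = dict(TOOLS)  # expose registered tools to the executed code
exec(code, namespace)    # NOTE: plain exec is not a sandbox
```

After execution, `namespace["n"]` holds the tool's return value, so later code actions can compose it with ordinary Python.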

Iterative execution, self-correction, and persistence of interpreter context are handled internally:

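# Schematic loop: LLM, PythonInterpreter, and T_max stand in for the agent's components.
history = []
for t in range(T_max):
    code = LLM.generate_code_action(history)         # thought + code conditioned on history
    result = PythonInterpreter.run(code)             # output or traceback; interpreter state persists
    history.append({"action": code, "obs": result})  # feed the observation back as context
    if LLM.identifies_answer(code):                  # stop on explicit end-of-task emission
        break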

(Wang et al., 2024, Zhang et al., 23 Oct 2025).

3. Training Methodologies, Fine-Tuning, and Optimization

A central element in the evolution of CodeAct systems is supervised and RL-based fine-tuning on code-centric interaction traces. The CodeActInstruct dataset—7,139 multi-turn trajectories encompassing information seeking, math, code generation, tabular, and robot-control tasks—forms the primary instruction-tuning resource. This is supplemented with large-scale conversational data to maintain general LLM capability (Wang et al., 2024). Models such as Llama2 and Mistral-7B are finetuned with sequence lengths up to 16,384, AdamW optimizer, and only assistant-side loss computation.
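Assistant-side loss computation can be sketched as label masking. The example assumes the common convention of marking ignored positions with -100, which is the default `ignore_index` of PyTorch's cross-entropy loss; the token ids and the `build_labels` helper are illustrative and not taken from the CodeAct training code.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_labels(turns):
    """Concatenate turns into input ids and labels, masking non-assistant tokens.

    turns: list of (role, token_ids) pairs; the token ids are illustrative.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)                        # contributes to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # masked out
    return input_ids, labels


turns = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [31]),               # e.g. an execution observation fed back to the model
    ("assistant", [41, 42, 43]),
]
input_ids, labels = build_labels(turns)
```

Only the assistant's thoughts and code actions keep their labels, so the model is not trained to reproduce user instructions or interpreter output.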

For further specialization, frameworks like ToolBrain support reinforcement learning via Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), leveraging custom Pythonic reward callables or LLM-as-a-judge systems. Knowledge distillation allows large teacher models to generate high-quality traces for initializing smaller agent variants. Resource efficiency is maximized via QLoRA and bitsandbytes quantization (Le et al., 24 Sep 2025).
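The group-relative part of GRPO can be sketched as follows: several traces are sampled for the same task, scored with a reward callable, and each reward is normalized against its group's mean and standard deviation. The `reward_fn` below and the trace fields are hypothetical, and the use of the population standard deviation is an assumption; this is not the ToolBrain API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reward_fn(trace: dict) -> float:
    """Hypothetical reward callable: 1.0 for success, minus a small per-step cost."""
    return (1.0 if trace["success"] else 0.0) - 0.01 * trace["steps"]


# Illustrative group of traces sampled for a single task.
group = [
    {"success": True, "steps": 3},
    {"success": False, "steps": 5},
    {"success": True, "steps": 8},
    {"success": False, "steps": 2},
]
rewards = [reward_fn(t) for t in group]
advantages = group_relative_advantages(rewards)
```

Traces that beat their group's average receive positive advantages and the rest receive negative ones, so no separate value model is needed to produce a baseline.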

In parallel, CodeACT (Editor’s term: “Code Adaptive Compute-efficient Tuning”) introduces data pruning strategies: Complexity and Diversity Aware Sampling (CDAS) selects high-value training examples, and dynamic batch packing reduces token waste—yielding up to +8.6% absolute improvement on HumanEval and cutting training time by 78% (Lv et al., 2024).
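To show why packing reduces token waste, the sketch below uses a generic first-fit-decreasing heuristic to place variable-length sequences into fixed-size rows. It is a stand-in for illustration and is not claimed to be the packing algorithm of Lv et al.; the sequence lengths and `max_len` are made up.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing of sequence lengths into rows of max_len.

    Returns a list of bins, each a list of indices into lengths.
    """
    bins, remaining = [], []
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        if lengths[i] > max_len:
            raise ValueError(f"sequence {i} exceeds max_len")
        for b, free in enumerate(remaining):
            if lengths[i] <= free:       # first existing row with enough room
                bins[b].append(i)
                remaining[b] -= lengths[i]
                break
        else:                            # no row fits: open a new one
            bins.append([i])
            remaining.append(max_len - lengths[i])
    return bins


def padding_fraction(total_tokens, num_rows, max_len):
    """Fraction of token slots that are padding."""
    return 1 - total_tokens / (num_rows * max_len)


lengths = [900, 300, 700, 100, 500, 60]  # illustrative token counts
bins = pack_sequences(lengths, max_len=1024)
packed_waste = padding_fraction(sum(lengths), len(bins), 1024)
unpacked_waste = padding_fraction(sum(lengths), len(lengths), 1024)
```

For these illustrative lengths the six sequences fit into three rows, so the share of padded slots falls well below that of padding every sequence to `max_len` on its own row.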

4. Applications and Empirical Benchmarking

OpenHands CodeAct has been extensively evaluated across ML, scientific, and tool-use agentic benchmarks:

ML-Dev-Bench: OpenHands achieves 60% success (15/25), including 100% on dataset handling, 83% on model training, but 0% on performance optimization (Padigela et al., 3 Feb 2025).
ScienceAgentBench: CodeAct’s modular, multi-tool strategy is upper-bounded by the LLM’s tool-use proficiency (success rate, SR=24.5% with Claude-3.5-Sonnet), and its valid execution rate (VER) is 88.2%. Self-debug frameworks can outperform CodeAct on deep model development, demonstrating a cost-performance tradeoff (Chen et al., 2024).
API-Bank/M³ToolEval: CodeAct outperforms template- or JSON-action baselines on 12/17 LLMs and lifts 7B models by up to +24% over alternative tuning approaches (Wang et al., 2024).
Avalon Game (ad hoc teamwork): CodeAct achieves higher QuestWin rates and faster adaptation than semantic CoT and ReAct baselines; e.g., CodeAct QuestWin=0.593 vs. CoT=0.547 (Shi et al., 2023).

The framework supports advanced capabilities, including multi-tool ML pipelines and pipeline-level self-debug (e.g., full scikit-learn + matplotlib pipelines), and code-based adaptation in multi-agent settings via belief state tracking and soft Bayesian updates (Shi et al., 2023).
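One possible reading of a "soft" Bayesian update is a tempered update, in which the likelihood is raised to a power below one so that a single observation shifts the belief less than the full Bayes rule would. This interpretation, the temperature parameter, and the player names and likelihood values below are all assumptions for illustration, not details taken from Shi et al.

```python
def soft_bayes_update(prior, likelihood, temperature=0.5):
    """Tempered update: posterior proportional to prior * likelihood ** temperature.

    temperature=1 recovers the standard Bayes rule; smaller values update more cautiously.
    """
    unnormalized = {h: prior[h] * likelihood[h] ** temperature for h in prior}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical belief over which player is the hidden adversary.
prior = {"alice": 1 / 3, "bob": 1 / 3, "carol": 1 / 3}
# Illustrative likelihood of an observed failed quest under each hypothesis.
likelihood = {"alice": 0.8, "bob": 0.1, "carol": 0.1}

soft = soft_bayes_update(prior, likelihood, temperature=0.5)
hard = soft_bayes_update(prior, likelihood, temperature=1.0)
```

The tempered posterior for the most suspicious player lies between the prior and the full Bayesian posterior, which keeps the belief state responsive without overcommitting to one noisy observation.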

5. Safety, Permission Modeling, and Reliability

OpenHands CodeAct differentiates itself by integrating an interactive, ask-to-continue gating mechanism prior to high-risk or potentially out-of-scope tool calls. At a protocol level, every destructive or ambiguous operation is preceded by an explicit user consent dialog. Formally:

$\mathrm{Exec}(op) = \begin{cases} \text{perform}(op), & \text{if } op \in A \\ \text{perform}(op) \times \mathrm{consent}(op), & \text{otherwise} \end{cases}$

where $A$ is the implicitly authorized action subset and $\mathrm{consent}(op)\in\{0,1\}$ is collected at runtime (Qu et al., 18 May 2026).
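The gating rule can be sketched directly in code. The operation names, the `AUTHORIZED` set, and the callback signatures are hypothetical; a real deployment would collect consent through a user dialog rather than a lambda.

```python
AUTHORIZED = {"read_file", "list_dir"}  # A: hypothetical pre-authorized operations


def gated_exec(op, perform, ask_consent, authorized=AUTHORIZED):
    """Run op directly if it is in A; otherwise run it only when consent(op) == 1."""
    if op in authorized:
        return perform(op)
    if ask_consent(op):   # consent(op) in {0, 1}, collected at runtime
        return perform(op)
    return None           # blocked: perform(op) multiplied by consent 0


log = []  # records which operations were actually performed


def perform(op):
    log.append(op)
    return f"done:{op}"


r1 = gated_exec("read_file", perform, lambda op: 0)    # authorized, no prompt needed
r2 = gated_exec("delete_repo", perform, lambda op: 0)  # consent denied
r3 = gated_exec("delete_repo", perform, lambda op: 1)  # consent granted
```

Only the authorized read and the consented delete reach `perform`, so the log doubles as a simple audit trail of executed operations.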

Empirical results on the OverEager-Gen benchmark show that OpenHands CodeAct sustains an overeager rate below 5%, compared with ≈5.4–27.7% for permissive frameworks, with significant reductions of 11.6–26.6 percentage points versus Claude Code on shared base models. A paired consent ablation confirms the effectiveness of gating, and cross-framework comparisons are statistically robust (Fisher $p \leq 10^{-5}$). Rule judges exhibit precision 0.76, recall 1.00, and κ=0.73 against human annotation (Qu et al., 18 May 2026).
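For readers unfamiliar with the judge-agreement metrics, the sketch below computes precision, recall, and Cohen's kappa from a binary confusion matrix. The counts are illustrative only and are not the benchmark's data; the function name is hypothetical.

```python
def judge_agreement(tp, fp, fn, tn):
    """Precision, recall, and Cohen's kappa for a binary judge vs. human labels."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    observed = (tp + tn) / n  # raw agreement between judge and human
    # Chance agreement from each rater's marginal rates.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    expected = p_yes + p_no
    kappa = (observed - expected) / (1 - expected)
    return precision, recall, kappa


# Illustrative counts only; not taken from the OverEager-Gen benchmark.
precision, recall, kappa = judge_agreement(tp=30, fp=10, fn=0, tn=60)
```

A judge with no false negatives has recall 1.00 even when some of its positive flags are wrong, which is the pattern reported above: every human-labelled overeager action is caught, at the cost of some false alarms.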

6. Structural Limitations and Advances

Intrinsic limitations of the vanilla CodeAct loop include fragmented local reasoning (greedy code-by-code generation), instability across interaction turns, ambiguous ground truth at the step level, and insufficient supervision per turn (Ni et al., 2024). The "Tree-of-Code" framework (ToC) and the "CodeProgram" paradigm extend CodeAct by promoting global program-level planning, reflective error-driven branching, and tree-structured self-supervision. ToC achieves up to +17.7 percentage points in accuracy over CodeAct and converges in fewer than 1/4 of the turns on complex benchmarks, supporting robust data generation for subsequent SFT or ReFT (Ni et al., 2024). This suggests that adopting global planning and ensemble-based solution selection could further strengthen OpenHands CodeAct performance in practice.
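The idea of program-level generation with error-driven branching and ensemble selection can be sketched as a small tree search. This is a simplified illustration under stated assumptions, not the ToC algorithm of Ni et al.: the `revise` function is a hypothetical stand-in for LLM reflection, candidate programs are expected to set a variable named `answer`, and plain `exec` is used without sandboxing.

```python
from collections import Counter, deque


def run_program(source: str):
    """Execute a full candidate program; return (ok, value of `answer` or error text)."""
    namespace = {}
    try:
        exec(source, namespace)  # NOTE: plain exec is not a sandbox
        return True, namespace.get("answer")
    except Exception as exc:
        return False, repr(exc)


def tree_search(root_programs, revise, max_depth=2):
    """Breadth-first search: failed programs branch into revisions, successes are leaves."""
    queue = deque((src, 0) for src in root_programs)
    answers = []
    while queue:
        source, depth = queue.popleft()
        ok, result = run_program(source)
        if ok:
            answers.append(result)
        elif depth < max_depth:
            queue.extend((child, depth + 1) for child in revise(source, result))
    # Select the final answer by majority vote over successful leaves.
    return Counter(answers).most_common(1)[0][0] if answers else None


def revise(source, error):
    """Hypothetical stand-in for LLM reflection: patch one known bug."""
    return [source.replace("1 / 0", "6 * 7")]


best = tree_search(["answer = 1 / 0", "answer = 42", "answer = 41"], revise)
```

Each node is a complete program rather than a single step, so an execution result is an unambiguous signal for the whole candidate, and the vote over successful leaves plays the role of ensemble-based solution selection.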

7. Reproducibility, Open-Source Availability, and Best Practices

OpenHands CodeAct is provided as a subpackage (openhands.code_act), with modular APIs to instantiate agents, register tools, and manage interpreter execution cycles. Typical use involves CodeActAgent, configurable with model backends (e.g., Mistral-7B-CodeAct), environment interfaces, tool sets, and multi-turn behavior policies (Wang et al., 2024). For training or evaluation:

Data sampling (via CDAS), tokenization, and dynamic packing proceed via standard PyTorch/Transformers workflows.
RL and distillation use frameworks such as ToolBrain for flexible learning, quantization, and tool retrieval (Le et al., 24 Sep 2025).
Safety is enforced through consent-interactive execution, and multi-tool action logs support external audit (Qu et al., 18 May 2026).

Documentation and demonstration scripts are typically provided at the package and agent level, and benchmarks are reproducible at the scenario and audit-bundle granularity.

OpenHands CodeAct thereby constitutes a mature, code-centric agentic platform that leverages executable code as the primary vehicle for interactive tool use, modular reasoning, knowledge integration, and safety-compliant automation, with robust empirical performance and extensible support for fine-tuning, RL, and scientific or ML development tasks (Wang et al., 2024, Chen et al., 2024, Padigela et al., 3 Feb 2025, Qu et al., 18 May 2026).