Self-Challenging Language Model Agents (2506.01716v1)
Abstract: LLMs are quickly becoming the foundation for intelligent agents that are capable of using tools. However, training such agents is challenging because it requires human creation and annotation of a diverse set of tasks, tools, and evaluation criteria. In this paper, we propose the Self-Challenging framework for training an agent on high-quality tasks that are generated by itself. The agent first plays the role of challenger and generates a task after interacting with the given tools. The tasks take the form of a novel general class of problems termed Code-as-Task, which are defined by an instruction, a verification function and solution and failure cases which serve as tests, allowing to filter only for high-quality tasks. The agent then takes an executor role and trains on those tasks with reinforcement learning using the evaluation feedback as a reward. Evaluation on two existing multi-turn tool-use agent benchmarks, M3ToolEval and TauBench, shows the Self-Challenging framework achieves over a two-fold improvement in Llama-3.1-8B-Instruct, despite using only self-generated training data.
Summary
- The paper introduces a framework where LLM agents self-generate and solve tasks using a Code-as-Task format.
- It employs a dual-role setup, with a Task Challenger that generates tasks and a Task Executor that is trained on them with reinforcement learning for improved tool use.
- Experiments demonstrate up to a 20.2% absolute improvement in success rates across multiple tool-use environments over baseline methods.
This paper introduces the Self-Challenging Agent (SCA) framework, a novel approach for training LLM agents to use tools effectively without relying on manually created and annotated task datasets. The core idea is to enable an agent to generate its own training tasks, then learn to solve them.
The SCA framework assigns two roles to the LLM agent:
- Task Challenger: The agent interacts with a given set of tools and an environment to explore possibilities and then generates a task.
- Task Executor: The agent attempts to solve the tasks generated by the challenger, learning through reinforcement learning (RL) with feedback derived from task completion.
A key innovation is the "Code-as-Task" (CaT) formalism for defining these self-generated tasks. This addresses the challenge of ensuring generated tasks are feasible, verifiable, and appropriately difficult. Each CaT consists of:
- Instruction: A natural language description of the task.
- Verification Function: A piece of code (e.g., Python) that programmatically checks if a solution is correct. This function returns a reward (e.g., 0 or 1).
- Example Solution: A code-based demonstration of how to solve the task. This helps ensure the task is feasible.
- Failure Cases: Explicitly enumerated incorrect or suboptimal solutions in code. These help refine the verification function and ensure the task is non-trivial.
The CaT structure allows for automated filtering of low-quality tasks. For a task to be considered valid and used for training:
- The verification function must be executable.
- The provided example solution must pass the verification function.
- None of the provided failure cases may pass the verification function.
This filtering mechanism is crucial for maintaining high-quality training data. An example of a CaT is shown in Figure 2 of the paper, where a task in a retail environment involves an instruction to modify an order, a Python verification function checking database states, an example solution showing API calls, and failure cases representing incorrect attempts.
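For concreteness, the following is a minimal, hypothetical CaT in the spirit of that retail example, together with the three filtering checks. It is not the paper's Figure 2: the `RetailEnv` stub and its API names are invented for illustration.

```python
# Hypothetical Code-as-Task for a retail-style environment (illustrative sketch only).
# The RetailEnv stub and its API names are assumptions, not the paper's actual environment.

class RetailEnv:
    """Minimal stand-in for a tool-use environment with a mutable database."""
    def __init__(self):
        self.orders = {"#W123": {"status": "pending"}, "#W456": {"status": "pending"}}

    def get_order_details(self, order_id):
        return self.orders[order_id]

    def cancel_order(self, order_id, reason):
        self.orders[order_id]["status"] = "cancelled"

# --- CaT components ---
instruction = "Cancel order #W123 so that its status ends up as 'cancelled'."

def verification_function(env, answer=None):
    # Programmatic check of the final environment state; returns a 0/1 reward.
    # (The final answer could also be checked; this example only inspects the database.)
    return 1 if env.get_order_details("#W123")["status"] == "cancelled" else 0

def example_solution(env):
    # Feasibility witness: a tool-call sequence that should pass verification.
    env.cancel_order("#W123", reason="no longer needed")

failure_cases = [
    lambda env: env.cancel_order("#W456", reason="no longer needed"),  # cancels the wrong order
    lambda env: None,                                                  # does nothing at all
]

# --- Automatic filtering: keep the task only if all three checks hold ---
def passes_filter():
    env = RetailEnv()
    example_solution(env)
    if verification_function(env) != 1:      # the example solution must pass
        return False
    for failure_case in failure_cases:
        env = RetailEnv()
        failure_case(env)
        if verification_function(env) != 0:  # every failure case must fail
            return False
    return True

print(passes_filter())  # True -> the task would be added to the training pool
```

The executability check is implicit here: if `verification_function` raised an exception, the task would likewise be discarded.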
Implementation of the Task Challenger and Executor:
- Task Challenger ($\pi^{\text{task}}$):
  - Receives an initial prompt (e.g., API documentation and an example CaT structure, as shown in Appendix Figure 10 for a retail environment).
  - Interacts with tools (e.g., `get_order_details()`, `get_product_details()`) to gather information about the environment's state and capabilities.
  - Outputs a task in the CaT format (instruction, verification function, solution, failure cases).
  - The prompt for the challenger guides it to explore the environment first (e.g., "AFTER CHECKING THE USER DETAILS AND THEIR ORDER DETAILS") before formulating the task.
```python
# Pseudocode for Challenger interaction
observation = initial_prompt_and_tool_docs
history = []
for _ in range(MAX_CHALLENGER_STEPS):
    action = challenger_LLM(observation, history)  # e.g., "print(get_user_details(user_id='xyz'))"
    if "ANSWER:" in action:  # Challenger decides to output a CaT
        task_CaT = parse_CaT_from_action(action)
        break
    tool_output = execute_tool_action(action)
    history.append((action, tool_output))
    observation = tool_output

# Automatic filtering of task_CaT
if is_runnable(task_CaT.verification_function) and \
   task_CaT.verification_function(task_CaT.example_solution) == SUCCESS and \
   all(task_CaT.verification_function(fc) == FAILURE for fc in task_CaT.failure_cases):
    add_to_task_pool(task_CaT)
else:
    # Optionally provide feedback to the challenger for revision (as mentioned in Appendix A)
    error_traceback = get_filtering_error()
    # Revise the task or discard it
```
- Task Executor ($\pi^{\text{exec}}$):
  - Receives an instruction from a valid CaT.
  - Interacts with the environment and tools over multiple turns, sampling $a_{t+1} \sim \pi^{\text{exec}}(\cdot \mid o_{0:t+1}, a_{1:t})$.
  - At the end of an episode ($T$ steps), its final state $s_T$ and answer $a_T$ are evaluated by the CaT's verification function $R_{\hat{c}}(s_T, a_T)$, which provides a reward.
  - This reward is used for RL training.
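A minimal sketch of this rollout loop is shown below, assuming hypothetical helpers `executor_LLM` and `execute_tool_action`, and a `task` object whose `verification_function` scores the final environment state and answer; none of these names come from the paper.

```python
# Hypothetical executor rollout (illustrative sketch; helper names are assumptions).
MAX_EXECUTOR_STEPS = 20

def rollout_executor(task, env, executor_LLM, execute_tool_action):
    """Run one multi-turn episode on a CaT and return (trajectory, reward)."""
    observation = task.instruction           # o_0: the natural-language instruction
    trajectory = []                          # list of (observation, action) pairs
    answer = None
    for _ in range(MAX_EXECUTOR_STEPS):
        # Sample a_{t+1} ~ pi^exec( . | o_{0:t+1}, a_{1:t})
        action = executor_LLM(observation, trajectory)
        trajectory.append((observation, action))
        if action.startswith("ANSWER:"):     # the executor terminates with a final answer a_T
            answer = action
            break
        observation = execute_tool_action(env, action)  # tool call -> next observation
    # R_c(s_T, a_T): reward from the CaT's verification function on the final state and answer.
    reward = task.verification_function(env, answer)
    return trajectory, reward
```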
The paper explores two training settings for the executor:
- Self-Improvement:
- The same LLM (e.g., Llama-3.1-8B) acts as both challenger and executor.
- The executor collects trajectories on the tasks it generated.
- It is trained using REINFORCE (Equation 2), which, due to the 0/1 reward structure, is equivalent to supervised fine-tuning (SFT) on only the successful trajectories (Rejection Fine-Tuning); a minimal sketch follows this list.
- The loss function for REINFORCE is $\mathcal{L} = -\sum_{\hat{c} \in \hat{C}} \mathbb{E}_{a_{1:T} \sim \pi^{\text{exec}}}\big[ R_{\hat{c}}(s_T, a_T) \sum_{t=0}^{T} \log \pi^{\text{exec}}(a_t \mid o_{0:t}, a_{0:t-1}) \big]$.
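The following is a minimal sketch of rejection fine-tuning under this 0/1 reward, assuming a HuggingFace-style causal LM and a list of `(trajectory_text, reward)` pairs collected by the executor; the `rollouts` placeholder and training details are assumptions, not the paper's exact recipe.

```python
# Rejection fine-tuning: keep only successful (reward == 1) trajectories, then do plain SFT.
# Illustrative sketch; model choice, data format, and hyperparameters are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# rollouts: (trajectory_text, reward) pairs produced by executor episodes on CaT tasks.
rollouts = [("<serialized multi-turn trajectory>", 1)]        # placeholder data
successful = [text for text, reward in rollouts if reward == 1]

model.train()
for text in successful:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # Next-token cross-entropy over the trajectory; with rewards in {0, 1}, this matches
    # the REINFORCE objective above. In practice the loss is usually masked to the
    # agent's action tokens rather than applied to the full observation/action sequence.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```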
- Distillation:
- A (potentially weaker) challenger LLM (e.g., Llama-3.1-8B) generates CaT tasks.
- A stronger teacher LLM (e.g., Llama-3.1-70B) generates trajectories (solutions) for these tasks.
- A weaker student LLM (the executor, e.g., Llama-3.1-8B) is trained via SFT (cross-entropy loss, Equation 3) on all trajectories (both successful and failed) from the teacher; the paper notes that even failed trajectories from a stronger model can be beneficial. A sketch of this data pipeline follows this list.
- The SFT loss is $\mathcal{L} = -\sum_{\{o_{0:T},\, a_{0:T}\} \in \hat{D}} \big[ \sum_{t=0}^{T} \log \pi^{\text{exec}}(a_t \mid o_{0:t}, a_{0:t-1}) \big]$.
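Below is a minimal sketch of the distillation data collection described above, reusing the hypothetical `rollout_executor` helper from the Task Executor sketch; the environment factory, teacher model handle, and data format are assumptions.

```python
# Distillation: a stronger teacher rolls out on the challenger's filtered CaT tasks,
# and the student is SFT-trained on ALL resulting trajectories, successful or not.
# Illustrative sketch; rollout_executor is the helper sketched in the Task Executor section.

def build_distillation_dataset(tasks, make_env, teacher_LLM, execute_tool_action):
    dataset = []
    for task in tasks:                        # tasks: the filtered CaT pool from the challenger
        env = make_env()
        trajectory, reward = rollout_executor(task, env, teacher_LLM, execute_tool_action)
        # Unlike rejection fine-tuning, failed teacher trajectories are kept as well,
        # since even imperfect demonstrations from a stronger model can help the student.
        dataset.append({
            "instruction": task.instruction,
            "trajectory": trajectory,
            "reward": reward,                 # logged for analysis, but not used to filter
        })
    return dataset
```

The student is then trained with the same cross-entropy loop as in the rejection fine-tuning sketch, only without the reward filter.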
Experimental Setup and Results:
The framework was evaluated on four tool-use environments from M3ToolEval (Calculation, Web Browsing) and TauBench (Retail, Airline). The base model for fine-tuning was Llama-3.1-8B-Instruct.
- Distillation: SCA improved Llama-3.1-8B's average success rate by 20.2% absolute (Pass@1) across all environments using only 800 synthetic tasks and 12k rollout trajectories from Llama-3.1-70B as the teacher. This significantly outperformed the PAE baseline (Proposer-Agent-Evaluator, 2412.13194).
- Self-Improvement: Using Llama-3.1-8B as both challenger and executor, SCA doubled its average success rate from 12.0% to 23.5% (Pass@1). This also outperformed PAE, particularly in partially observable environments where PAE struggled (e.g., Airline, where PAE led to 0% Pass@4). SCA's advantage comes from the challenger actively interacting with the environment to gather information for task creation and the precise reward feedback from CaT's verification functions.
Key Ablations and Analyses:
- Impact of CaT Components (Human Annotation Study, Figure 5):
- PAE (Instruction only, LLM-based evaluation): High False Negatives (FN) due to ambiguous/impossible tasks.
- CaT (Instruction + Verification Function only): Marginal improvement over PAE; verification functions can still be flawed or tasks infeasible. Pass rate for task generation: 47.7%.
- CaT (Instruction + Verification + Solution): Significantly reduces FNs by ensuring feasibility but increases False Positives (FP) as verification might be too lenient. Pass rate: 9.5%.
- Full CaT (Instruction + Verification + Solution + Failure Cases): Effectively eliminates FPs by using failure cases to filter out overly lenient verifiers. Still some FNs due to incomplete instructions (semantic nuances). Pass rate: 5.2%.
- This shows that each component of CaT progressively improves task quality and reward accuracy. The low final pass rate (5.2%) indicates the strictness and importance of the filtering.
- RL Algorithms for Self-Improvement (Figure 4):
- Offline methods like Rejection Fine-Tuning (used in main results) and DPO provide significant gains.
- Online RL algorithms (PPO, GRPO) can achieve even better performance on the same SCA-generated tasks (e.g., PPO pushed Pass@1 from 20.3% to 43.2% in Calculation).
- However, online methods require more complex infrastructure and careful hyperparameter tuning (GRPO showed instability).
- Task Diversity and Difficulty (Figure 6):
- CaT filtering can make the task distribution more homogeneous for weaker challenger models (Llama-3.1-8B) by removing poor generations.
- For stronger challenger models (Llama-3.1-70B), CaT filtering refines the task distribution while preserving diversity. This suggests filtering primarily removes invalid tasks rather than reducing inherent task variety from capable models.
- Data Scaling (Figure 7):
- Scaling the number of unique synthetic tasks is more crucial for out-of-distribution generalization than scaling the number of rollout trajectories per task. Training on too few tasks (e.g., 200), even with many rollouts, can hurt test performance or lead to marginal gains. With sufficient task diversity (e.g., 800 tasks), increasing rollouts consistently improves test performance. This highlights the need for a large and diverse set of training tasks.
Practical Implementation Considerations:
- Computational Cost: Challenger task generation and rollout generation are the main bottlenecks due to multi-turn interactions. Training (Rejection FT) and evaluation are relatively faster (Table 3, Appendix). For example, generating challenges for the Retail environment took 32 hours on 8xA100 80G.
- Prompting: Prompts for the challenger need to be carefully designed to encourage exploration and adherence to the CaT format (see Appendix Figure 10-13). Stronger LLMs are more robust to prompt variations.
- Environment Interaction: The challenger benefits from actively querying tools and the environment before generating a task, unlike methods that only use initial API documentation.
- Filtering: The automatic filtering based on CaT components is critical. While it reduces the number of usable tasks (e.g., 5.2% pass rate for full CaT with Llama-3.1-8B), it ensures high quality.
- Distillation Trajectories: Including unsuccessful trajectories from a strong teacher model during distillation is beneficial for the student model (Table 2, Appendix).
Limitations:
- A non-trivial number of False Negative tasks persist even with CaT, mainly due to semantic nuances like ambiguity or missing information in instructions that are hard for the verification code to catch.
- The improvements observed tend to be environment-specific rather than enhancing general, cross-environment agentic capabilities.
In conclusion, the Self-Challenging Agent framework with Code-as-Task offers a practical and automated way to generate high-quality training data for LLM agents, significantly improving their tool-use capabilities in both self-improvement and distillation settings, particularly by ensuring tasks are verifiable, feasible, and appropriately challenging through code-based definitions and filtering.