On Effectiveness and Efficiency of Agentic Tool-calling and RL Training
Abstract: Tool-calling is a central component of modern LLM agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings where without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Explain it Like I'm 14
A simple explanation of “On Effectiveness and Efficiency of Agentic Tool‑calling and RL Training”
1) What is this paper about?
This paper looks at two big things about “tool-calling” in AI assistants (like chatbots):
- Effectiveness: How do we fairly measure how good an AI is at using tools (like calling an API, using a calculator, or checking a calendar)?
- Efficiency: How can we train AIs to use tools faster and cheaper without losing quality?
The authors show that small, often ignored choices in testing can noticeably change a model’s score. They also present two training tricks that speed up learning in a big way.
2) What questions are the researchers trying to answer?
They focus on two main questions:
- How trustworthy are current test scores for tool-using AIs? Are the results sensitive to small choices, like the random seed (the “luck factor” in the system), the system prompt (the instructions you give the model), or how you format a multi-turn conversation?
- Where is training time being wasted when teaching AIs to use tools with reinforcement learning (RL), and how can we reduce that waste?
In simpler terms: Are we grading fairly, and are we practicing efficiently?
3) How did they study this? (Methods in everyday language)
To study effectiveness (fair scoring):
- They used a popular test called BFCL, which checks how well models handle many tools and tasks, including multi-turn conversations.
- They ran the same models multiple times with different random seeds (think: rolling the dice again), changed how they formatted conversations, tried keeping or dropping the model’s earlier “thinking,” and slightly adjusted the system prompt.
- They also compared training only on single-turn examples versus training on multi-turn examples to see which helps more.
To study efficiency (speeding up training):
- They examined RL training (a “practice and feedback” method) to see where time is being wasted.
- They found two big issues: 1) Many practice prompts don’t teach the model anything new (all attempts get the same score), so time spent on them is wasted. 2) The “update” step (where the model actually learns from its practice) takes a lot more time than generating practice attempts.
- They proposed two simple fixes:
- Online pre-rollout filtering: If a prompt has been solved perfectly for the last 1–2 training rounds, skip it for now. (Why re-do problems you always get right?)
- Variance-aware down-sampling: When you generate several attempts for a prompt, only learn from a few that are most different in score (the most informative successes and failures). This cuts the heavy learning cost while keeping the best learning signal.
4) What did they find, and why does it matter?
Here are the main takeaways, with plain-language reasons they’re important:
- Small testing choices can change scores a lot in multi-turn tasks.
- Changing the random seed changed multi-turn scores by up to about 3%.
- Changing how you format conversation history (using the model’s “native” chat format versus cramming everything into one message) made a big difference—about a 6–8% boost for the native format.
- Keeping the model’s earlier “thinking” in the conversation helped by about 2–5%.
- Slightly stronger system prompts (the instructions at the start) could improve scores as much as or more than RL fine-tuning.
- Why it matters: Leaderboards and comparisons can be unreliable if these details aren’t standardized and reported. Two teams might use different prompts or formats and think their method is better, when they’re really just grading differently.
- Training on multi-turn examples isn’t automatically better for multi-turn performance.
- In a controlled test with the same data budget, training only on multi-turn data did not improve multi-turn results and sometimes made them worse.
- Training on single-turn data improved single-turn scores and kept multi-turn performance roughly the same.
- Why it matters: Collecting high-quality multi-turn training data is hard and expensive. This suggests quality and alignment of conversations matter more than just “having multi-turn data.”
- RL training wastes a lot of compute unless you’re careful.
- Up to about 80% of prompts gave no useful learning signal (all attempts had identical scores).
- The model’s learning (“policy update”) step took much more time than generating attempts—often 3–5× more—and got worse as you generate more attempts per prompt.
- Why it matters: Lots of compute (time, money, energy) is burned with little learning. That slows progress and increases costs.
- Two simple tricks made RL training faster without hurting results.
- Skipping prompts that have been “all correct” for 1–2 rounds saved time and kept learning focused on harder prompts.
- Only updating the model on the most informative attempts (the ones with the biggest score differences) cut learning cost.
- Overall speedups: about 1.7× faster in single-turn settings and about 2.6× faster in multi-turn settings for the same or better accuracy.
- Why it matters: Faster training means lower costs, smaller carbon footprint, and more people can experiment and improve models.
- Improvements transferred to another test and didn’t harm general skills.
- Models improved on a different benchmark (ACEBench), not just BFCL.
- Scores on general tasks like reading comprehension and logic stayed the same or slightly improved.
- Why it matters: The gains weren’t just a “test trick,” and there’s no obvious tradeoff with other abilities.
5) Why this work matters in the bigger picture (implications)
- For fair comparisons, the community should standardize and report details like random seeds, system prompts, how conversation history is formatted, and whether the model’s earlier “thinking” is kept. Otherwise, leaderboard positions can be misleading.
- For cheaper and greener training, skip practice that doesn’t teach anything and focus learning on the most informative attempts. This helps teams move faster with fewer resources.
- For better multi-turn performance, simply adding more multi-turn data isn’t a silver bullet. The quality and structure of multi-turn conversations—and how they align with the test—likely matter more.
- Overall, the paper pushes the field toward more trustworthy testing and more efficient training, helping everyone get clearer answers and faster progress in building capable, tool-using AI assistants.
Knowledge Gaps
Below is a concise, actionable list of knowledge gaps, limitations, and open questions the paper leaves unresolved:
- Lack of a standardized, versioned evaluation protocol: no released reference scripts/seeds/prompts/templates to make cross-paper BFCL comparisons reproducible and fair.
- Seed sensitivity explored only on BFCL and a handful of models; no broader quantification across multiple benchmarks, model families, and larger seed sweeps with formal confidence intervals.
- Dependence on a single user simulator (Claude 4) with an ad hoc “stay USER” constraint; no comparison to human evaluation, alternative simulators, or analysis of simulator-induced variance.
- System-prompt sensitivity shown via one “stronger” variant; no systematic prompt sweep, prompt-tuning baselines, or canonical prompts defined per model/benchmark to decouple “method” from “prompt.”
- Multi-turn template findings (native vs context) not validated across more architectures, tokenizer/chat-template variants, or diverse tool-schema serializations (ordering, verbosity, formatting).
- Reasoning-history retention studied only in the “keep all” vs “drop all” extremes; no exploration of truncation/summarization strategies, memory modules, or context-budget trade-offs under hard token limits.
- Unclear handling of tool I/O verbosity and privacy; no study of how tool-output length/format contributes to context bloat, update cost, and performance.
- Multi-turn supervision quality identified as a bottleneck but not operationalized: no trajectory-quality metrics, noise-detection/denoising procedures, or data-cleaning pipelines evaluated.
- Training sets are small after processing (≈2.3k single-turn; ≈2.6k multi-turn); scalability of conclusions to larger, higher-quality corpora remains untested.
- Generalization to unseen tools/APIs and schema changes is not assessed; no OOD or “cold-start tool” evaluation.
- Reward design is under-specified: definition of “verified reward,” reward granularity (binary vs multi-level), sparsity, and token/turn-level credit assignment remain unclear.
- No comparison of GRPO to other RLHF variants (e.g., PPO, ReMax, RLAIF, DAPO) for tool-calling; algorithm-specificity of findings is unknown.
- Pre-rollout filtering only targets “all-correct” prompts; no strategy to identify and prioritize consistently-failed or borderline prompts where learning signal may be highest.
- Risk of reduced exploration and forgetting from skipping solved prompts is not rigorously evaluated; no periodic reintroduction schedule or anti-forgetting mechanism is proposed/tested.
- Choice of k in the pre-rollout filter is heuristic; no sensitivity analysis for k, or adaptive scheduling based on stability, reward noise, or curriculum stage.
- Variance-aware rollout down-sampling: m/n selection lacks principled guidance; no analysis of gradient bias/variance trade-offs, convergence guarantees, or robustness across reward distributions beyond binary.
- Efficiency profiling is tied to a specific framework (VERL) and hardware; portability of speedups across training stacks and GPU architectures is not demonstrated.
- Entropy analyses are descriptive; no causal links to learning dynamics or use as diagnostics (e.g., early stopping, curriculum triggers) are established.
- Evaluation focuses on accuracy; no reporting on latency, number of tool calls, token/compute cost, failure modes, rate-limit handling, or robustness to tool/API errors.
- Findings are based on small models (3B–8B); scaling behavior and whether sensitivities/efficiency gains hold for larger (≥30B) models are unknown.
- Safety impacts are not evaluated: no analysis of misuse risks, unsafe tool invocation, or interaction with guardrails under the proposed training/evaluation choices.
- “Live” evaluations are not clearly executed against real, failure-prone APIs; robustness to transient errors, non-deterministic tool responses, and network issues is untested.
- No released, canonical leaderboard protocol (prompts, templates, history rules, seeds, scoring) to stabilize multi-turn rankings and mitigate “prompt lottery” effects.
- Potential benchmark contamination from training data is mentioned but not audited; no leakage tests or controlled studies of training-data influence on BFCL/ACEBench.
- Cross-benchmark generalization is limited (ACEBench English only); multilingual, domain-shifted, and knowledge-shifted settings (e.g., T-Knowledge) remain unexplored.
- Interaction between system-prompt strength and RL gains is not disentangled; no factorial study to quantify additive vs synergistic effects of prompt design and RL.
- Simulator “role drift” is patched with a single instruction; robustness of evaluations to simulator prompt engineering and stochasticity is not characterized.
- Zero-variance prompt detection may be brittle to evaluator/reward noise; false all-correct classifications and their impact on filtering are not analyzed.
- Tool-schema verbosity contributes to high update cost but is not systematically optimized (e.g., schema compression, selective schema exposure, retrieval-on-demand).
- Conflict of interest and model selection: comparative evaluations could be broadened and independently replicated to mitigate bias concerns.
Practical Applications
Immediate Applications
Below are concrete applications that can be deployed now, derived from the paper’s evaluation findings and RL training efficiency techniques.
- Bold: Evaluation Standardization Toolkit for Tool-Calling (software platforms, academia, benchmarks)
- What: A lightweight “eval-kit” that enforces standardized, documented configurations for tool-calling benchmarks (BFCL/ACEBench), including multi-seed runs, model-native multi-turn templates, explicit “thinking history” policy, and fixed system prompts.
- Tools/workflows:
- A config manifest (seed count, prompts, templates, simulator settings);
- CI jobs (e.g., GitHub Actions) that run 3–10 seeds and report means/variance;
- A “config-hash” embedded in reports/leaderboards to ensure comparability;
- A patch to user simulators to prevent role drift (e.g., the added constraint sentence for Claude-4).
- Assumptions/dependencies: Access to benchmark datasets and the ability to control templates/prompts; tolerance for slightly higher eval cost due to multi-seed runs.
- Bold: Prompt Baseline Hardening Service (enterprise PromptOps, software)
- What: A controlled prompt management module that establishes “baseline” and “stronger” system prompts for multi-turn tool-calling and quantifies gains attributable to prompts vs. training.
- Tools/workflows: Prompt registry with versioning; A/B harness that isolates prompt-induced deltas before/after training; guardrails to prevent overfitting prompts to specific benchmarks.
- Assumptions/dependencies: Centralized prompt governance; prompt versions logged in evaluation manifest.
- Bold: Native Multi-Turn Template Canonicalization (software, robotics)
- What: Adopt and enforce the model’s native chat template for multi-turn interactions rather than concatenated “context” templates to unlock 6–8% accuracy gains observed in BFCL multi-turn.
- Tools/workflows: Template adapters per model family (e.g., Qwen/Llama); linting that rejects non-native formatting in evaluation and production.
- Assumptions/dependencies: Access to and use of model-specific chat templates; capacity to refactor existing stacks to native formatting.
- Bold: Thinking-History Retention Policy (software, robotics, customer support)
- What: Keep prior reasoning traces across turns to improve multi-turn tool-calling by ~2–5% for reasoning-oriented models; pair with token-budgeting heuristics or summarization for long contexts.
- Tools/workflows: Context window budgeting; selective summarization of older turns; telemetry on reasoning-token footprint.
- Assumptions/dependencies: Sufficient context window; cost controls; models that benefit from retained reasoning.
- Bold: Reproducibility Checklist and Reporting (academia, benchmarks, policy-facing evaluations)
- What: Paper and internal-report checklists requiring seeds, template type, thinking-history policy, prompts, and evaluation harness versions to reduce benchmark lottery effects.
- Tools/workflows: Artifact checklist embedded in submission templates and internal RFCs; replication packs with config files.
- Assumptions/dependencies: Journal/conference or organizational buy-in to require/check these fields.
- Bold: RL Training Efficiency Plugin for VERL/TRL (model labs, cloud ML)
- What: Drop-in implementation of the two techniques proposed: (1) online pre-rollout filtering of “all-correct” prompts using short-horizon streaks, and (2) variance-aware rollout down-sampling for GRPO/PPO.
- Tools/workflows:
- A small cache tracking per-prompt “all-correct” streaks with k=1–2;
- Reward-variance selector that backpropagates only high-contrast rollouts (m < n).
- Assumptions/dependencies: RL setup with verified reward signals; ability to log per-prompt rollout outcomes; support for subsetting rollouts during updates.
- Impact: 1.7× speedup (single-turn) and 2.6× (multi-turn) wall-clock training without degrading accuracy.
- Bold: Data Strategy Shift: Single-Turn First (startups, education, software)
- What: Prioritize single-turn supervision for tool-calling (SFT or RL) to improve single-turn performance while preserving multi-turn accuracy, deferring expensive multi-turn data collection until curation quality is high.
- Tools/workflows: Policy-based data filters to remove trivially easy/hard prompts; progressive inclusion of curated multi-turn trajectories.
- Assumptions/dependencies: Availability of single-turn datasets; acceptance that multi-turn data can be noisy and may not immediately help.
- Bold: Cost and Carbon Reduction Playbook (cloud, energy-conscious orgs)
- What: Apply the RL efficiency methods to reduce GPU-hours and energy use, and report compute/energy alongside results to align with sustainability goals.
- Tools/workflows: GPU-hour dashboards; carbon accounting (e.g., per-epoch emissions with and without filtering/down-sampling).
- Assumptions/dependencies: Visibility into training-time breakdown; organizational incentives for sustainability reporting.
- Bold: Regulated-Deployment Prechecks for Tool-Calling Agents (healthcare, finance, legal ops)
- What: Pre-deployment gates that require standardized evaluation (multi-seed, prompt/template manifest), reproducibility reports, and A/B performance stability across seeds.
- Tools/workflows: Compliance checklists; reproducibility CI; shadow deployments in sandboxed tool environments.
- Assumptions/dependencies: Access to representative domain tools/APIs; internal risk governance.
- Bold: Course Modules and Lab Assignments on Evaluation Fragility (academia)
- What: Teaching materials that demonstrate seed sensitivity, template effects, and prompt-induced gains; encourage best-practice reporting.
- Tools/workflows: Reproducible notebooks with BFCL-like tasks; assignments that require config manifests.
- Assumptions/dependencies: Open-source models and datasets; suitable compute for small seed sweeps.
Long-Term Applications
The following applications will benefit from further research, scaling, standardization, or ecosystem development.
- Bold: Industry Standard for Agentic Tool-Calling Evaluation (policy, standards bodies, benchmarks)
- What: A NIST/ISO-like specification that mandates multi-seed reporting, standardized system prompts, native template usage, thinking-history policies, and a signed evaluation manifest.
- Tools/products: Certification programs; reproducibility scores on leaderboards; public registries of manifests.
- Dependencies: Cross-organization coordination; benchmark maintainers’ support; legal frameworks for claims.
- Bold: Leaderboard Governance with Reproducibility Guarantees (benchmarks, platforms)
- What: Leaderboards that reject submissions lacking config manifests; show confidence intervals; re-run spot checks under controlled seeds to validate claims.
- Tools/products: Repro runner infrastructure; config-hash verification; submission schemas.
- Dependencies: Funding/sponsorship for compute; community buy-in.
- Bold: High-Quality Multi-Turn Trajectory Curation and Repair (academia, data vendors)
- What: Pipelines that detect error propagation, ambiguity, and misaligned “correct” steps in multi-turn traces; automatic trajectory repair or re-generation (e.g., via simulated agent–human interplay).
- Tools/products: Trajectory quality metrics, repair heuristics, and alignment scoring; dataset release policies that include quality audits.
- Dependencies: Reliable simulators/annotators; domain-specific tool schemas; additional research on trajectory alignment.
- Bold: Adaptive RL Orchestration Beyond Heuristics (model labs, cloud ML)
- What: A controller that continuously estimates prompt informativeness and adjusts (a) sampling vs. updates, (b) number of rollouts n and backprop subset m, and (c) curriculum over prompts.
- Tools/products: Budget allocators; dynamic schedulers integrated with VERL/TRL; reward-shaping that leverages zero-variance prompts (e.g., entropy-guided methods).
- Dependencies: Robust online metrics; stability analyses; scheduler–framework integration.
- Bold: Hardware/Compiler Support for Update-Dominant Workloads (chip vendors, frameworks)
- What: Kernel and compiler-level optimizations tailored to tool-calling’s update-heavy profile (long sequences with schemas/history), e.g., gradient checkpointing tuned for multi-turn contexts or operator fusion for long-token backprop.
- Tools/products: Framework patches in PyTorch/JAX; specialized memory planners for long context backprop.
- Dependencies: Vendor prioritization; benchmark baselines to quantify gains.
- Bold: Sector-Specific Evaluation and Certification (healthcare, finance, public sector)
- What: Domain-focused suites where tool schemas, prompts, and history policies are fixed and audited; certification badges required for deployment in high-stakes settings.
- Tools/products: EHR tool-call sandboxes; financial compliance tool-call harnesses; documented failure modes under seed variance.
- Dependencies: Regulators and professional bodies defining acceptable practices; shared testbeds with synthetic but realistic data.
- Bold: Carbon-Aware RL Training Schedulers (energy, cloud)
- What: Schedulers that exploit the paper’s efficiency insights to shift update-heavy steps to low-carbon windows and throttle redundant rollouts automatically.
- Tools/products: Integration with grid carbon APIs; per-stage energy profiling to inform scheduling.
- Dependencies: Accurate per-stage telemetry; cloud primitives for carbon-aware scheduling.
- Bold: Context-Economy Managers for Multi-Turn Agents (software, robotics)
- What: Policies that retain, compress, or discard “thinking history” to balance accuracy vs. cost, with automatic validation that accuracy doesn’t regress beyond a threshold.
- Tools/products: Summarization modules with accuracy monitors; budget-aware context managers.
- Dependencies: Better summarization that preserves tool-use cues; guardrails to detect regressions.
- Bold: Prompt Governance with Generalization Audits (platforms, enterprise PromptOps)
- What: Systems that evaluate whether prompt gains carry over to distinct benchmarks (e.g., BFCL → ACEBench), flagging potential prompt overfitting.
- Tools/products: Cross-benchmark audit pipelines; drift detectors for prompt-induced behavior shifts.
- Dependencies: Access to multiple diverse benchmarks; organizational culture around prompt accountability.
- Bold: Production Observability for Tool-Calling Fidelity (software, safety)
- What: Runtime logging/telemetry that captures the exact evaluation-like config (prompt, template, history policy) used per interaction, enabling audit and forensic analysis of failures.
- Tools/products: Structured event schemas; privacy-preserving logs of tool I/O and decisions; redaction pipelines.
- Dependencies: Privacy/compliance alignment; storage and governance for logs.
- Bold: Education Standards for Reproducible LLM Agent Experiments (academia, policy)
- What: Curricular standards encouraging multi-seed reporting and manifest sharing; funding policies that require reproducibility artifacts for agentic tool-calling research.
- Tools/products: Shared teaching kits; reproducibility grant requirements.
- Dependencies: Adoption by universities and funding agencies.
Notes on cross-cutting assumptions:
- The RL efficiency methods require reliable, verifiable reward signals per rollout and are most compatible with PPO/GRPO-like training loops that can selectively update on rollout subsets.
- Gains from native templates and thinking-history retention depend on model families that adhere to strong chat-template conventions and can leverage prior reasoning.
- Evaluation reliability hinges on benchmark owners agreeing to, or at least documenting, defaults for prompts, templates, and history handling.
- For regulated sectors (healthcare/finance), sandboxed tools and auditable logs are prerequisites for adopting these practices.
Glossary
- ACEBench: A benchmark suite for evaluating tool-learning and function-calling abilities of language-model agents. "the evaluation on ACEBench (Chen et al., 2025)"
- advantage: In policy-gradient RL, a measure of how much better a sampled action (or sequence) is than a baseline, used to weight updates. "to stabilize advantage estimates."
- all-correct: A prompt whose sampled rollouts all achieve maximum reward, yielding no learning signal for gradient updates. "we skip prompts that have been all-correct for the past k epochs."
- benchmark alignment: The degree to which training data and trajectories match what a benchmark measures, affecting transfer and reported gains. "bottlenecked by trajectory quality and benchmark alignment"
- BFCL: The Berkeley Function Calling Leaderboard, a widely used benchmark for tool-calling evaluation. "the BFCL (Patil et al., 2025) benchmark"
- chain-of-thought: Explicit intermediate reasoning traces produced by a model, often retained across turns. "Reasoning content (e.g., chain-of-thought or > blocks)"
chat template: A model- or framework-specific message formatting scheme for role-based dialogue. "chat-template tokens"
- clip: The clipping operator that bounds a value within a range in PPO-style objectives to stabilize training. "clip(Pi,k+1,j,1-€, 1+€)"
- context template: A multi-turn formatting strategy that concatenates the entire dialogue history into a single user turn. "Context template."
- Entropy dynamics: The evolution of token-level uncertainty during training, tracked to analyze convergence and exploration. "Entropy dynamics."
- environment observation: External feedback available to the agent during interaction, such as tool outputs. "environment observation (e.g., tool outputs)"
- GEPA: A reflective prompt-evolution method used to improve system prompts without RL. "We do not tune prompts with methods such as GEPA (Agrawal et al., 2025)"
- GRPO: Group Relative Policy Optimization, an RL algorithm variant used for training tool-calling policies. "Under GRPO, given si,k the policy"
- group normalization: Normalizing rewards within a rollout group before computing advantages to stabilize gradients. "After group normalization, the ad- vantage for rollout j is"
- leaderboard rankings: Ordered comparisons of models on benchmarks, which can be unreliable without standardized evaluation. "leaderboard rankings are unreliable."
- max-variance rollout down-sampling: Selecting a subset of rollouts that maximizes reward variance to reduce update compute while preserving learning signal. "max-variance rollout down-sampling (Xu et al., 2025)"
- multi-turn supervision: Training with trajectories that span multiple conversational turns, including intermediate tool use and reasoning. "Multi-turn supervision does not improve multi-turn BFCL"
- multi-turn template: The formatting scheme used to represent and pass multi-turn dialogue history to the model. "construction of the multi-turn template."
- native multi-turn template: Using a model’s built-in role-based chat formatting for multi-turn history rather than concatenated context. "using the native multi-turn template yields a consistent ~6-8% gain"
- non-stationarity: The property that the data distribution (e.g., prompt difficulty) changes over training time, affecting filter reliability. "This non-stationarity is important:"
- ORPO: A policy-optimization objective variant referenced in comparisons of training dynamics. "Vanilla ORPO"
- parametric knowledge: Information stored in a model’s parameters as opposed to obtained via external tools or context. "LLM parametric knowledge."
- policy updates: The optimization phase where model parameters are adjusted using gradients computed from collected rollouts. "policy updates, where the model is optimized on the collected rollouts."
- PPO: Proximal Policy Optimization, a popular on-policy RL algorithm used as a baseline for training. "PPO (Schulman et al., 2017)"
- probability ratio: The ratio of current-policy to old-policy likelihoods for a sampled rollout, used in clipped objectives. "is the probability ratio between the current and old policies."
- random seeds: Pseudorandom initializations that can significantly affect stochastic training and evaluation outcomes. "across ten different random seeds"
- reward variance: The variability of rewards across rollouts for a prompt; higher variance typically provides stronger learning signal. "selected to maximize reward variance"
- role drift: When a simulated user or agent deviates from its intended role (e.g., a user replying like an assistant). "role drift (assistant-like replies)"
- rollout generation: The sampling phase where the policy produces trajectories for prompts to collect rewards. "rollout gen- eration, where the policy samples tool-call trajectories for each prompt"
- serialization: The way dialogue history and reasoning are encoded into text for the model, affecting behavior. "how the interaction history is serialized."
- system prompt: The instruction block that conditions overall agent behavior and tool-use policy. "default BFCL system prompt"
- thinking history: The accumulated intermediate reasoning (e.g., think blocks) kept across turns in multi-turn interactions. "retaining thinking history"
- tool I/O: The inputs and outputs exchanged with external tools during agent operation. "including intermediate reasoning and tool I/O"
- tool schemas: Structured definitions of available tools (functions), including their names and argument formats. "Ti encodes the tool schemas."
- tool-calling: The capability of LLM agents to invoke external tools or APIs as part of solving tasks. "Tool-calling (or function calling) has become a cornerstone"
- trajectory: The ordered sequence of messages, actions, and observations forming a conversation or episode. "We denote the trajectory prefix"
- user simulator: An automated component that generates user messages to evaluate agents in a controlled way. "Claude 4 as the user simulator"
- VERL framework: An RL(RLHF) training framework used to implement the experiments in the paper. "implemented in the VERL framework"
- variance-aware rollout down-sampling: Updating the policy using a selected subset of rollouts chosen to maximize reward variance. "Variance-aware rollout down-sampling:"
- wall-clock time: Real elapsed time for training or evaluation, used to measure practical efficiency. "Wall-clock training time"
Collections
Sign up for free to add this paper to one or more collections.