Tool-Augmented Reward Dataset (TARA)
- TARA is a comprehensive corpus offering paired question–answer examples with detailed tool-invocation traces and scalar reward labels.
- It supports reward model training by integrating context-dependent tool use across domains like arithmetic, code execution, translation, and multi-step reasoning.
- The dataset features a well-defined JSON schema and a multi-agent annotation workflow that combines GPT-4 generation with human execution and verification of tool traces and reward labels.
The Tool-Augmented Reward dAtaset (TARA) is a large-scale corpus designed to provide a comprehensive benchmark for training and evaluating reward models (RMs) that are augmented with explicit tool use. Developed in the context of advancing reward modeling via external environment access, TARA enables supervised alignment of LLMs with human judgment across complex domains requiring arithmetic computation, code execution, factual lookup, and multi-step reasoning. Each instance in TARA features paired question–answer examples, stepwise tool-invocation traces, and scalar reward labels, supporting the development of models that can both decide when to engage external APIs and produce interpretable, reliable reward signals.
1. Dataset Structure and Scope
TARA consists of question–answer comparison pairs, each enriched with full, stepwise tool-invocation traces and scalar preference scores. The dataset is structured to include both standard (“vanilla”) and tool-augmented examples, thereby enabling reward models to learn context-dependent decisions regarding tool use.
The composition of TARA is as follows:
| Split | # of Comparisons | % |
|---|---|---|
| Train | 13,604 | ~90 |
| Test | 1,469 | ~10 |
No separate validation split is provided; users may reserve a portion of the training set for validation as required.
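One option is to carve a validation split out of the training portion. The snippet below is a minimal sketch using the Hugging Face datasets library; it assumes the training comparisons have been exported to a local JSON file (the filename "train.json" is illustrative, not prescribed by the release).

```python
# Sketch: reserving ~5% of TARA's training comparisons for validation.
# Assumes a local "train.json" file (illustrative name) holding the train split.
from datasets import load_dataset

dataset = load_dataset("json", data_files={"train": "train.json"})

# Hold out a small, reproducible validation portion.
splits = dataset["train"].train_test_split(test_size=0.05, seed=42)
train_set, valid_set = splits["train"], splits["test"]

print(len(train_set), len(valid_set))
```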
TARA covers seven single tools plus sequential multi-tool chains:
- Calculator
- Code Interpreter
- Translator (Baidu API)
- Google Search
- Weather API (weatherapi.com)
- Calendar (date calculations, day-of-week, add-days)
- WikiSearch (Wikipedia lookup)
- Multi-Tools: Sequential chains (e.g., Calendar + Weather)
Eight domains are addressed: arithmetic, code execution, translation, open- and closed-ended QA, knowledge lookup, date calculations, time-sensitive weather queries, and multi-step tool chains.
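For the Multi-Tools setting, a chained trace might look like the following sketch, written in the trace format described in Section 3. The two-step Calendar-then-Weather chain mirrors the example above; the specific values are illustrative, not drawn from the corpus.

```python
# Hypothetical Multi-Tools trace: Calendar resolves a relative date,
# Weather then queries the forecast for that date. Values are illustrative.
multi_tool_trace = [
    {
        "step": 1,
        "thought": "First resolve what date 'tomorrow' refers to.",
        "action": {"tool": "Calendar", "input": {"date": "2023-06-23", "n": 1}},
        "observation": "2023-06-24",
        "rationale": "Tomorrow is 2023-06-24.",
    },
    {
        "step": 2,
        "thought": "Now check the forecast for that date.",
        "action": {"tool": "Weather", "input": {"city": "New York", "date": "2023-06-24"}},
        "observation": "Sunny",
        "rationale": "The forecast for New York on 2023-06-24 is Sunny.",
    },
]
```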
2. Annotation Workflow and Quality Controls
The data generation for TARA leverages a multi-agent pipeline involving both automated (GPT-4) and human-in-the-loop stages:
- Negative Answer Generation: A GPT-4-based agent synthesizes a plausible but subtly mistaken “negative” answer given the question and a reference positive answer.
- Tool-Invocation Trace Construction: A second GPT-4-based agent injects “Thought” (internal reasoning) and “Action” (tool call) steps, formulating explicit tool-calling decisions and parameters.
- Human Execution: Human annotators execute each tool call as prescribed, capturing raw “Observation” outputs. This ensures accurate and deterministic API feedback.
- Rationale and Reward Synthesis: A GPT-4 agent crafts stepwise “Rationale” entries synthesizing observations and produces preliminary scalar rewards.
- Post-processing and Filtering: Examples are filtered to remove instances with invalid formatting, >3 tool invocations, missing function calls, or parsing errors. Negative answers are normalized to match positive answer style (spacing, punctuation).
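The filtering rules above can be expressed as a short validity check. The sketch below is a plausible reimplementation rather than the released pipeline; field names follow the schema shown in Section 3.

```python
import json

MAX_TOOL_CALLS = 3  # instances with more than three tool invocations are dropped

def is_valid_instance(raw: str) -> bool:
    """Apply TARA-style post-processing filters to one raw JSON record."""
    # Drop records that do not parse as JSON at all.
    try:
        instance = json.loads(raw)
    except json.JSONDecodeError:
        return False

    for side in ("positive", "negative"):
        trace = instance.get("traces", {}).get(side)
        if not trace:
            return False  # missing trace
        if len(trace) > MAX_TOOL_CALLS:
            return False  # too many tool invocations
        for step in trace:
            action = step.get("action", {})
            if "tool" not in action or "input" not in action:
                return False  # missing function call or arguments
    return True
```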
Inter-annotator agreement is not explicitly measured, as preference labels are synthetically determined (positive vs. negative answer). Human review emphasizes tool trace validity and consistency.
3. Data Schema and Formats
Each TARA instance is expressed in JSON, with the following schema:
```json
{
"id": "tara_xxxxx",
"question": "What is the weather like in New York on 2023-06-24?",
"answers": {
"positive": "The weather in New York on 2023-06-24 is Sunny.",
"negative": "The weather in New York on 2023-06-24 is Raining."
},
"traces": {
"positive": [
{
"step": 1,
"thought": "Should I call a weather API to verify?",
"action": {
"tool": "Weather",
"input": { "city": "New York", "date": "2023-06-24" }
},
"observation": "Sunny",
"rationale": "Observation confirms the forecast was Sunny, matching the answer."
}
],
"negative": [
{
"step": 1,
"thought": "Check weather before scoring.",
"action": {
"tool": "Weather",
"input": { "city": "New York", "date": "2023-06-24" }
},
"observation": "Sunny",
"rationale": "Observation contradicts 'Raining.' This answer is incorrect."
}
]
},
"reward_scores": {
"positive": 8.42,
"negative": -2.17
},
"label": "positive"
}
```
Field Descriptions:
- question: Natural-language prompt
- answers: The "positive" and "negative" answer strings
- traces: For each answer, a list of trace steps, each containing:
  - step: Step index
  - thought: The model's reasoning about whether to call a tool
  - action: Tool name and argument dict
  - observation: Tool output
  - rationale: Synthesized explanation integrating the tool outcome(s)
- reward_scores: Scalar scores for each answer and its tool trace
- label: Preferred answer ("positive" or "negative"); the positive answer is always preferred over the negative

Each comparison therefore carries an implicit preference label (positive over negative), enabling evaluation and supervised learning with pairwise ranking objectives.
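Because each instance bundles a positive and a negative answer, it can be flattened into two scored examples for pointwise training or kept paired for ranking. A minimal sketch, using the field names from the schema above:

```python
def split_comparison(instance: dict) -> list[dict]:
    """Flatten one TARA comparison into two labeled examples."""
    examples = []
    for side in ("positive", "negative"):
        examples.append({
            "id": f'{instance["id"]}_{side}',
            "question": instance["question"],
            "answer": instance["answers"][side],
            "trace": instance["traces"][side],
            "reward": instance["reward_scores"][side],
            "preferred": instance["label"] == side,
        })
    return examples
```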
4. Tool Integration Protocols
TARA supports explicit modeling of tool calls and their results via structured action-observation traces. Tools are specified via a textual protocol:
- Invocation format:
- “Action: <ToolName>”
- “Action Input: <JSON-style args>”
- Supported tools and APIs:
| Tool | Input Format | Return Type |
|---|---|---|
| Calculator | "expression" | Numeric or error |
| Code Interpreter | Python snippet & test cases | Pass/fail logs |
| Translator | {text, src_lang, tgt_lang} | Translated text |
| Google Search | Free-text query | Top snippet(s) |
| Weather | {city, date} | Textual weather report |
| Calendar | {date} / {date1, date2} / {date, n} | Day-of-week, date difference, date plus n days |
| WikiSearch | Query | Wikipedia first paragraph |
| Multi-Tools | Chained API calls | Sequence of outputs |
Observations are concatenated into the reasoning trace. The “Rationale” field synthesizes evidence from intermediate steps.
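A replay harness (or a tool-augmented reward model) has to map this textual protocol onto concrete API calls. The sketch below parses one "Action" / "Action Input" pair and dispatches it to a stub backend; the regular expressions and the stub implementation are illustrative assumptions, not part of the released protocol.

```python
import json
import re

# Stub tool backend; a real implementation would call the respective API
# (weatherapi.com, Baidu Translate, etc.).
def weather_stub(args: dict) -> str:
    return f"Sunny (stub for {args.get('city')} on {args.get('date')})"

TOOLS = {"Weather": weather_stub}

def run_action(trace_text: str) -> str:
    """Parse one 'Action:' / 'Action Input:' block and execute the tool."""
    tool_match = re.search(r"Action:\s*(\w+)", trace_text)
    input_match = re.search(r"Action Input:\s*(\{.*\})", trace_text, re.DOTALL)
    if not tool_match or not input_match:
        raise ValueError("Malformed action block")
    tool = TOOLS[tool_match.group(1)]
    args = json.loads(input_match.group(1))
    return tool(args)  # returned string becomes the 'Observation' in the trace

# Example:
obs = run_action('Action: Weather\nAction Input: {"city": "New York", "date": "2023-06-24"}')
```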
5. Reward Model Training Objectives
The dataset facilitates training of reward models that jointly perform preference grading and autoregressive tool-trace supervision. The central definitions and loss formulations are:
- Scalar Reward Function: $r_\theta(x, y)$, the scalar score assigned to a question $x$ and a candidate answer $y$ together with its tool trace.
- Pairwise Preference Probability:
  $$P(y^{+} \succ y^{-} \mid x) = \sigma\!\left(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\right),$$
  where $\sigma$ denotes the sigmoid function.
- Pairwise Ranking Loss ($\mathcal{L}_{\text{rank}}$):
  $$\mathcal{L}_{\text{rank}} = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-}) \sim \mathcal{D}}\!\left[\log \sigma\!\left(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\right)\right]$$
- Autoregressive Supervision for Tool Use and Reasoning (token-level cross-entropy over the corresponding trace segments):
  - Tool prediction: $\mathcal{L}_{\text{tool}} = -\sum_{t} \log p_\theta(a_t \mid a_{<t}, x, y)$ over the Thought/Action tokens $a$
  - Observation prediction: $\mathcal{L}_{\text{obs}} = -\sum_{t} \log p_\theta(o_t \mid o_{<t}, a, x, y)$ over the Observation tokens $o$
  - Rationale prediction: $\mathcal{L}_{\text{rat}} = -\sum_{t} \log p_\theta(z_t \mid z_{<t}, o, a, x, y)$ over the Rationale tokens $z$
- Total Training Objective:
  $$\mathcal{L} = \mathcal{L}_{\text{rank}} + \lambda_{\text{tool}}\,\mathcal{L}_{\text{tool}} + \lambda_{\text{obs}}\,\mathcal{L}_{\text{obs}} + \lambda_{\text{rat}}\,\mathcal{L}_{\text{rat}}$$
In the TARA experiments, the weighting coefficients $\lambda_{\text{tool}}$, $\lambda_{\text{obs}}$, and $\lambda_{\text{rat}}$ are fixed hyperparameters; setting $\lambda_{\text{obs}}$ or $\lambda_{\text{rat}}$ to zero ablates the Observation or Rationale terms, and setting all three to zero reduces the formulation to standard, vanilla reward modeling without tool use.
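The combined objective can be sketched in PyTorch as below. This is a hedged illustration, not the released training code: it assumes a backbone that returns scalar rewards for the two answers and token-level logits over the trace segments, and all tensor and parameter names are made up for the example.

```python
import torch
import torch.nn.functional as F

def combined_loss(r_pos, r_neg, tool_logits, tool_targets,
                  obs_logits, obs_targets, rat_logits, rat_targets,
                  lam_tool=1.0, lam_obs=1.0, lam_rat=1.0):
    """Pairwise ranking loss plus weighted autoregressive trace supervision."""
    # L_rank = -log sigmoid(r(x, y+) - r(x, y-)), averaged over the batch.
    rank_loss = -F.logsigmoid(r_pos - r_neg).mean()

    # Token-level cross-entropy over Thought/Action, Observation, Rationale spans;
    # padding positions are marked with -100 and ignored.
    def lm_loss(logits, targets):
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1), ignore_index=-100)

    return (rank_loss
            + lam_tool * lm_loss(tool_logits, tool_targets)
            + lam_obs * lm_loss(obs_logits, obs_targets)
            + lam_rat * lm_loss(rat_logits, rat_targets))
```

Setting `lam_obs = lam_rat = 0.0` corresponds to the Observation/Rationale ablations described above; setting all three weights to zero leaves only the vanilla ranking loss.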
6. Evaluation Methodologies and Empirical Results
TARA supports multiple axes of evaluation for trained reward models:
- Preference Ranking Accuracy: Proportion of comparison pairs for which $r_\theta(x, y^{+}) > r_\theta(x, y^{-})$ (see the sketch after this list)
- Domain-wise Accuracy: Performance within each tool category and micro-averaged across all tasks
- External Probing: Downstream evaluation via zero-shot MC accuracy on TruthfulQA, HH-RLHF, and Retarded-bar benchmarks
- RLHF Performance: Win/tie/lose rates in human preference evaluations across four domains; PPO fine-tuning perplexity comparisons
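Preference ranking accuracy is straightforward to compute once both answers of each comparison have been scored. A minimal sketch (the pair format is an assumption for illustration):

```python
def ranking_accuracy(pairs) -> float:
    """Fraction of comparison pairs where the positive answer outscores the negative.

    `pairs` is an iterable of (reward_positive, reward_negative) floats,
    e.g. produced by scoring both answers of each TARA test comparison.
    """
    pairs = list(pairs)
    correct = sum(1 for r_pos, r_neg in pairs if r_pos > r_neg)
    return correct / len(pairs) if pairs else 0.0
```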
Baselines and ablations are systematically reported:
- RM(Bert-Large)
- RM(Vicuna-7B)
- Themis(Vicuna-7B)
- Scale-up experiments with LoRA and Vicuna-33B
- Mixed-tool vs. single-tool data splits
- Absence of Observation or Rationale components
Key empirical results:
- Themis(Vicuna-7B) achieves 94% accuracy on TARA in the mixed-tool setting, compared to 75% for RM(Vicuna-7B); averaged across preference ranking tasks, the reported absolute gain is 17.7%.
- On TruthfulQA (zero-shot), Themis improves by +7.3% over Gopher 280B.
- RLHF with Themis yields an average win rate 32% higher in human preference evaluations than policies trained with vanilla RMs.
7. Release, Expansion, and Usage Guidelines
The TARA dataset—including code, models, and Hugging Face dataset loading scripts—is publicly available at https://github.com/ernie-research/Tool-Augmented-Reward-Model. Each JSON instance corresponds to one comparison pair and can be decomposed into separate labeled examples for reward model training.
Typical usage involves:
- Loading TARA via provided scripts
- Reproducing RM and RLHF pipelines using the released models and code
- Extending the corpus, e.g., by adding new APIs or introducing multi-turn dialogue, with adherence to the existing schema
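When extending the corpus with a new API, new instances should mirror the existing schema so that the released loading and training code keeps working. The example below constructs a schema-compliant record for a made-up "Stock" tool; the tool name, fields, and values are hypothetical and exist only to illustrate the format.

```python
# Hypothetical schema-compliant instance for a made-up "Stock" tool.
def make_step(rationale: str) -> dict:
    return {
        "step": 1,
        "thought": "A stock-price API can verify this figure.",
        "action": {"tool": "Stock", "input": {"symbol": "ACME", "date": "2023-06-23"}},
        "observation": "101.5",
        "rationale": rationale,
    }

new_instance = {
    "id": "tara_ext_00001",
    "question": "What was the closing price of ACME stock on 2023-06-23?",
    "answers": {
        "positive": "ACME closed at 101.5 on 2023-06-23.",
        "negative": "ACME closed at 87.2 on 2023-06-23.",
    },
    "traces": {
        "positive": [make_step("The observation matches the stated closing price.")],
        "negative": [make_step("The observation contradicts the stated closing price.")],
    },
    "reward_scores": {"positive": None, "negative": None},  # assigned during reward synthesis
    "label": "positive",
}
```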
A plausible implication is that TARA can serve as a foundation for further work in tool-integrated RLHF research, particularly in integrating new, real-world APIs and exploring multi-modal or multi-turn tool engagement. The design principles and annotation methodology provide a template for the generation and expansion of benchmark datasets supporting interpretable, tool-driven alignment across advanced LLM applications.