Tool-Augmented Reward Dataset (TARA)
- TARA is a comprehensive corpus offering paired question–answer examples with detailed tool-invocation traces and scalar reward labels.
- It supports reward model training by integrating context-dependent tool use across domains like arithmetic, code execution, translation, and multi-step reasoning.
- The dataset features a well-defined JSON schema and a multi-agent annotation workflow that combines GPT-4 generation with human execution and verification of tool traces and reward labels.
The Tool-Augmented Reward dAtaset (TARA) is a large-scale corpus designed to provide a comprehensive benchmark for training and evaluating reward models (RMs) that are augmented with explicit tool use. Developed in the context of advancing reward modeling via external environment access, TARA enables supervised alignment of LLMs with human judgment across complex domains requiring arithmetic computation, code execution, factual lookup, and multi-step reasoning. Each instance in TARA features paired question–answer examples, stepwise tool-invocation traces, and scalar reward labels, supporting the development of models that can both decide when to engage external APIs and produce interpretable, reliable reward signals.
1. Dataset Structure and Scope
TARA consists of question–answer comparison pairs, each enriched with full, stepwise tool-invocation traces and scalar preference scores. The dataset is structured to include both standard (“vanilla”) and tool-augmented examples, thereby enabling reward models to learn context-dependent decisions regarding tool use.
The composition of TARA is as follows:
| Split | # of Comparisons | % |
|---|---|---|
| Train | 13,604 | ~90 |
| Test | 1,469 | ~10 |
No separate validation split is provided; users may reserve a portion of the training set for validation as required.
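One option is to carve a validation split out of the training portion. The snippet below is a minimal sketch using the Hugging Face datasets library; it assumes the training comparisons have been exported to a local JSON file (the filename "train.json" is illustrative, not prescribed by the release).

```python
# Sketch: reserving ~5% of TARA's training comparisons for validation.
# Assumes a local "train.json" file (illustrative name) holding the train split.
from datasets import load_dataset

dataset = load_dataset("json", data_files={"train": "train.json"})

# Hold out a small, reproducible validation portion.
splits = dataset["train"].train_test_split(test_size=0.05, seed=42)
train_set, valid_set = splits["train"], splits["test"]

print(len(train_set), len(valid_set))
```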
TARA covers seven single tools plus sequential multi-tool chains:
- Calculator
- Code Interpreter
- Translator (Baidu API)
- Google Search
- Weather API (weatherapi.com)
- Calendar (date calculations, day-of-week, add-days)
- WikiSearch (Wikipedia lookup)
- Multi-Tools: Sequential chains (e.g., Calendar + Weather)
Eight domains are addressed: arithmetic, code execution, translation, open- and closed-ended QA, knowledge lookup, date calculations, time-sensitive weather queries, and multi-step tool chains.
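For the Multi-Tools setting, a chained trace might look like the following sketch, written in the trace format described in Section 3. The two-step Calendar-then-Weather chain mirrors the example above; the specific values are illustrative, not drawn from the corpus.

```python
# Hypothetical Multi-Tools trace: Calendar resolves a relative date,
# Weather then queries the forecast for that date. Values are illustrative.
multi_tool_trace = [
    {
        "step": 1,
        "thought": "First resolve what date 'tomorrow' refers to.",
        "action": {"tool": "Calendar", "input": {"date": "2023-06-23", "n": 1}},
        "observation": "2023-06-24",
        "rationale": "Tomorrow is 2023-06-24.",
    },
    {
        "step": 2,
        "thought": "Now check the forecast for that date.",
        "action": {"tool": "Weather", "input": {"city": "New York", "date": "2023-06-24"}},
        "observation": "Sunny",
        "rationale": "The forecast for New York on 2023-06-24 is Sunny.",
    },
]
```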
2. Annotation Workflow and Quality Controls
The data generation for TARA leverages a multi-agent pipeline involving both automated (GPT-4) and human-in-the-loop stages:
- Negative Answer Generation: A GPT-4-based agent synthesizes a plausible but subtly mistaken “negative” answer given the question and a reference positive answer.
- Tool-Invocation Trace Construction: A second GPT-4-based agent injects “Thought” (internal reasoning) and “Action” (tool call) steps, formulating explicit tool-calling decisions and parameters.
- Human Execution: Human annotators execute each tool call as prescribed, capturing raw “Observation” outputs. This ensures accurate and deterministic API feedback.
- Rationale and Reward Synthesis: A GPT-4 agent crafts stepwise “Rationale” entries synthesizing observations and produces preliminary scalar rewards.
- Post-processing and Filtering: Examples are filtered to remove instances with invalid formatting, >3 tool invocations, missing function calls, or parsing errors. Negative answers are normalized to match positive answer style (spacing, punctuation).
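The filtering rules above can be expressed as a short validity check. The sketch below is a plausible reimplementation rather than the released pipeline; field names follow the schema shown in Section 3.

```python
import json

MAX_TOOL_CALLS = 3  # instances with more than three tool invocations are dropped

def is_valid_instance(raw: str) -> bool:
    """Apply TARA-style post-processing filters to one raw JSON record."""
    # Drop records that do not parse as JSON at all.
    try:
        instance = json.loads(raw)
    except json.JSONDecodeError:
        return False

    for side in ("positive", "negative"):
        trace = instance.get("traces", {}).get(side)
        if not trace:
            return False  # missing trace
        if len(trace) > MAX_TOOL_CALLS:
            return False  # too many tool invocations
        for step in trace:
            action = step.get("action", {})
            if "tool" not in action or "input" not in action:
                return False  # missing function call or arguments
    return True
```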
Inter-annotator agreement is not explicitly measured, as preference labels are synthetically determined (positive vs. negative answer). Human review emphasizes tool trace validity and consistency.
3. Data Schema and Formats
Each TARA instance is expressed in JSON, with the following schema:
```json
{
"id": "tara_xxxxx",
"question": "What is the weather like in New York on 2023-06-24?",
"answers": {
"positive": "The weather in New York on 2023-06-24 is Sunny.",
"negative": "The weather in New York on 2023-06-24 is Raining."
},
"traces": {
"positive": [
{
"step": 1,
"thought": "Should I call a weather API to verify?",
"action": {
"tool": "Weather",
"input": { "city": "New York", "date": "2023-06-24" }
},
"observation": "Sunny",
"rationale": "Observation confirms the forecast was Sunny, matching the answer."
}
],
"negative": [
{
"step": 1,
"thought": "Check weather before scoring.",
"action": {
"tool": "Weather",
"input": { "city": "New York", "date": "2023-06-24" }
},
"observation": "Sunny",
"rationale": "Observation contradicts 'Raining.' This answer is incorrect."
}
]
},
"reward_scores": {
"positive": 8.42,
"negative": -2.17
},
"label": "positive"
}
```
Field Descriptions:
- question: Natural-language prompt
- answers: The "positive" and "negative" answer strings
- traces: For each answer, a list of trace steps, each containing:
  - step: Step index
  - thought: The model's reasoning about whether to call a tool
  - action: Tool name and argument dict
  - observation: Tool output
  - rationale: Synthesized explanation integrating the tool outcome(s)
- reward_scores: Scalar scores for each answer and its tool trace
- label: Preferred answer ("positive" or "negative"); the positive answer is always preferred over the negative

Each comparison therefore carries an implicit preference label (positive over negative), enabling evaluation and supervised learning with pairwise ranking objectives.
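Because each instance bundles a positive and a negative answer, it can be flattened into two scored examples for pointwise training or kept paired for ranking. A minimal sketch, using the field names from the schema above:

```python
def split_comparison(instance: dict) -> list[dict]:
    """Flatten one TARA comparison into two labeled examples."""
    examples = []
    for side in ("positive", "negative"):
        examples.append({
            "id": f'{instance["id"]}_{side}',
            "question": instance["question"],
            "answer": instance["answers"][side],
            "trace": instance["traces"][side],
            "reward": instance["reward_scores"][side],
            "preferred": instance["label"] == side,
        })
    return examples
```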
4. Tool Integration Protocols
TARA supports explicit modeling of tool calls and their results via structured action-observation traces. Tools are specified via a textual protocol:
- Invocation format:
- “Action: <ToolName>”
- “Action Input: <JSON-style args>”
- Supported tools and APIs:
| Tool | Input Format | Return Type |
|---|---|---|
| Calculator | "expression" | Numeric or error |
| Code Interpreter | Python snippet & test cases | Pass/fail logs |
| Translator | {text, src_lang, tgt_lang} | Translated text |
| Google Search | Free-text query | Top snippet(s) |
| Weather | {city, date} | Textual weather report |
| Calendar | {date} / {date1, date2} / {date, n} | Day-of-week, date difference, date plus n days |
| WikiSearch | Query | Wikipedia first paragraph |
| Multi-Tools | Chained API calls | Sequence of outputs |
Observations are concatenated into the reasoning trace. The “Rationale” field synthesizes evidence from intermediate steps.
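A replay harness (or a tool-augmented reward model) has to map this textual protocol onto concrete API calls. The sketch below parses one "Action" / "Action Input" pair and dispatches it to a stub backend; the regular expressions and the stub implementation are illustrative assumptions, not part of the released protocol.

```python
import json
import re

# Stub tool backend; a real implementation would call the respective API
# (weatherapi.com, Baidu Translate, etc.).
def weather_stub(args: dict) -> str:
    return f"Sunny (stub for {args.get('city')} on {args.get('date')})"

TOOLS = {"Weather": weather_stub}

def run_action(trace_text: str) -> str:
    """Parse one 'Action:' / 'Action Input:' block and execute the tool."""
    tool_match = re.search(r"Action:\s*(\w+)", trace_text)
    input_match = re.search(r"Action Input:\s*(\{.*\})", trace_text, re.DOTALL)
    if not tool_match or not input_match:
        raise ValueError("Malformed action block")
    tool = TOOLS[tool_match.group(1)]
    args = json.loads(input_match.group(1))
    return tool(args)  # returned string becomes the 'Observation' in the trace

# Example:
obs = run_action('Action: Weather\nAction Input: {"city": "New York", "date": "2023-06-24"}')
```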
5. Reward Model Training Objectives
The dataset facilitates training of reward models that jointly perform preference grading and autoregressive tool-trace supervision. The central definitions and loss formulations are:
- Scalar Reward Function: $r_\theta(x, y)$, the scalar score assigned to a question $x$ and a candidate answer $y$ together with its tool trace.
- Pairwise Preference Probability:
  $$P(y^{+} \succ y^{-} \mid x) = \sigma\!\left(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\right),$$
  where $\sigma$ denotes the sigmoid function.
- Pairwise Ranking Loss ($\mathcal{L}_{\text{rank}}$):
  $$\mathcal{L}_{\text{rank}} = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-}) \sim \mathcal{D}}\!\left[\log \sigma\!\left(r_\theta(x, y^{+}) - r_\theta(x, y^{-})\right)\right]$$
- Autoregressive Supervision for Tool Use and Reasoning (token-level cross-entropy over the corresponding trace segments):
  - Tool prediction: $\mathcal{L}_{\text{tool}} = -\sum_{t} \log p_\theta(a_t \mid a_{<t}, x, y)$ over the Thought/Action tokens $a$
  - Observation prediction: $\mathcal{L}_{\text{obs}} = -\sum_{t} \log p_\theta(o_t \mid o_{<t}, a, x, y)$ over the Observation tokens $o$
  - Rationale prediction: $\mathcal{L}_{\text{rat}} = -\sum_{t} \log p_\theta(z_t \mid z_{<t}, o, a, x, y)$ over the Rationale tokens $z$
- Total Training Objective:
  $$\mathcal{L} = \mathcal{L}_{\text{rank}} + \lambda_{\text{tool}}\,\mathcal{L}_{\text{tool}} + \lambda_{\text{obs}}\,\mathcal{L}_{\text{obs}} + \lambda_{\text{rat}}\,\mathcal{L}_{\text{rat}}$$
In the TARA experiments, the weighting coefficients $\lambda_{\text{tool}}$, $\lambda_{\text{obs}}$, and $\lambda_{\text{rat}}$ are fixed hyperparameters; setting $\lambda_{\text{obs}}$ or $\lambda_{\text{rat}}$ to zero ablates the Observation or Rationale terms, and setting all three to zero reduces the formulation to standard, vanilla reward modeling without tool use.
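The combined objective can be sketched in PyTorch as below. This is a hedged illustration, not the released training code: it assumes a backbone that returns scalar rewards for the two answers and token-level logits over the trace segments, and all tensor and parameter names are made up for the example.

```python
import torch
import torch.nn.functional as F

def combined_loss(r_pos, r_neg, tool_logits, tool_targets,
                  obs_logits, obs_targets, rat_logits, rat_targets,
                  lam_tool=1.0, lam_obs=1.0, lam_rat=1.0):
    """Pairwise ranking loss plus weighted autoregressive trace supervision."""
    # L_rank = -log sigmoid(r(x, y+) - r(x, y-)), averaged over the batch.
    rank_loss = -F.logsigmoid(r_pos - r_neg).mean()

    # Token-level cross-entropy over Thought/Action, Observation, Rationale spans;
    # padding positions are marked with -100 and ignored.
    def lm_loss(logits, targets):
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1), ignore_index=-100)

    return (rank_loss
            + lam_tool * lm_loss(tool_logits, tool_targets)
            + lam_obs * lm_loss(obs_logits, obs_targets)
            + lam_rat * lm_loss(rat_logits, rat_targets))
```

Setting `lam_obs = lam_rat = 0.0` corresponds to the Observation/Rationale ablations described above; setting all three weights to zero leaves only the vanilla ranking loss.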
6. Evaluation Methodologies and Empirical Results
TARA supports multiple axes of evaluation for trained reward models:
- Preference Ranking Accuracy: Proportion of comparison pairs for which $r_\theta(x, y^{+}) > r_\theta(x, y^{-})$ (see the sketch after this list)
- Domain-wise Accuracy: Performance within each tool category and micro-averaged across all tasks
- External Probing: Downstream evaluation via zero-shot MC accuracy on TruthfulQA, HH-RLHF, and Retarded-bar benchmarks
- RLHF Performance: Win/tie/lose rates in human preference evaluations across four domains; PPO fine-tuning perplexity comparisons
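Preference ranking accuracy is straightforward to compute once both answers of each comparison have been scored. A minimal sketch (the pair format is an assumption for illustration):

```python
def ranking_accuracy(pairs) -> float:
    """Fraction of comparison pairs where the positive answer outscores the negative.

    `pairs` is an iterable of (reward_positive, reward_negative) floats,
    e.g. produced by scoring both answers of each TARA test comparison.
    """
    pairs = list(pairs)
    correct = sum(1 for r_pos, r_neg in pairs if r_pos > r_neg)
    return correct / len(pairs) if pairs else 0.0
```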
Baselines and ablations are systematically reported:
- RM(Bert-Large)
- RM(Vicuna-7B)
- Themis(Vicuna-7B)
- Scale-up experiments with LoRA and Vicuna-33B
- Mixed-tool vs. single-tool data splits
- Absence of Observation or Rationale components
Key empirical results:
- Themis(Vicuna-7B) achieves 94% accuracy on TARA in the mixed-tool setting, compared to 75% for RM(Vicuna-7B); averaged across preference ranking tasks, the reported absolute gain is 17.7%.
- On TruthfulQA (zero-shot), Themis improves by +7.3% over Gopher 280B.
- RLHF with Themis yields an average win rate 32% higher in human preference evaluations than policies trained with vanilla RMs.
7. Release, Expansion, and Usage Guidelines
The TARA dataset—including code, models, and Hugging Face dataset loading scripts—is publicly available at https://github.com/ernie-research/Tool-Augmented-Reward-Model. Each JSON instance corresponds to one comparison pair and can be decomposed into separate labeled examples for reward model training.
Typical usage involves:
- Loading TARA via provided scripts
- Reproducing RM and RLHF pipelines using the released models and code
- Extending the corpus, e.g., by adding new APIs or introducing multi-turn dialogue, with adherence to the existing schema
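When extending the corpus with a new API, new instances should mirror the existing schema so that the released loading and training code keeps working. The example below constructs a schema-compliant record for a made-up "Stock" tool; the tool name, fields, and values are hypothetical and exist only to illustrate the format.

```python
# Hypothetical schema-compliant instance for a made-up "Stock" tool.
def make_step(rationale: str) -> dict:
    return {
        "step": 1,
        "thought": "A stock-price API can verify this figure.",
        "action": {"tool": "Stock", "input": {"symbol": "ACME", "date": "2023-06-23"}},
        "observation": "101.5",
        "rationale": rationale,
    }

new_instance = {
    "id": "tara_ext_00001",
    "question": "What was the closing price of ACME stock on 2023-06-23?",
    "answers": {
        "positive": "ACME closed at 101.5 on 2023-06-23.",
        "negative": "ACME closed at 87.2 on 2023-06-23.",
    },
    "traces": {
        "positive": [make_step("The observation matches the stated closing price.")],
        "negative": [make_step("The observation contradicts the stated closing price.")],
    },
    "reward_scores": {"positive": None, "negative": None},  # assigned during reward synthesis
    "label": "positive",
}
```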
A plausible implication is that TARA can serve as a foundation for further work in tool-integrated RLHF research, particularly in integrating new, real-world APIs and exploring multi-modal or multi-turn tool engagement. The design principles and annotation methodology provide a template for the generation and expansion of benchmark datasets supporting interpretable, tool-driven alignment across advanced LLM applications.