TRACE Benchmark Datasets

Updated 16 November 2025
  • TRACE Benchmark Datasets is a benchmark suite that rigorously evaluates continual learning in LLMs by testing domain-specific knowledge, multilingual processing, code generation, and mathematical reasoning.
  • It employs a unified JSON formatting and automatic evaluation protocol, ensuring balanced and reproducible assessment across eight diverse tasks.
  • The benchmark quantifies performance trade-offs and catastrophic forgetting while demonstrating the benefits of reasoning-augmented continual learning in large language models.

TRACE (Task Robustness and Adaptation via Continual Evaluation) is a benchmark suite designed to rigorously evaluate continual learning in aligned LLMs. Existing continual learning benchmarks are considered insufficiently challenging for modern aligned LLMs due to their simplicity and the possibility of prior model exposure during instruction tuning. TRACE addresses this gap through a collection of eight diverse datasets, each testing distinct, high-difficulty competencies: domain-specific knowledge, multilingual processing, code generation, and mathematical reasoning. Crucially, TRACE introduces a unified data formatting and automatic evaluation protocol, enabling comprehensive and reproducible assessment of catastrophic forgetting and adaptation in LLMs during sequential fine-tuning.

1. Composition and Structure of TRACE

TRACE consists of eight datasets, each contributing 5,000 training and 2,000 test examples, for an aggregate of 40,000 training and 16,000 test instances. These datasets cover the following domains and tasks:

| Dataset | Domain/Task | Evaluation Metric |
|---|---|---|
| ScienceQA | Multi-hop science QA | Accuracy |
| FOMC | Monetary stance (finance) | Accuracy |
| MeetingBank | Long-context summarization | ROUGE-L |
| C-STANCE | Chinese stance detection | Accuracy |
| 20Minuten | German text simplification | SARI |
| Py150 | Python code completion | Edit-distance sim. |
| NumGLUE-cm | Arithmetic reasoning | EM/Accuracy |
| NumGLUE-ds | Discrete subtraction | EM/Accuracy |

The data for each task is standardized into a three-field JSON schema:

```
{"instruction": ..., "input": ..., "output": ...}
```

This enables all datasets to be fed into a prompting engine using the template:

```
<instruction>\n<input>
```

with the model expected to produce the <output>.
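
For illustration, a small loader that applies this template might look like the following sketch; the record contents are hypothetical and the field names are taken from the schema above:

```python
def build_prompt(example: dict) -> str:
    """Render a TRACE-style record into the <instruction>\n<input> template."""
    prompt = example["instruction"]
    if example.get("input"):            # some tasks may leave "input" empty
        prompt += "\n" + example["input"]
    return prompt

# Hypothetical record; real TRACE data would be read from the released JSON files.
example = {
    "instruction": "Answer the multiple-choice science question.",
    "input": "Which gas do plants absorb during photosynthesis? (A) O2 (B) CO2 (C) N2 (D) H2",
    "output": "B",
}
print(build_prompt(example))   # the model is expected to generate example["output"]
```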

Domain-Specific Tasks

  • ScienceQA involves 4-way multiple-choice questions requiring multi-hop reasoning, with an optional chain-of-thought (CoT) rationale included in the answer.
  • FOMC targets financial policy classification from central bank transcripts, reformulated into three labels (A: dovish, B: hawkish, C: neutral).
  • MeetingBank provides city-council meeting transcripts (mean 2,853 words) for abstractive summarization.

Multilingual Tasks

  • C-STANCE addresses stance detection in Chinese social media, following a target-based subtask with labels and prompts in Chinese.
  • 20Minuten comprises German news articles requiring document-level text simplification.

Code Completion

  • Py150 presents Python files for next-line code completion, scored by edit-distance similarity.

Mathematical Reasoning

  • NumGLUE-cm: Arithmetic, coin-math-type problems (with rationales generated via GPT-4).
  • NumGLUE-ds: Discrete subtraction exercises mirroring school mathematics.

For all datasets, preprocessing enforces label balance where appropriate, removes multi-modal content, and ensures consistency (e.g., truncation for long texts).

2. Unified Evaluation Protocols

TRACE employs a single, automatic evaluation framework. Each task is scored with an established metric:

  • Accuracy: ScienceQA, FOMC, C-STANCE, NumGLUE-cm, NumGLUE-ds
  • ROUGE-L: MeetingBank
  • SARI: 20Minuten
  • Edit-distance similarity: Py150

All metrics reside in a common schema, yielding an aggregated TRACE score:

$$\mathrm{TRACE} = \sum_{i=1}^{8} w_i \cdot \text{Metric}_i$$

with $w_i = 1/8$ for simple averaging. This produces:

$$\mathrm{TRACE}_{\mathrm{avg}} = \frac{1}{8} \sum_{i=1}^{8} \text{Metric}_i$$
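
A minimal sketch of this equal-weight aggregation (the per-task scores below are placeholders, not reported results):

```python
# Equal-weight aggregation of the eight per-task metrics into TRACE_avg.
task_scores = {                     # placeholder values for illustration only
    "ScienceQA": 0.62, "FOMC": 0.58, "MeetingBank": 0.31, "C-STANCE": 0.55,
    "20Minuten": 0.40, "Py150": 0.52, "NumGLUE-cm": 0.35, "NumGLUE-ds": 0.38,
}
weights = {name: 1 / len(task_scores) for name in task_scores}   # w_i = 1/8
trace_avg = sum(weights[name] * score for name, score in task_scores.items())
print(f"TRACE_avg = {trace_avg:.3f}")
```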

In continual-learning settings, two principal metrics are used for each sequential stage $t$:

  • Overall Performance after $t$ tasks ($OP_t$):

$$OP_t = \frac{1}{t} \sum_{i=1}^{t} R_{t,i}^D$$

  • Backward Transfer ($\mathrm{BWT}_t$):

$$\mathrm{BWT}_t = \frac{1}{t} \sum_{i=1}^{t-1} \left( R_{t,i}^D - R_{i,i}^D \right)$$

where $R_{t,i}^D$ is the model's performance on task $i$ after training on task $t$.
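
A minimal sketch of these two metrics, assuming a score matrix R where R[t][i] holds performance on task i after training through task t (the values below are illustrative):

```python
def overall_performance(R, t):
    """OP_t: mean score over tasks 1..t after training on task t (t is 1-indexed)."""
    return sum(R[t - 1][i] for i in range(t)) / t

def backward_transfer(R, t):
    """BWT_t: mean change on previously learned tasks, using the 1/t normalization above."""
    if t < 2:
        return 0.0
    return sum(R[t - 1][i] - R[i][i] for i in range(t - 1)) / t

# Illustrative 3-task score matrix: R[t][i] = score on task i after training on task t.
R = [
    [0.70, 0.00, 0.00],
    [0.55, 0.68, 0.00],
    [0.40, 0.60, 0.72],
]
print(overall_performance(R, 3))   # (0.40 + 0.60 + 0.72) / 3
print(backward_transfer(R, 3))     # ((0.40 - 0.70) + (0.60 - 0.68)) / 3
```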

To measure the preservation (or degradation) of original model capabilities post-fine-tuning, TRACE introduces delta measures:

  • $\Delta R_t^G$: Change in general ability (e.g., MMLU, BBH)
  • $\Delta R_t^I$: Change in instruction following (e.g., Self-Instruct, LIMA)
  • $\Delta R_t^S$: Change in safety (e.g., CoNa)

All delta measures are averages over a corresponding benchmark set and compare post-training performance to the initial baseline.
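
A small sketch of how such a delta could be computed (benchmark names follow the examples above; the scores are placeholders):

```python
def delta_measure(post_scores: dict, baseline_scores: dict) -> float:
    """Average post-fine-tuning score minus average baseline score over a benchmark set."""
    benchmarks = baseline_scores.keys()
    post = sum(post_scores[b] for b in benchmarks) / len(baseline_scores)
    base = sum(baseline_scores[b] for b in benchmarks) / len(baseline_scores)
    return post - base

# Illustrative values for the general-ability set (Delta R_t^G over MMLU and BBH).
baseline = {"MMLU": 0.54, "BBH": 0.39}
after_t  = {"MMLU": 0.45, "BBH": 0.31}
print(delta_measure(after_t, baseline))   # negative value indicates forgetting
```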

3. Catastrophic Forgetting Phenomena

Sequential fine-tuning of aligned LLMs on TRACE datasets induces pronounced catastrophic forgetting. For instance, after exposure to all eight tasks, LLaMA-2-13B-Chat's arithmetic accuracy on GSM8K-like tasks dropped from approximately 28.8% to about 2%. Simultaneously, $\Delta R_t^G$ (general ability) is strongly negative (up to –13 points on larger models), and instruction following, $\Delta R_t^I$, declines by 5–15 points. Safety, as measured by $\Delta R_t^S$, remains close to baseline. Incremental training reveals a trade-off between gaining task-specific performance and losing general capabilities: while $OP_t$ rises, $\Delta R_t^G$ decreases following an inverse-power law,

$$\Delta R^G \simeq \alpha \cdot (\text{epoch})^{-\beta}, \quad \beta \approx 0.3$$
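
For illustration, α and β in such a relation can be estimated by ordinary least squares in log-log space; the data points below are synthetic and chosen only to make the fit concrete:

```python
import numpy as np

# Synthetic |ΔR^G| magnitudes across training epochs, used only to illustrate the fit.
epochs = np.array([1, 2, 4, 8, 16], dtype=float)
delta_rg = np.array([10.0, 8.1, 6.7, 5.4, 4.4])

# |ΔR^G| ≈ α · epoch^(−β)  ⇒  log|ΔR^G| ≈ log α − β · log(epoch)
slope, intercept = np.polyfit(np.log(epochs), np.log(delta_rg), deg=1)
alpha, beta = np.exp(intercept), -slope
print(f"alpha ≈ {alpha:.2f}, beta ≈ {beta:.2f}")   # beta ≈ 0.3 for this synthetic series
```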

Tasks with explicit step-wise reasoning (e.g., ScienceQA containing reasoning paths) sometimes improve related reasoning benchmarks, whereas purely numeric tasks often accelerate forgetting.

A plausible implication is that model exposure to reasoning structures preserves or strengthens transferable cognitive patterns, while rote or answer-only tasks hasten the decay of general reasoning abilities.

4. Reasoning-Augmented Continual Learning (RCL)

Reasoning-augmented Continual Learning (RCL) was introduced based on the empirical finding that tasks with explicit reasoning paths mitigate capability collapse. RCL modifies each training example by appending two components, illustrated in the sketch after this list:

  • A task-specific cue (the final answer)
  • A meta-rationale (a stepwise CoT paragraph, generated in advance by GPT-4)
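
For illustration, an augmented record might reuse the three-field schema from Section 1 and fold the rationale into the output field; the content below is hypothetical:

```python
# Hypothetical RCL-style record: the output carries a stepwise rationale
# followed by the task-specific cue (the final answer).
rcl_example = {
    "instruction": "Answer the multiple-choice science question. Reason step by step, then give the answer.",
    "input": "Which gas do plants absorb during photosynthesis? (A) O2 (B) CO2 (C) N2 (D) H2",
    "output": (
        "Photosynthesis converts carbon dioxide and water into glucose and oxygen, "
        "so the gas absorbed is carbon dioxide. Answer: B"
    ),
}
```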

During training, the loss function is augmented:

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}}(\hat{y}, y) + \lambda \, \mathcal{L}_{\mathrm{rationale}}(\hat{r}, r)$$

where

  • $\mathcal{L}_{\mathrm{task}} = -\log p(y \mid x, r)$ (cross-entropy for the final answer given input and rationale)
  • $\mathcal{L}_{\mathrm{rationale}} = -\log p(r \mid x)$ (for rationale prediction) or an L2 penalty on decoder states
  • $\lambda$ controls the trade-off between rationale preservation and answer accuracy.

Pseudocode for RCL training:

```
Initialize θ ← θ_0
for each epoch:
  for each batch of (x, y, r):
    ŷ, r̂ = Model_θ.generate(x)
    ℓ_task = CE(ŷ, y)
    ℓ_rat = CE(r̂, r)
    θ ← θ − η ∇_θ [ℓ_task + λ ℓ_rat]
```
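
A runnable sketch of this objective under teacher forcing, using the cross-entropy form of the rationale loss with a Hugging Face causal LM as a stand-in (the model name, example triple, and λ value are illustrative choices, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # stand-in; the paper fine-tunes aligned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
lam = 0.5                                # λ: rationale-loss weight (assumed value)

def span_loss(prefix: str, target: str) -> torch.Tensor:
    """Cross-entropy of `target` tokens conditioned on `prefix` (prefix positions masked out)."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.size(1)] = -100           # ignore prefix tokens in the loss
    return model(input_ids=input_ids, labels=labels).loss

# One illustrative (x, r, y) triple; a real loop would iterate over batches per task.
x = "Q: Which gas do plants absorb during photosynthesis?\n"
r = "Photosynthesis takes in carbon dioxide and releases oxygen.\n"
y = "Answer: CO2"

loss = span_loss(x + r, y) + lam * span_loss(x, r)   # L_task + λ · L_rationale
loss.backward()
optim.step()
optim.zero_grad()
```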
Empirical results show that LLaMA-2-7B-Chat trained with RCL on only 500 examples per task achieves $OP_t \approx 46.6\%$ (vs. 48.7% for sequential fine-tuning [SeqFT] with 5,000 examples) and $\mathrm{BWT} \approx +13\%$, with the decline on reasoning tasks cut in half. Instruction following ($\Delta R^I$) improves by 8% over SeqFT, with safety metrics unaffected.

This suggests that anchoring learning around reasoning structures can partially decouple specialization from catastrophic forgetting in continual learning for LLMs.

5. Preparation, Formatting, and Protocol Details

All datasets are curated to ensure high-quality, balanced splits appropriate for robust continual evaluation. Key preparation steps include:

  • Sampling 5,000 train and 2,000 test examples per dataset
  • Ensuring class-balanced distributions for classification tasks (FOMC, C-STANCE)
  • Retaining only text-based inputs and outputs, excising images or tool call annotations
  • For long transcripts (MeetingBank), truncating or windowing inputs to fit model context limits
  • In RCL mode, augmenting outputs with both rationales and answers

This uniform formatting facilitates automation in prompting and evaluation, reducing confounders from schema variance. It also streamlines aggregation of results across diverse skill areas for comprehensive analysis.
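
A minimal sketch of such a preparation pass, with hypothetical helper logic for sampling, balancing, and truncation (not the released preprocessing code):

```python
import random
from collections import defaultdict

def prepare_split(records, n_train=5000, n_test=2000, label_key=None, max_words=None, seed=0):
    """Sample balanced train/test splits from text-only records (illustrative helper)."""
    rng = random.Random(seed)
    cleaned = []
    for rec in records:
        # Keep text-only inputs; optionally truncate long transcripts (e.g., MeetingBank).
        if not isinstance(rec.get("input", ""), str):
            continue
        if max_words:
            rec["input"] = " ".join(rec["input"].split()[:max_words])
        cleaned.append(rec)
    # For classification tasks (FOMC, C-STANCE), sample an equal number per label.
    if label_key:
        by_label = defaultdict(list)
        for rec in cleaned:
            by_label[rec[label_key]].append(rec)
        per_label = (n_train + n_test) // len(by_label)
        cleaned = [rec for recs in by_label.values()
                   for rec in rng.sample(recs, min(per_label, len(recs)))]
    rng.shuffle(cleaned)
    return cleaned[:n_train], cleaned[n_train:n_train + n_test]
```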

6. Significance and Impact for LLM Continual Learning Research

TRACE addresses key deficits in the evaluation of continual learning for advanced aligned LLMs. Its heterogeneous, challenging task suite and standardization enable precise quantification of catastrophic forgetting, performance trade-offs, and resilience mechanisms. The design exposes how existing alignment and instruction-following capabilities are vulnerable to decay during naïve sequential task adaptation.

RCL represents a methodological advance for continual learning: by integrating explicit reasoning traces, it both accelerates convergence to new tasks and helps safeguard general reasoning and instruction-following abilities. Performance declines in general ability and reasoning are substantially attenuated in empirical studies using RCL, without a detrimental effect on convergence speed or safety.

Overall, TRACE provides a rigorous, extensible foundation for the evaluation and development of continual learning strategies in LLMs, highlighting the tension between specialization and the preservation of core competencies, and offering mechanisms for mitigating catastrophic forgetting in practical deployment scenarios.
