TRACE Benchmark Datasets

Updated 16 November 2025
  • TRACE Benchmark Datasets is a benchmark suite that rigorously evaluates continual learning in LLMs by testing domain-specific knowledge, multilingual processing, code generation, and mathematical reasoning.
  • It employs a unified JSON formatting and automatic evaluation protocol, ensuring balanced and reproducible assessment across eight diverse tasks.
  • The benchmark quantifies performance trade-offs and catastrophic forgetting while demonstrating the benefits of reasoning-augmented continual learning in large language models.

TRACE (Task Robustness and Adaptation via Continual Evaluation) is a benchmark suite designed to rigorously evaluate continual learning in aligned LLMs. Existing continual learning benchmarks are considered insufficiently challenging for modern aligned LLMs due to their simplicity and the possibility of prior model exposure during instruction tuning. TRACE addresses this gap through a collection of eight diverse datasets, each testing distinct, high-difficulty competencies: domain-specific knowledge, multilingual processing, code generation, and mathematical reasoning. Crucially, TRACE introduces a unified data formatting and automatic evaluation protocol, enabling comprehensive and reproducible assessment of catastrophic forgetting and adaptation in LLMs during sequential fine-tuning.

1. Composition and Structure of TRACE

TRACE consists of eight datasets, each contributing 5,000 training and 2,000 test examples, for an aggregate of 40,000 training and 16,000 test instances. These datasets cover the following domains and tasks:

| Dataset | Domain/Task | Evaluation Metric |
|---|---|---|
| ScienceQA | Multi-hop science QA | Accuracy |
| FOMC | Monetary stance (finance) | Accuracy |
| MeetingBank | Long-context summarization | ROUGE-L |
| C-STANCE | Chinese stance detection | Accuracy |
| 20Minuten | German text simplification | SARI |
| Py150 | Python code completion | Edit-distance sim. |
| NumGLUE-cm | Arithmetic reasoning | EM/Accuracy |
| NumGLUE-ds | Discrete subtraction | EM/Accuracy |

The data for each task is standardized into a three-field JSON schema:

```
{"instruction": ..., "input": ..., "output": ...}
```

This enables all datasets to be fed into a prompting engine using the template:

```
<instruction>\n<input>
```

with the model expected to produce the <output>.
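
For illustration, a small loader that applies this template might look like the following sketch; the record contents are hypothetical and the field names are taken from the schema above:

```python
def build_prompt(example: dict) -> str:
    """Render a TRACE-style record into the <instruction>\n<input> template."""
    prompt = example["instruction"]
    if example.get("input"):            # some tasks may leave "input" empty
        prompt += "\n" + example["input"]
    return prompt

# Hypothetical record; real TRACE data would be read from the released JSON files.
example = {
    "instruction": "Answer the multiple-choice science question.",
    "input": "Which gas do plants absorb during photosynthesis? (A) O2 (B) CO2 (C) N2 (D) H2",
    "output": "B",
}
print(build_prompt(example))   # the model is expected to generate example["output"]
```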

Domain-Specific Tasks

  • ScienceQA involves 4-way multiple-choice questions requiring multi-hop reasoning, with an optional chain-of-thought (CoT) rationale included in the answer.
  • FOMC targets financial policy classification from central bank transcripts, reformulated into three labels (A: dovish, B: hawkish, C: neutral).
  • MeetingBank provides city-council meeting transcripts (mean 2,853 words) for abstractive summarization.

Multilingual Tasks

  • C-STANCE addresses stance detection in Chinese social media, following a target-based subtask with labels and prompts in Chinese.
  • 20Minuten comprises German news articles requiring document-level text simplification.

Code Completion

  • Py150 presents Python files for next-line code completion, scored by edit-distance similarity.

Mathematical Reasoning

  • NumGLUE-cm: Arithmetic, coin-math-type problems (with rationales generated via GPT-4).
  • NumGLUE-ds: Discrete subtraction exercises mirroring school mathematics.

For all datasets, preprocessing enforces label balance where appropriate, removes multi-modal content, and ensures consistency (e.g., truncation for long texts).

2. Unified Evaluation Protocols

TRACE employs a single, automatic evaluation framework. Each task is scored with an established metric:

  • Accuracy: ScienceQA, FOMC, C-STANCE, NumGLUE-cm, NumGLUE-ds
  • ROUGE-L: MeetingBank
  • SARI: 20Minuten
  • Edit-distance similarity: Py150

All metrics reside in a common schema, yielding an aggregated TRACE score:

$$\mathrm{TRACE} = \sum_{i=1}^{8} w_i \cdot \text{Metric}_i$$

with $w_i = 1/8$ for simple averaging. This produces:

$$\mathrm{TRACE}_{\mathrm{avg}} = \frac{1}{8} \sum_{i=1}^{8} \text{Metric}_i$$
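
A minimal sketch of this equal-weight aggregation (the per-task scores below are placeholders, not reported results):

```python
# Equal-weight aggregation of the eight per-task metrics into TRACE_avg.
task_scores = {                     # placeholder values for illustration only
    "ScienceQA": 0.62, "FOMC": 0.58, "MeetingBank": 0.31, "C-STANCE": 0.55,
    "20Minuten": 0.40, "Py150": 0.52, "NumGLUE-cm": 0.35, "NumGLUE-ds": 0.38,
}
weights = {name: 1 / len(task_scores) for name in task_scores}   # w_i = 1/8
trace_avg = sum(weights[name] * score for name, score in task_scores.items())
print(f"TRACE_avg = {trace_avg:.3f}")
```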

In continual-learning settings, two principal metrics are used for each sequential stage $t$:

  • Overall Performance after $t$ tasks ($OP_t$):

$$OP_t = \frac{1}{t} \sum_{i=1}^{t} R_{t,i}^D$$

  • Backward Transfer ($\mathrm{BWT}_t$):

$$\mathrm{BWT}_t = \frac{1}{t} \sum_{i=1}^{t-1} \left( R_{t,i}^D - R_{i,i}^D \right)$$

where $R_{t,i}^D$ is the model's performance on task $i$ after training on task $t$.
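
A minimal sketch of these two metrics, assuming a score matrix R where R[t][i] holds performance on task i after training through task t (the values below are illustrative):

```python
def overall_performance(R, t):
    """OP_t: mean score over tasks 1..t after training on task t (t is 1-indexed)."""
    return sum(R[t - 1][i] for i in range(t)) / t

def backward_transfer(R, t):
    """BWT_t: mean change on previously learned tasks, using the 1/t normalization above."""
    if t < 2:
        return 0.0
    return sum(R[t - 1][i] - R[i][i] for i in range(t - 1)) / t

# Illustrative 3-task score matrix: R[t][i] = score on task i after training on task t.
R = [
    [0.70, 0.00, 0.00],
    [0.55, 0.68, 0.00],
    [0.40, 0.60, 0.72],
]
print(overall_performance(R, 3))   # (0.40 + 0.60 + 0.72) / 3
print(backward_transfer(R, 3))     # ((0.40 - 0.70) + (0.60 - 0.68)) / 3
```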

To measure the preservation (or degradation) of original model capabilities post-fine-tuning, TRACE introduces delta measures:

  • $\Delta R_t^G$: Change in general ability (e.g., MMLU, BBH)
  • $\Delta R_t^I$: Change in instruction following (e.g., Self-Instruct, LIMA)
  • $\Delta R_t^S$: Change in safety (e.g., CoNa)

All delta measures are averages over a corresponding benchmark set and compare post-training performance to the initial baseline.
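
A small sketch of how such a delta could be computed (benchmark names follow the examples above; the scores are placeholders):

```python
def delta_measure(post_scores: dict, baseline_scores: dict) -> float:
    """Average post-fine-tuning score minus average baseline score over a benchmark set."""
    benchmarks = baseline_scores.keys()
    post = sum(post_scores[b] for b in benchmarks) / len(baseline_scores)
    base = sum(baseline_scores[b] for b in benchmarks) / len(baseline_scores)
    return post - base

# Illustrative values for the general-ability set (Delta R_t^G over MMLU and BBH).
baseline = {"MMLU": 0.54, "BBH": 0.39}
after_t  = {"MMLU": 0.45, "BBH": 0.31}
print(delta_measure(after_t, baseline))   # negative value indicates forgetting
```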

3. Catastrophic Forgetting Phenomena

Sequential fine-tuning of aligned LLMs on TRACE datasets induces pronounced catastrophic forgetting. For instance, after exposure to all eight tasks, LLaMA-2-13B-Chat's arithmetic accuracy on GSM8K-like tasks dropped from approximately 28.8% to about 2%. Simultaneously, $\Delta R_t^G$ (general ability) is strongly negative (up to –13 points on larger models), and instruction following, $\Delta R_t^I$, declines by 5–15 points. Safety, as measured by $\Delta R_t^S$, remains close to baseline. Incremental training reveals a trade-off between gaining task-specific performance and losing general capabilities: while $OP_t$ rises, $\Delta R_t^G$ decreases following an inverse-power law,

$$\Delta R^G \simeq \alpha \cdot (\text{epoch})^{-\beta}, \quad \beta \approx 0.3$$
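
For illustration, α and β in such a relation can be estimated by ordinary least squares in log-log space; the data points below are synthetic and chosen only to make the fit concrete:

```python
import numpy as np

# Synthetic |ΔR^G| magnitudes across training epochs, used only to illustrate the fit.
epochs = np.array([1, 2, 4, 8, 16], dtype=float)
delta_rg = np.array([10.0, 8.1, 6.7, 5.4, 4.4])

# |ΔR^G| ≈ α · epoch^(−β)  ⇒  log|ΔR^G| ≈ log α − β · log(epoch)
slope, intercept = np.polyfit(np.log(epochs), np.log(delta_rg), deg=1)
alpha, beta = np.exp(intercept), -slope
print(f"alpha ≈ {alpha:.2f}, beta ≈ {beta:.2f}")   # beta ≈ 0.3 for this synthetic series
```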

Tasks with explicit step-wise reasoning (e.g., ScienceQA containing reasoning paths) sometimes improve related reasoning benchmarks, whereas purely numeric tasks often accelerate forgetting.

A plausible implication is that model exposure to reasoning structures preserves or strengthens transferable cognitive patterns, while rote or answer-only tasks hasten the decay of general reasoning abilities.

4. Reasoning-Augmented Continual Learning (RCL)

Reasoning-augmented Continual Learning (RCL) was introduced based on the empirical finding that tasks with explicit reasoning paths mitigate capability collapse. RCL modifies each training example by appending two components, illustrated in the sketch after this list:

  • A task-specific cue (the final answer)
  • A meta-rationale (a stepwise CoT paragraph, generated in advance by GPT-4)
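
For illustration, an augmented record might reuse the three-field schema from Section 1 and fold the rationale into the output field; the content below is hypothetical:

```python
# Hypothetical RCL-style record: the output carries a stepwise rationale
# followed by the task-specific cue (the final answer).
rcl_example = {
    "instruction": "Answer the multiple-choice science question. Reason step by step, then give the answer.",
    "input": "Which gas do plants absorb during photosynthesis? (A) O2 (B) CO2 (C) N2 (D) H2",
    "output": (
        "Photosynthesis converts carbon dioxide and water into glucose and oxygen, "
        "so the gas absorbed is carbon dioxide. Answer: B"
    ),
}
```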

During training, the loss function is augmented:

$$\mathcal{L} = \mathcal{L}_{\mathrm{task}}(\hat{y}, y) + \lambda \, \mathcal{L}_{\mathrm{rationale}}(\hat{r}, r)$$

where

  • $\mathcal{L}_{\mathrm{task}} = -\log p(y \mid x, r)$ (cross-entropy for the final answer given input and rationale)
  • $\mathcal{L}_{\mathrm{rationale}} = -\log p(r \mid x)$ (for rationale prediction) or an L2 penalty on decoder states
  • $\lambda$ controls the trade-off between rationale preservation and answer accuracy.

Pseudocode for RCL training:

```
Initialize θ ← θ_0
for each epoch:
  for each batch of (x, y, r):
    ŷ, r̂ = Model_θ.generate(x)
    ℓ_task = CE(ŷ, y)
    ℓ_rat = CE(r̂, r)
    θ ← θ − η ∇_θ [ℓ_task + λ ℓ_rat]
```
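
A runnable sketch of this objective under teacher forcing, using the cross-entropy form of the rationale loss with a Hugging Face causal LM as a stand-in (the model name, example triple, and λ value are illustrative choices, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # stand-in; the paper fine-tunes aligned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
lam = 0.5                                # λ: rationale-loss weight (assumed value)

def span_loss(prefix: str, target: str) -> torch.Tensor:
    """Cross-entropy of `target` tokens conditioned on `prefix` (prefix positions masked out)."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.size(1)] = -100           # ignore prefix tokens in the loss
    return model(input_ids=input_ids, labels=labels).loss

# One illustrative (x, r, y) triple; a real loop would iterate over batches per task.
x = "Q: Which gas do plants absorb during photosynthesis?\n"
r = "Photosynthesis takes in carbon dioxide and releases oxygen.\n"
y = "Answer: CO2"

loss = span_loss(x + r, y) + lam * span_loss(x, r)   # L_task + λ · L_rationale
loss.backward()
optim.step()
optim.zero_grad()
```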
Empirical results show that LLaMA-2-7B-Chat trained with RCL on only 500 examples per task achieves $OP_t \approx 46.6\%$ (vs. 48.7% for sequential fine-tuning [SeqFT] with 5,000 examples) and $\mathrm{BWT} \approx +13\%$, with the decline on reasoning tasks cut in half. Instruction following ($\Delta R^I$) improves by 8% over SeqFT, with safety metrics unaffected.

This suggests that anchoring learning around reasoning structures can partially decouple specialization from catastrophic forgetting in continual learning for LLMs.

5. Preparation, Formatting, and Protocol Details

All datasets are curated to ensure high-quality, balanced splits appropriate for robust continual evaluation. Key preparation steps include:

  • Sampling 5,000 train and 2,000 test examples per dataset
  • Ensuring class-balanced distributions for classification tasks (FOMC, C-STANCE)
  • Retaining only text-based inputs and outputs, excising images or tool call annotations
  • For long transcripts (MeetingBank), truncating or windowing inputs to fit model context limits
  • In RCL mode, augmenting outputs with both rationales and answers

This uniform formatting facilitates automation in prompting and evaluation, reducing confounders from schema variance. It also streamlines aggregation of results across diverse skill areas for comprehensive analysis.
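
A minimal sketch of such a preparation pass, with hypothetical helper logic for sampling, balancing, and truncation (not the released preprocessing code):

```python
import random
from collections import defaultdict

def prepare_split(records, n_train=5000, n_test=2000, label_key=None, max_words=None, seed=0):
    """Sample balanced train/test splits from text-only records (illustrative helper)."""
    rng = random.Random(seed)
    cleaned = []
    for rec in records:
        # Keep text-only inputs; optionally truncate long transcripts (e.g., MeetingBank).
        if not isinstance(rec.get("input", ""), str):
            continue
        if max_words:
            rec["input"] = " ".join(rec["input"].split()[:max_words])
        cleaned.append(rec)
    # For classification tasks (FOMC, C-STANCE), sample an equal number per label.
    if label_key:
        by_label = defaultdict(list)
        for rec in cleaned:
            by_label[rec[label_key]].append(rec)
        per_label = (n_train + n_test) // len(by_label)
        cleaned = [rec for recs in by_label.values()
                   for rec in rng.sample(recs, min(per_label, len(recs)))]
    rng.shuffle(cleaned)
    return cleaned[:n_train], cleaned[n_train:n_train + n_test]
```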

6. Significance and Impact for LLM Continual Learning Research

TRACE addresses key deficits in the evaluation of continual learning for advanced aligned LLMs. Its heterogeneous, challenging task suite and standardization enable precise quantification of catastrophic forgetting, performance trade-offs, and resilience mechanisms. The design exposes how existing alignment and instruction-following capabilities are vulnerable to decay during naïve sequential task adaptation.

RCL represents a methodological advance for continual learning: by integrating explicit reasoning traces, it both accelerates convergence to new tasks and helps safeguard general reasoning and instruction-following abilities. Performance declines in general ability and reasoning are substantially attenuated in empirical studies using RCL, without a detrimental effect on convergence speed or safety.

Overall, TRACE provides a rigorous, extensible foundation for the evaluation and development of continual learning strategies in LLMs, highlighting the tension between specialization and the preservation of core competencies, and offering mechanisms for mitigating catastrophic forgetting in practical deployment scenarios.
