TRACE Benchmark Datasets
- TRACE is a benchmark suite that rigorously evaluates continual learning in aligned LLMs by testing domain-specific knowledge, multilingual processing, code generation, and mathematical reasoning.
- It employs a unified JSON format and automatic evaluation protocol, ensuring balanced and reproducible assessment across eight diverse tasks.
- The benchmark quantifies performance trade-offs and catastrophic forgetting while demonstrating the benefits of reasoning-augmented continual learning in large language models.
TRACE (Task Robustness and Adaptation via Continual Evaluation) is a benchmark suite designed to rigorously evaluate continual learning in aligned LLMs. Existing continual learning benchmarks are considered insufficiently challenging for modern aligned LLMs due to their simplicity and the possibility of prior model exposure during instruction tuning. TRACE addresses this gap through a collection of eight diverse datasets, each testing distinct, high-difficulty competencies: domain-specific knowledge, multilingual processing, code generation, and mathematical reasoning. Crucially, TRACE introduces a unified data formatting and automatic evaluation protocol, enabling comprehensive and reproducible assessment of catastrophic forgetting and adaptation in LLMs during sequential fine-tuning.
1. Composition and Structure of TRACE
TRACE consists of eight datasets, each contributing 5,000 training and 2,000 test examples, for an aggregate of 40,000 training and 16,000 test instances. These datasets cover the following domains and tasks:
| Dataset | Domain/Task | Evaluation Metric |
|---|---|---|
| ScienceQA | Multi-hop science QA | Accuracy |
| FOMC | Monetary stance (finance) | Accuracy |
| MeetingBank | Long-context summarization | ROUGE-L |
| C-STANCE | Chinese stance detection | Accuracy |
| 20Minuten | German text simplification | SARI |
| Py150 | Python code completion | Edit-distance sim. |
| NumGLUE-cm | Arithmetic reasoning | EM/Accuracy |
| NumGLUE-ds | Discrete subtraction | EM/Accuracy |
The data for each task is standardized into a three-field JSON schema:
```
{"instruction": ..., "input": ..., "output": ...}
```
The model prompt is formed by concatenating the instruction and input as `<instruction>\n<input>`, and the expected completion is `<output>`.
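As a minimal sketch of this formatting (not the official TRACE loader; the example record and field separator are illustrative assumptions), a TRACE-style record can be turned into a prompt/target pair as follows:

```python
# Minimal sketch: turn a TRACE-style three-field JSON record into a
# (prompt, target) pair following the schema described above.
import json

def record_to_pair(record: dict) -> tuple[str, str]:
    """Concatenate instruction and input into the prompt; output is the target."""
    prompt = f"{record['instruction']}\n{record['input']}"
    return prompt, record["output"]

# Illustrative record; real TRACE examples follow the same three-field schema.
example = json.loads(
    '{"instruction": "Answer the multiple-choice question.",'
    ' "input": "Which planet is closest to the sun? (A) Venus (B) Mercury",'
    ' "output": "B"}'
)
prompt, target = record_to_pair(example)
print(prompt, "->", target)
```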
Domain-Specific Tasks
- ScienceQA involves 4-way multiple-choice questions requiring multi-hop reasoning, with a chain-of-thought (CoT) option in the answer.
- FOMC targets financial policy classification from central bank transcripts, reformulated into three labels (A: dovish, B: hawkish, C: neutral).
- MeetingBank provides city-council meeting transcripts (mean 2,853 words) for abstractive summarization.
Multilingual Tasks
- C-STANCE addresses stance detection in Chinese social media, following a target-based subtask with labels and prompts in Chinese.
- 20Minuten comprises German news articles requiring document-level text simplification.
Code Completion
- Py150 presents Python files for next-line code completion, scored by edit-distance similarity.
Mathematical Reasoning
- NumGLUE-cm: Arithmetic, coin-math-type problems (with rationales generated via GPT-4).
- NumGLUE-ds: Discrete subtraction exercises mirroring school mathematics.
For all datasets, preprocessing enforces label balance where appropriate, removes multi-modal content, and ensures consistency (e.g., truncation for long texts).
2. Unified Evaluation Protocols
TRACE employs a single, automatic evaluation framework. Each task is scored with an established metric:
- Accuracy: ScienceQA, FOMC, C-STANCE, NumGLUE-cm, NumGLUE-ds
- ROUGE-L: MeetingBank
- SARI: 20Minuten
- Edit-distance similarity: Py150
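The official scoring scripts are not reproduced here; as a rough illustration, the sketch below implements exact-match accuracy and uses Python's standard `difflib` ratio as a stand-in for the edit-distance similarity applied to Py150-style completions.

```python
# Illustrative scoring helpers (not TRACE's official scripts).
from difflib import SequenceMatcher

def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of exact matches between predicted and reference labels."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

def edit_similarity(prediction: str, reference: str) -> float:
    """Normalized similarity in [0, 100]; higher means fewer edits needed."""
    return 100.0 * SequenceMatcher(None, prediction, reference).ratio()

print(accuracy(["B", "A"], ["B", "C"]))               # 50.0
print(edit_similarity("return x+1", "return x + 1"))  # high similarity, close to 100
```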
All metrics reside in a common schema, yielding an aggregated TRACE score

$$\text{Score} = \sum_{i=1}^{N} w_i \, s_i$$

with $w_i = 1/N$ for simple averaging. This produces $\text{Score} = \frac{1}{N}\sum_{i=1}^{N} s_i$ over the $N = 8$ per-task scores $s_i$.

In continual-learning settings, two principal metrics are used after each sequential stage $t$:
- Overall Performance after $t$ tasks (OP):

$$\mathrm{OP}_t = \frac{1}{t} \sum_{i=1}^{t} R_{t,i}$$

- Backward Transfer (BWT):

$$\mathrm{BWT}_t = \frac{1}{t-1} \sum_{i=1}^{t-1} \left( R_{t,i} - R_{i,i} \right)$$

where $R_{t,i}$ is the model's performance on task $i$ after training on task $t$.
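A minimal sketch of these continual-learning metrics, assuming a results matrix `R` where `R[t][i]` is the score on task `i` after training stage `t` (the matrix values below are toy numbers, not reported results):

```python
# Sketch of OP_t and BWT_t as defined above; not TRACE's official evaluation code.

def overall_performance(R: list[list[float]], t: int) -> float:
    """OP_t: mean score over tasks 0..t after training on task t (0-indexed)."""
    return sum(R[t][i] for i in range(t + 1)) / (t + 1)

def backward_transfer(R: list[list[float]], t: int) -> float:
    """BWT_t: mean change on earlier tasks relative to their just-after-training scores."""
    if t == 0:
        return 0.0
    return sum(R[t][i] - R[i][i] for i in range(t)) / t

# Toy 3-stage example (illustrative numbers only).
R = [
    [80.0,  0.0,  0.0],
    [70.0, 75.0,  0.0],
    [60.0, 68.0, 72.0],
]
print(overall_performance(R, 2))  # (60 + 68 + 72) / 3
print(backward_transfer(R, 2))    # ((60 - 80) + (68 - 75)) / 2
```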
To measure the preservation (or degradation) of original model capabilities post-fine-tuning, TRACE introduces delta measures:
- $\Delta R_G$: Change in general ability (e.g., MMLU, BBH)
- $\Delta R_I$: Change in instruction following (e.g., Self-Instruct, LIMA)
- $\Delta R_S$: Change in safety (e.g., CoNa)
All delta measures are averages over a corresponding benchmark set and compare post-training performance to the initial baseline.
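A short sketch of such a delta measure, averaging post-training minus baseline scores over a benchmark set (the benchmark names and scores below are placeholders, not reported results):

```python
# Sketch of a capability delta as described above; placeholder values only.

def capability_delta(post_scores: dict[str, float], baseline_scores: dict[str, float]) -> float:
    """Average (post - baseline) score over the benchmarks in the set."""
    diffs = [post_scores[name] - baseline_scores[name] for name in baseline_scores]
    return sum(diffs) / len(diffs)

baseline = {"MMLU": 54.0, "BBH": 40.0}        # placeholder baseline scores
after_cl = {"MMLU": 48.0, "BBH": 33.0}        # placeholder post-training scores
print(capability_delta(after_cl, baseline))   # negative value => general-ability loss
```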
3. Catastrophic Forgetting Phenomena
Sequential fine-tuning of aligned LLMs on TRACE datasets induces pronounced catastrophic forgetting. For instance, after exposure to all eight tasks, LLaMA-2-13B-Chat's arithmetic accuracy on GSM8K-like tasks dropped from approximately 28.8% to about 2%. Simultaneously, general ability ($\Delta R_G$) turns strongly negative (up to –13 points on larger models), and instruction following ($\Delta R_I$) declines by 5–15 points. Safety ($\Delta R_S$) remains close to baseline. Incremental training reveals a trade-off between gaining task-specific performance and losing general capabilities: while $\mathrm{OP}_t$ rises, $\Delta R_G$ decreases, approximately following an inverse-power law in the number of tasks.
Tasks with explicit step-wise reasoning (e.g., ScienceQA containing reasoning paths) sometimes improve related reasoning benchmarks, whereas purely numeric tasks often accelerate forgetting.
A plausible implication is that model exposure to reasoning structures preserves or strengthens transferable cognitive patterns, while rote or answer-only tasks hasten the decay of general reasoning abilities.
4. Reasoning-Augmented Continual Learning (RCL)
Reasoning-augmented Continual Learning (RCL) was introduced based on the empirical finding that tasks with explicit reasoning paths mitigate capability collapse. RCL modifies each training example by appending:
- A task-specific cue (the final answer)
- A meta-rationale (a stepwise CoT paragraph, generated in advance by GPT-4)
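A sketch of this augmentation step is shown below; the exact output template (marker strings and ordering of rationale and answer) is an assumption for illustration, not the paper's verbatim format.

```python
# Sketch of RCL-style data augmentation: the training target is rewritten to
# contain a GPT-4-generated rationale followed by the final answer.

def augment_with_rationale(record: dict, rationale: str) -> dict:
    """Return a copy of the record whose output is 'rationale + final answer'."""
    augmented = dict(record)
    augmented["output"] = f"Reasoning: {rationale}\nAnswer: {record['output']}"
    return augmented

record = {
    "instruction": "Solve the arithmetic problem.",
    "input": "Tom has 3 coins worth 25 cents each. How many cents does he have?",
    "output": "75",
}
print(augment_with_rationale(record, "Each coin is 25 cents; 3 * 25 = 75."))
```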
During training, the loss function is augmented:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \, \mathcal{L}_{\text{rat}}$$

where
- $\mathcal{L}_{\text{task}}$ is the cross-entropy for the final answer given the input and rationale,
- $\mathcal{L}_{\text{rat}}$ is the cross-entropy for rationale prediction (or an L2 penalty on decoder states),
- $\lambda$ controls the trade-off between rationale preservation and answer accuracy.
Pseudocode for RCL training:
```
Initialize θ ← θ_0
for each epoch:
    for each batch of (x, y, r):
        ŷ, r̂ = Model_θ.generate(x)
        ℓ_task = CE(ŷ, y)
        ℓ_rat  = CE(r̂, r)
        θ ← θ − η ∇_θ[ℓ_task + λ ℓ_rat]
```
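For concreteness, the following is a hedged PyTorch sketch of the combined objective using a toy causal language model. The model, vocabulary, and span-masking scheme are illustrative assumptions; only the loss structure (answer cross-entropy plus $\lambda$ times rationale cross-entropy) follows the description above.

```python
# Hedged sketch of an RCL-style training step on a toy causal LM.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, lam, lr = 100, 32, 0.5, 1e-3

class ToyLM(nn.Module):
    """Toy causal LM: embedding -> GRU -> vocabulary logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)  # (batch, seq, vocab)

model = ToyLM()
opt = torch.optim.SGD(model.parameters(), lr=lr)

# One training sequence laid out as [prompt | rationale | answer] token ids.
tokens = torch.randint(0, vocab_size, (1, 20))     # placeholder token ids
is_rationale = torch.zeros(1, 20, dtype=torch.bool)
is_answer = torch.zeros(1, 20, dtype=torch.bool)
is_rationale[:, 8:16] = True                        # assumed rationale span
is_answer[:, 16:] = True                            # assumed answer span

logits = model(tokens[:, :-1])                      # predict token t+1 from prefix
targets = tokens[:, 1:]
per_token = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="none"
).view_as(targets)

loss_task = per_token[is_answer[:, 1:]].mean()      # CE on answer tokens
loss_rat = per_token[is_rationale[:, 1:]].mean()    # CE on rationale tokens
loss = loss_task + lam * loss_rat                   # combined RCL objective

opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```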
This suggests that anchoring learning around reasoning structures can partially decouple specialization from catastrophic forgetting in continual learning for LLMs.
5. Preparation, Formatting, and Protocol Details
All datasets are curated to ensure high-quality, balanced splits appropriate for robust continual evaluation. Key preparation steps include:
- Sampling 5,000 train and 2,000 test examples per dataset
- Ensuring class-balanced distributions for classification tasks (FOMC, C-STANCE)
- Retaining only text-based inputs and outputs, excising images or tool call annotations
- For long transcripts (MeetingBank), truncating or windowing inputs to fit model context limits
- In RCL mode, augmenting outputs with both rationales and answers
This uniform formatting facilitates automation in prompting and evaluation, reducing confounders from schema variance. It also streamlines aggregation of results across diverse skill areas for comprehensive analysis.
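The preparation steps listed above can be sketched roughly as follows; split sizes follow the text, while the field names, balancing heuristic, and character-level truncation are assumptions for illustration rather than the benchmark's exact pipeline.

```python
# Rough sketch of per-task preparation: sampling, class balancing for
# classification tasks, and truncation of long inputs.
import random

def prepare_split(records, n_train=5000, n_test=2000, balance_key=None,
                  max_chars=8000, seed=0):
    """Sample (optionally class-balanced) train/test splits and truncate long inputs."""
    rng = random.Random(seed)
    if balance_key is not None:
        # Group by label and sample evenly from each class (e.g., FOMC, C-STANCE).
        by_label = {}
        for rec in records:
            by_label.setdefault(rec[balance_key], []).append(rec)
        per_class = (n_train + n_test) // len(by_label)
        pool = [rec for recs in by_label.values()
                for rec in rng.sample(recs, min(per_class, len(recs)))]
    else:
        pool = rng.sample(records, min(n_train + n_test, len(records)))
    rng.shuffle(pool)
    for rec in pool:
        rec["input"] = rec["input"][:max_chars]  # crude truncation for long transcripts
    return pool[:n_train], pool[n_train:n_train + n_test]
```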
6. Significance and Impact for LLM Continual Learning Research
TRACE addresses key deficits in the evaluation of continual learning for advanced aligned LLMs. Its heterogeneous, challenging task suite and standardization enable precise quantification of catastrophic forgetting, performance trade-offs, and resilience mechanisms. The design exposes how existing alignment and instruction-following capabilities are vulnerable to decay during naïve sequential task adaptation.
RCL represents a methodological advance for continual learning: by integrating explicit reasoning traces, it both accelerates convergence to new tasks and helps safeguard general reasoning and instruction-following abilities. Performance declines in general ability and reasoning are substantially attenuated in empirical studies using RCL, without a detrimental effect on convergence speed or safety.
Overall, TRACE provides a rigorous, extensible foundation for the evaluation and development of continual learning strategies in LLMs, highlighting the tension between specialization and the preservation of core competencies, and offering mechanisms for mitigating catastrophic forgetting in practical deployment scenarios.