MeCo: Metadata Conditioning then Cooldown
- The paper introduces MeCo, a pre-training method that combines a metadata-conditioning phase with a cooldown phase to enhance sample efficiency and model steerability.
- It utilizes a 90% training phase with prepended metadata and a subsequent 10% phase without metadata to maintain standard inference functionality.
- Empirical results show up to a 33% reduction in pre-training token requirements and improved controllability across multiple model scales and corpora.
Metadata Conditioning then Cooldown (MeCo) is a pre-training method for LLMs designed to enhance both sample efficiency and model steerability by systematically leveraging metadata associated with each document. MeCo operates in two distinct phases: it first conditions training on metadata by prepending a standardized template to every document—such as the domain from which the text was derived—and subsequently transitions into a cooldown stage in which this metadata cue is removed. This approach preserves compatibility with standard LLM inference while enabling prompt-based steering and improving downstream performance per token of pre-training data (Gao et al., 3 Jan 2025).
1. Formal Training Objectives
The MeCo method represents each training example as a pair $(x, c)$, where $x = (x_1, \ldots, x_T)$ is the tokenized text and $c = (c_1, \ldots, c_K)$ is its metadata (e.g., a domain such as "en.wikipedia.org"). Both are tokenized with the same vocabulary.
1.1 Metadata-Conditioning Phase
For the first 90% of training tokens, input sequences are constructed as:
```
["URL:", c_1, ..., c_K, "\n\n", x_1, ..., x_T]
```
1.2 Cooldown Phase
For the final 10% of tokens, the model is trained on raw text without any prepended metadata. The time-dependent mixed objective can be expressed as

$$\mathcal{L}(t) = \lambda(t)\,\mathcal{L}_{\text{cond}} + \bigl(1 - \lambda(t)\bigr)\,\mathcal{L}_{\text{raw}},$$

with $\lambda(t)$ switching from $1$ to $0$ after the first 90% of training tokens.
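The two-phase schedule amounts to a hard phase indicator over the token budget. A minimal sketch (the helper names `lam` and `build_sequence` are illustrative, not from the paper):

```python
def lam(tokens_seen: int, total_tokens: int, cooldown_frac: float = 0.1) -> int:
    """Phase indicator lambda(t): 1 during metadata conditioning (first 90%
    of training tokens), 0 during the cooldown (final 10%)."""
    return 1 if tokens_seen < (1 - cooldown_frac) * total_tokens else 0

def build_sequence(text: str, metadata: str, tokens_seen: int, total_tokens: int) -> str:
    """Construct the training input for the current phase."""
    if lam(tokens_seen, total_tokens):
        return f"URL: {metadata}\n\n{text}"   # metadata-conditioning phase
    return text                               # cooldown phase: raw text only
```

The same data pipeline serves both phases; only the prefix construction is gated on training progress.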
2. Metadata Encoding and Input Construction
Metadata is extracted from each training document, typically as the full domain of the document's URL (e.g., "en.wikipedia.org"). This is inserted into a template string:
```
"URL: en.wikipedia.org\n\n" + document_text
```
Ablation studies indicate that alternative metadata—such as hashed document IDs ("7dsjuj3a-olp0") or short topic tags generated by a model ("technology leader biography")—yields performance comparable to explicit semantic metadata, provided the grouping of documents is consistent.
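Domain extraction and the hashed-ID alternative from the ablations can be sketched as follows; the function names are illustrative, and the key property is only that every document from the same group receives the same tag:

```python
import hashlib
from urllib.parse import urlparse

def domain_metadata(url: str) -> str:
    """Extract the full domain from a document URL (e.g. 'en.wikipedia.org')."""
    return urlparse(url).netloc

def hashed_id_metadata(url: str, length: int = 12) -> str:
    """Opaque but consistent per-domain ID. The ablations suggest that
    consistent grouping, not semantic content, is what matters."""
    return hashlib.sha256(domain_metadata(url).encode()).hexdigest()[:length]
```

Two documents from the same domain map to the same hashed ID, preserving the grouping even though the tag itself is meaningless.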
3. Training Schedule and Hyperparameters
All experiments are conducted with LLaMA-style Transformers and the LLaMA-3 tokenizer. Hyperparameter selection is as follows:
| Parameter | Value | Variation for 8B model |
|---|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.95) | Same |
| Peak learning rate | (for stability) | |
| Weight decay | 0.033 | 0.1 |
| Batch size | 4M tokens/step | Same |
| LR schedule | 5% linear warmup, cosine decay to 10% | Same |
| Metadata/Standard phase split | 90% / 10% token ratio | Same |
| Total pre-training tokens | 160B (600M, 1.6B, 3B) | 80B |
Ablations indicate that a cooldown phase of 10–20% of total tokens is optimal for downstream validation performance. The metadata phase always precedes the cooldown, with the optimizer state and learning-rate schedule continuing smoothly across the phase boundary.
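The learning-rate schedule from the table (5% linear warmup, cosine decay to 10% of peak) runs continuously over the whole token budget, ignoring the metadata→cooldown boundary. A sketch, with illustrative function and parameter names:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_frac: float = 0.05, final_frac: float = 0.10) -> float:
    """5% linear warmup to peak_lr, then cosine decay to 10% of peak.
    Note: the schedule is a single curve over all steps; the 90%/10%
    phase switch does not reset it."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    floor = final_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```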
4. Empirical Results
MeCo demonstrates significant improvements in pre-training sample efficiency, cross-task generalization, and controllability. Core findings include:
- Sample Efficiency: On the DCLM corpus, the 1.6B parameter model with MeCo (144B metadata + 16B cooldown tokens) achieves an OLMES average 5-shot score of 56.7, matching the standard baseline trained with 240B tokens—a 33% reduction in data required for parity.
- Performance Across Scales: Consistent improvements are observed for 600M, 1.6B, 3B, and 8B models, with MeCo-enhanced models outperforming standard pre-training at the same compute budget.
- Corpus Generality: MeCo yields higher scores than baseline across diverse corpora (C4, RefinedWeb, DCLM), demonstrating domain-agnostic benefit.
- Perplexity vs. Accuracy: Lower perplexity on the validation set does not predict the downstream task gains realized by MeCo. For instance, the 240B token baseline has a lower validation PPL than the 160B token MeCo-trained model despite identical average accuracy.
5. Implementation Considerations
To implement MeCo:
- Data Preparation: Each document should be paired with some consistent metadata (domain, collection, hashed ID, or topic tag).
- Preprocessing: 90% of examples are prepended with the metadata string using the template and the rest are retained as standard text. Associated loss masks must zero out the metadata tokens.
- Training Pipeline: Minimal code modification is required—only the data loader and loss calculation must accommodate the input and mask. No additional parameter or computational overhead arises from MeCo conditioning.
- Resource Planning: Compute requirements are identical to standard pre-training when the corpus size is held fixed; the metadata prefix adds no per-step overhead. If the total token budget is expanded (e.g., doubled), wall-clock time grows proportionally.
6. Conditional Generation and Model Steering
After MeCo pre-training, LLMs can be steered by prepending metadata-like prompt fragments at inference, even those not encountered during training. Illustrative example applications:
- Enhancing Factual Accuracy:
  - Prompt: `"URL: factquizmaster.com\n\nQ: Who wrote ‘Pride and Prejudice’?\nA:"`
  - Outcome: +6% absolute gain on CommonsenseQA (zero-shot), compared to unconditional generation.
- Mitigating Toxicity:
  - Generations with standard, unconditional input are scored via Detoxify.
  - Prepending `"URL: en.wikipedia.org"` reduces the measured toxicity score.
  - Standard models do not realize such a benefit, underlining a unique property of MeCo pre-training.
This conditionality, enabled by consistent training exposure, provides a lightweight steering mechanism applicable to a wide array of target behaviors.
7. Significance and Broader Implications
MeCo is a minimal-modification plug-in for large-scale LLM pre-training pipelines. It delivers up to 33% gains in sample efficiency, robust improvements across scales and corpora, and introduces a simple, effective conditionality mechanism at inference. No changes to model architecture or training compute per step are necessary. This framework illustrates that data-centric, metadata-aware regimens can yield substantial improvements in pre-training outcomes and controllability (Gao et al., 3 Jan 2025).