MeCo: Metadata Conditioning then Cooldown
- The paper introduces MeCo, a pre-training method that combines a metadata-conditioning phase with a cooldown phase to enhance sample efficiency and model steerability.
- It utilizes a 90% training phase with prepended metadata and a subsequent 10% phase without metadata to maintain standard inference functionality.
- Empirical results show up to a 33% reduction in pre-training token requirements and improved controllability across multiple model scales and corpora.
Metadata Conditioning then Cooldown (MeCo) is a pre-training method for LLMs designed to enhance both sample efficiency and model steerability by systematically leveraging metadata associated with each document. MeCo operates in two distinct phases: it first conditions training on metadata by prepending a standardized template to every document—such as the domain from which the text was derived—and subsequently transitions into a cooldown stage in which this metadata cue is removed. This approach preserves compatibility with standard LLM inference while enabling prompt-based steering and improving downstream performance per token of pre-training data (Gao et al., 3 Jan 2025).
1. Formal Training Objectives
The MeCo method represents each training example as a pair $(x, c)$, where $x = (x_1, \ldots, x_T)$ is the tokenized text and $c = (c_1, \ldots, c_K)$ is its metadata (e.g., a domain such as "en.wikipedia.org"). Both are tokenized with the same vocabulary.
1.1 Metadata-Conditioning Phase
For the first 90% of training tokens, input sequences are constructed as:
```
["URL:", c_1, ..., c_K, "\n\n", x_1, ..., x_T]
```
1.2 Cooldown Phase
For the final 10% of tokens, the model is trained on raw text without any prepended metadata. The time-dependent mixed objective can be expressed as

$$\mathcal{L}(t) = \lambda(t)\,\mathcal{L}_{\text{cond}} + \bigl(1 - \lambda(t)\bigr)\,\mathcal{L}_{\text{raw}},$$

with $\lambda(t)$ switching from $1$ to $0$ after the first 90% of training tokens.
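The two-phase schedule amounts to a hard phase indicator over the token budget. A minimal sketch (the helper names `lam` and `build_sequence` are illustrative, not from the paper):

```python
def lam(tokens_seen: int, total_tokens: int, cooldown_frac: float = 0.1) -> int:
    """Phase indicator lambda(t): 1 during metadata conditioning (first 90%
    of training tokens), 0 during the cooldown (final 10%)."""
    return 1 if tokens_seen < (1 - cooldown_frac) * total_tokens else 0

def build_sequence(text: str, metadata: str, tokens_seen: int, total_tokens: int) -> str:
    """Construct the training input for the current phase."""
    if lam(tokens_seen, total_tokens):
        return f"URL: {metadata}\n\n{text}"   # metadata-conditioning phase
    return text                               # cooldown phase: raw text only
```

The same data pipeline serves both phases; only the prefix construction is gated on training progress.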
2. Metadata Encoding and Input Construction
Metadata is extracted from each training document, typically as the full domain of the document's URL (e.g., "en.wikipedia.org"). This is inserted into a template string:
```
"URL: en.wikipedia.org\n\n" + document_text
```
Ablation studies indicate that alternative metadata—such as hashed document IDs ("7dsjuj3a-olp0") or short topic tags generated by a model ("technology leader biography")—yields performance comparable to explicit semantic metadata, provided the grouping of documents is consistent.
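Domain extraction and the hashed-ID alternative from the ablations can be sketched as follows; the function names are illustrative, and the key property is only that every document from the same group receives the same tag:

```python
import hashlib
from urllib.parse import urlparse

def domain_metadata(url: str) -> str:
    """Extract the full domain from a document URL (e.g. 'en.wikipedia.org')."""
    return urlparse(url).netloc

def hashed_id_metadata(url: str, length: int = 12) -> str:
    """Opaque but consistent per-domain ID. The ablations suggest that
    consistent grouping, not semantic content, is what matters."""
    return hashlib.sha256(domain_metadata(url).encode()).hexdigest()[:length]
```

Two documents from the same domain map to the same hashed ID, preserving the grouping even though the tag itself is meaningless.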
3. Training Schedule and Hyperparameters
All experiments are conducted with LLaMA-style Transformers and the LLaMA-3 tokenizer. Hyperparameter selection is as follows:
| Parameter | Value | Variation for 8B model |
|---|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.95) | Same |
| Peak learning rate | (for stability) | |
| Weight decay | 0.033 | 0.1 |
| Batch size | 4M tokens/step | Same |
| LR schedule | 5% linear warmup, cosine decay to 10% | Same |
| Metadata/Standard phase split | 90% / 10% token ratio | Same |
| Total pre-training tokens | 160B (600M, 1.6B, 3B) | 80B |
Ablations indicate that a cooldown phase of 10–20% of total tokens is optimal for downstream validation performance. The metadata phase always precedes the cooldown, with the optimizer state and learning-rate schedule continuing smoothly across the phase boundary.
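The learning-rate schedule from the table (5% linear warmup, cosine decay to 10% of peak) runs continuously over the whole token budget, ignoring the metadata→cooldown boundary. A sketch, with illustrative function and parameter names:

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_frac: float = 0.05, final_frac: float = 0.10) -> float:
    """5% linear warmup to peak_lr, then cosine decay to 10% of peak.
    Note: the schedule is a single curve over all steps; the 90%/10%
    phase switch does not reset it."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    floor = final_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))
```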
4. Empirical Results
MeCo demonstrates significant improvements in pre-training sample efficiency, cross-task generalization, and controllability. Core findings include:
- Sample Efficiency: On the DCLM corpus, the 1.6B parameter model with MeCo (144B metadata + 16B cooldown tokens) achieves an OLMES average 5-shot score of 56.7, matching the standard baseline trained with 240B tokens—a 33% reduction in data required for parity.
- Performance Across Scales: Consistent improvements are observed for 600M, 1.6B, 3B, and 8B models, with MeCo-enhanced models outperforming standard pre-training at the same compute budget.
- Corpus Generality: MeCo yields higher scores than baseline across diverse corpora (C4, RefinedWeb, DCLM), demonstrating domain-agnostic benefit.
- Perplexity vs. Accuracy: Lower perplexity on the validation set does not predict the downstream task gains realized by MeCo. For instance, the 240B token baseline has a lower validation PPL than the 160B token MeCo-trained model despite identical average accuracy.
5. Implementation Considerations
To implement MeCo:
- Data Preparation: Each document should be paired with some consistent metadata (domain, collection, hashed ID, or topic tag).
- Preprocessing: 90% of examples are prepended with the metadata string using the template and the rest are retained as standard text. Associated loss masks must zero out the metadata tokens.
- Training Pipeline: Minimal code modification is required—only the data loader and loss calculation must accommodate the input and mask. No additional parameter or computational overhead arises from MeCo conditioning.
- Resource Planning: Compute requirements are identical to standard pre-training when the corpus size is held fixed; the metadata prefix adds no per-step overhead. If the total token budget is expanded (e.g., doubled), wall-clock time grows proportionally.
6. Conditional Generation and Model Steering
After MeCo pre-training, LLMs can be steered by prepending metadata-like prompt fragments at inference, even those not encountered during training. Illustrative example applications:
- Enhancing Factual Accuracy:
  - Prompt: `"URL: factquizmaster.com\n\nQ: Who wrote ‘Pride and Prejudice’?\nA:"`
  - Outcome: +6% absolute gain on CommonsenseQA (zero-shot), compared to unconditional generation.
- Mitigating Toxicity:
  - Generations with standard, unconditional input are scored via Detoxify.
  - Prepending `"URL: en.wikipedia.org"` reduces the measured toxicity score.
  - Standard models do not realize such a benefit, underlining a unique property of MeCo pre-training.
This conditionality, enabled by consistent training exposure, provides a lightweight steering mechanism applicable to a wide array of target behaviors.
7. Significance and Broader Implications
MeCo is a minimal-modification plug-in for large-scale LLM pre-training pipelines. It delivers up to 33% gains in sample efficiency, robust improvements across scales and corpora, and introduces a simple, effective conditionality mechanism at inference. No changes to model architecture or training compute per step are necessary. This framework illustrates that data-centric, metadata-aware regimens can yield substantial improvements in pre-training outcomes and controllability (Gao et al., 3 Jan 2025).