This paper investigates the emergence of reflection—an LLM's ability to examine its reasoning and correct errors—during the pre-training phase, challenging the common assumption that this capability arises primarily from post-training reinforcement learning. The authors propose a framework to measure reflection by creating adversarial datasets where models must identify and correct errors in provided chains-of-thought (CoTs) to arrive at the correct answer (AI et al., 5 Apr 2025).
Key Concepts and Definitions
The paper distinguishes between two reflection settings and two forms:
- Situational-Reflection: The model reflects on reasoning provided by an external source (e.g., a different model or a deliberately corrupted CoT).
- Self-Reflection: The model reflects on its own previously generated reasoning.
- Explicit Reflection: The model's output includes tokens that explicitly acknowledge and address errors (e.g., "Wait, I made a mistake," "Let's check our work.").
- Implicit Reflection: The model corrects the error and provides the right answer without explicitly mentioning the error or the correction process.
Methodology: Eliciting and Measuring Reflection
The core methodology involves creating specialized datasets and using an automated classifier:
- Adversarial Dataset Generation:
- Situational-Reflection Datasets: Algorithm 1 outlines the process:
- 1. Start with a task instance and its correct CoT (either from a dataset artifact or generated by a capable model like GPT-4o or DeepSeek-V3).
- 2. Use a capable model to introduce subtle errors (e.g., arithmetic mistakes, logical flaws) into the correct CoT, creating an adversarial CoT that leads to an incorrect answer.
- 3. Append a simple trigger phrase like "Wait," to the end of the adversarial CoT within the prompt.
- 4. Keep the original question and the gold standard answer for evaluation.
```python
# Pseudocode for Situational-Reflection Dataset Generation
def create_situational_reflection_dataset(tasks, frontier_model):
    D_sit = []
    for task in tasks:
        question = task['question']
        gold_answer = task['answer']
        # Correct CoT, either from a dataset artifact or generated by the frontier model
        correct_cot = get_correct_cot(task, frontier_model)
        # Introduce subtle errors to create an adversarial CoT
        adversarial_cot, incorrect_answer = create_adversarial_cot(correct_cot, frontier_model)
        # Keep the instance only if the adversarial CoT leads to a different answer
        if incorrect_answer != gold_answer:
            prompt_cot = adversarial_cot + "\nWait,"
            D_sit.append({'question': question,
                          'adversarial_context': prompt_cot,
                          'gold_answer': gold_answer})
    return D_sit
```
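The helper `create_adversarial_cot` is left abstract above. One way it could be implemented is sketched below, assuming a hypothetical `frontier_model.chat()` text interface and a task-specific `extract_answer` parser; neither is specified by the paper.

```python
# Hypothetical sketch: corrupting a correct CoT with a frontier model.
# Assumes `frontier_model.chat(prompt) -> str` and `extract_answer(cot) -> str`;
# both interfaces are illustrative, not taken from the paper.
CORRUPTION_PROMPT = """Below is a correct step-by-step solution.
Rewrite it so that it contains one subtle error (e.g., an arithmetic slip
or a flawed logical step) and ends with the resulting incorrect answer.
Do not state the correct answer anywhere.

Solution:
{cot}
"""

def create_adversarial_cot(correct_cot, frontier_model):
    adversarial_cot = frontier_model.chat(CORRUPTION_PROMPT.format(cot=correct_cot))
    incorrect_answer = extract_answer(adversarial_cot)  # task-specific parser
    return adversarial_cot, incorrect_answer
```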
- Self-Reflection Datasets: Algorithm 2 describes this process:
- 1. Run a specific model checkpoint being evaluated on the original base task.
- 2. If the checkpoint generates an incorrect answer along with a CoT, retain this incorrect CoT as the adversarial CoT.
- 3. Append the "Wait," trigger to this self-generated incorrect CoT.
- 4. Keep the original question and gold answer.
- 5. Important implementation note: For a fair comparison across different stages of pre-training within a model family (e.g., OLMo-2-7B), the self-reflection dataset for that family is filtered to include only those questions that all checkpoints within that family initially answered incorrectly (a sketch of this filtering step follows the pseudocode below).
```python
# Pseudocode for Self-Reflection Dataset Generation
def create_self_reflection_dataset(tasks, model_checkpoint):
    D_self = []
    for task in tasks:
        question = task['question']
        gold_answer = task['answer']
        # Generate a CoT and answer with the model checkpoint being evaluated
        generated_cot, generated_answer = model_checkpoint.generate(question)
        # Keep only tasks the checkpoint got wrong; its own faulty CoT is the adversarial context
        if generated_answer != gold_answer:
            prompt_cot = generated_cot + "\nWait,"
            D_self.append({'question': question,
                           'adversarial_context': prompt_cot,
                           'gold_answer': gold_answer})
    return D_self
```
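The filtering step described in note 5 is not part of the pseudocode. A minimal sketch of one way to realize it, assuming each checkpoint's dataset has already been built with the function above (the data layout is illustrative, not from the paper):

```python
# Sketch: keep only questions that every checkpoint in a model family
# (e.g., OLMo-2-7B) initially answered incorrectly.
def filter_common_failures(per_checkpoint_datasets):
    # per_checkpoint_datasets: {checkpoint_name: list of dataset entries}
    failed_question_sets = [
        {entry['question'] for entry in entries}
        for entries in per_checkpoint_datasets.values()
    ]
    common_failures = set.intersection(*failed_question_sets)
    return {
        name: [e for e in entries if e['question'] in common_failures]
        for name, entries in per_checkpoint_datasets.items()
    }
```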
- Measuring Reflection:
- Explicit Reflection Classifier: A prompt-based classifier using DeepSeek-V3 is developed. It's given a definition of reflection and few-shot examples to identify outputs containing explicit reflection phrases (validated against human annotations, showing high precision but moderate recall).
- Metrics (see Table 4 in the paper; a computation sketch follows this list):
- Accuracy: Fraction of tasks solved correctly (measures overall success).
- Explicit Reflection Rate: Fraction of outputs identified as explicit reflection by the classifier (measures tendency to be explicit).
- Explicit Reflection Accuracy: Fraction of tasks solved correctly with explicit reflection (measures successful explicit correction).
- Implicit Reflection Accuracy: Fraction of tasks solved correctly without explicit reflection (Accuracy - Explicit Reflection Accuracy).
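These four metrics reduce to simple counting once each evaluated output carries a correctness flag and the classifier's explicit-reflection label. A minimal sketch, with illustrative field names:

```python
# Sketch: computing the four reflection metrics from per-task results.
# Each result is assumed to carry 'correct' (bool) and 'explicit_reflection'
# (bool, from the DeepSeek-V3-based classifier); field names are illustrative.
def reflection_metrics(results):
    n = len(results)
    accuracy = sum(r['correct'] for r in results) / n
    explicit_rate = sum(r['explicit_reflection'] for r in results) / n
    explicit_accuracy = sum(r['correct'] and r['explicit_reflection'] for r in results) / n
    implicit_accuracy = accuracy - explicit_accuracy
    return {
        'accuracy': accuracy,
        'explicit_reflection_rate': explicit_rate,
        'explicit_reflection_accuracy': explicit_accuracy,
        'implicit_reflection_accuracy': implicit_accuracy,
    }
```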
Experimental Setup
- Models: Various pre-training checkpoints of OLMo-2 (7B, 13B, 32B) and released Qwen2.5 models (0.5B to 72B). Pre-training compute is estimated as $6nt$ (6 × parameters × tokens); a worked example follows this list.
- Base Datasets: Tasks sourced from BBH, CruxEval (code input/output prediction), GSM8K & GSM8K-Platinum (math word problems), and TriviaQA (knowledge/comprehension). Adversarial versions were created using the algorithms above.
- Infrastructure: Used vLLM and SGLang for inference on AMD MI300x GPUs.
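For concreteness, the $6nt$ estimate works out as follows for a hypothetical checkpoint (the numbers are illustrative, not figures from the paper):

```python
# Rough pre-training FLOPs estimate: 6 * parameters * tokens.
# Numbers below are hypothetical, purely for illustration.
n_params = 7e9    # a 7B-parameter model
n_tokens = 2e12   # trained on 2T tokens
pretrain_flops = 6 * n_params * n_tokens
print(f"{pretrain_flops:.2e}")  # -> 8.40e+22 FLOPs
```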
Key Findings and Practical Implications
- Reflection Emerges During Pre-training: Contrary to common belief, both situational and self-reflection capabilities appear early in pre-training (e.g., OLMo-2-7B with <200B tokens) and consistently improve as pre-training compute increases. This suggests pre-training lays the foundation for complex reasoning.
- Increasing Reliance on Explicit Reflection: As models undergo more pre-training, they not only get better at correcting errors (higher accuracy) but also become more likely to use explicit reflection to do so. This is shown by strong positive correlations between log(pre-training compute) and Accuracy, Explicit Reflection Rate, and Explicit Reflection Accuracy, while Implicit Reflection Accuracy often shows weaker or negative correlations.
- Effectiveness of Simple Triggers: The "Wait," trigger significantly boosts explicit reflection and overall accuracy compared to providing no trigger. This offers a simple, practical inference-time technique for improving reasoning and error correction without complex prompting. Analysis suggests that, when it succeeds, the trigger works by switching the model into its explicit error-checking mode.
- Self-Reflection is More Challenging: Models find it harder to correct their own mistakes on problems they initially failed (self-reflection setting). While accuracy gains are smaller, the rate of explicit self-reflection still increases robustly with pre-training, indicating that the ability to identify errors develops even if correction isn't always successful.
- Train-Time vs. Test-Time Compute Trade-off: Investing more compute in pre-training reduces the test-time compute (measured as the number of tokens/words generated using sequential "Wait," triggers; see the sketch after this list) required to reach a given accuracy on these reflection tasks. This has direct implications for allocating compute between training and inference.
- Scalable Benchmarking: The proposed adversarial dataset generation provides a systematic and relatively inexpensive way to create benchmarks for evaluating reflection across different domains and model stages.
- Automated Analysis: The explicit reflection classifier offers a tool to automatically analyze how models are attempting to correct errors, complementing simple accuracy metrics.
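The test-time side of this trade-off (the sequential-trigger measurement referenced above) can be sketched as follows, assuming a hypothetical `model.generate()` interface and `extract_answer` parser, with test FLOPs estimated as $2nw$:

```python
# Sketch: spend more test-time compute by appending sequential "Wait," triggers.
# Test-time FLOPs are estimated as 2 * parameters * generated tokens (2nw).
# `model.generate`, `model.num_parameters`, and `extract_answer` are assumed
# interfaces for illustration, not APIs from the paper.
def solve_with_sequential_waits(model, question, adversarial_cot, max_rounds=4):
    prompt = f"{question}\n{adversarial_cot}"
    generated_words = 0
    answer = None
    for _ in range(max_rounds):
        prompt += "\nWait,"                    # trigger another round of reflection
        continuation = model.generate(prompt)  # hypothetical generation call
        generated_words += len(continuation.split())
        prompt += continuation
        answer = extract_answer(continuation)  # task-specific parser (assumed)
        if answer is not None:
            break
    test_flops = 2 * model.num_parameters * generated_words
    return answer, test_flops
```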
Implementation Considerations
- Dataset Creation: Requires access to capable "frontier" models (like GPT-4o, DeepSeek-V3) for generating correct and adversarial CoTs, especially for situational-reflection datasets. Automated checks (e.g., ensuring adversarial CoTs don't reveal the correct answer) are crucial for quality.
- Classifier Reliability: The LLM-based explicit reflection classifier has limitations (high precision, lower recall). Results that depend on it should be read with the caveat that it may under-report instances of reflection.
- Compute: Evaluating multiple checkpoints across several datasets requires significant inference compute. The train-test compute trade-off analysis (Section 5.4) uses standard FLOPs estimations ($6nt$ for train, $2nw$ for test).
- Trigger Choice: While "Wait," was effective, other simple interjections might also work and could be explored.
In summary, the paper provides strong evidence that reflection is an emergent capability developed during pre-training, offers practical methods (adversarial datasets, simple triggers, automated classification) for evaluating and eliciting this capability, and highlights the trade-offs involved in developing reflective models through pre-training versus test-time compute.