
Implicit Chain of Thought Reasoning via Knowledge Distillation (2311.01460v1)

Published 2 Nov 2023 in cs.CL, cs.AI, and cs.LG

Abstract: To augment LLMs with the ability to reason, researchers usually prompt or finetune them to produce chain of thought reasoning steps before producing the final answer. However, although people use natural language to reason effectively, it may be that LMs could reason more effectively with some intermediate computation that is not in natural language. In this work, we explore an alternative reasoning approach: instead of explicitly producing the chain of thought reasoning steps, we use the LLM's internal hidden states to perform implicit reasoning. The implicit reasoning steps are distilled from a teacher model trained on explicit chain-of-thought reasoning, and instead of doing reasoning "horizontally" by producing intermediate words one-by-one, we distill it such that the reasoning happens "vertically" among the hidden states in different layers. We conduct experiments on a multi-digit multiplication task and a grade school math problem dataset and find that this approach enables solving tasks previously not solvable without explicit chain-of-thought, at a speed comparable to no chain-of-thought.


The paper introduces a novel approach to reasoning in LLMs by leveraging implicit rather than explicit chain of thought (CoT) methodologies. The authors propose that while explicit CoT, which articulates each reasoning step in a sequence, aligns with human problem-solving, it may not fully utilize the computational efficiencies offered by modern transformer architectures. The implicit CoT framework seeks to harness vertical reasoning, i.e., reasoning through hidden states across layers, rather than the traditional horizontal, sequential articulation of intermediate steps.

Methodology

The proposed three-step strategy involves:

  1. Mind-Reading the Teacher: A student model is trained to read the teacher model's continuous hidden states, produced as the teacher generates explicit chain-of-thought reasoning, and to predict the final answer directly; the student never generates the intermediate steps itself.
  2. Thought Emulation: Knowledge distillation is employed to train an emulator that predicts these hidden states directly from the input. The emulator uses vertical reasoning, thus eliminating explicit horizontal reasoning steps required by traditional CoT.
  3. Couple and Optimize: The emulator and student are coupled together and further optimized end-to-end, allowing the student model to potentially develop its own efficient reasoning pathways that might differ from the teacher's methods.
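The three roles above can be made concrete with a toy numpy sketch. Everything here is an illustrative assumption (random linear maps standing in for trained transformers; MSE and cross-entropy standing in for the distillation and answer objectives), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the paper's model sizes).
d_model, n_layers, n_classes = 16, 4, 10

# Stand-ins for learned parameters.
W_teacher = rng.normal(size=(n_layers, d_model, d_model)) * 0.1
W_emulator = rng.normal(size=(n_layers, d_model, d_model)) * 0.1
W_student = rng.normal(size=(n_layers * d_model, n_classes)) * 0.1

def teacher_states(x):
    """Hidden states the teacher produces while generating explicit CoT
    (here: fixed random projections standing in for a trained transformer)."""
    return np.stack([np.tanh(W @ x) for W in W_teacher])

def emulate(x):
    """Step 2 (Thought Emulation): predict the teacher's hidden states
    directly from the input, layer by layer ('vertical' reasoning)."""
    return np.stack([np.tanh(W @ x) for W in W_emulator])

def student_answer(states):
    """Step 1 (Mind-Reading the Teacher): map hidden states straight to
    an answer distribution, with no intermediate tokens generated."""
    logits = states.reshape(-1) @ W_student
    e = np.exp(logits - logits.max())
    return e / e.sum()

x = rng.normal(size=d_model)        # input embedding
z_teacher = teacher_states(x)       # supervision for the emulator
z_hat = emulate(x)

emulation_loss = float(np.mean((z_hat - z_teacher) ** 2))  # step-2 objective
probs = student_answer(z_hat)                              # step 3 couples the two
label = 3                                                  # a toy answer class
student_loss = float(-np.log(probs[label]))                # answer objective
```

In the actual method, the first two stages are trained separately and then the emulator-student stack is fine-tuned end to end (step 3), so gradients from the answer loss can reshape the emulated states into reasoning pathways that need not match the teacher's.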

Experimental Results

The paper reports experiments on multi-digit multiplication and grade school math problems. Key findings include:

  • On five-digit multiplication, implicit CoT with GPT-2 Medium reached 96% accuracy, compared with just 2% without any CoT.
  • The method also decoded at a speed comparable to no-CoT inference, whereas explicit CoT incurs notable latency from generating every intermediate step as tokens.
  • On grade school math problems, implicit CoT reached 22% accuracy, still below explicit CoT but promising given the computational efficiency gained.
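The efficiency claim follows from simple token counting: explicit CoT must decode every intermediate step autoregressively, while implicit CoT decodes only the answer. As a back-of-the-envelope sketch with assumed (not paper-reported) token counts and decoding speed:

```python
# Illustrative latency comparison; all numbers are assumptions.
cot_tokens = 120       # assumed intermediate-step tokens for 5x5 multiplication
answer_tokens = 10     # digits of the final product
tokens_per_sec = 40.0  # assumed decoding speed

explicit_latency = (cot_tokens + answer_tokens) / tokens_per_sec
implicit_latency = answer_tokens / tokens_per_sec
speedup = explicit_latency / implicit_latency

print(f"explicit: {explicit_latency:.2f}s  implicit: {implicit_latency:.2f}s  "
      f"speedup: {speedup:.1f}x")
# → explicit: 3.25s  implicit: 0.25s  speedup: 13.0x
```

The speedup grows with the length of the reasoning chain, since the implicit approach's cost depends only on the input and answer lengths.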

Implications and Future Directions

The introduction of implicit CoT offers implications both practically and theoretically:

  • Efficiency and Scalability: The strategy allows for faster inference by bypassing intermediate token generation, thus paving the way for more real-time applications in reasoning-intensive tasks.
  • Model Optimization: By leveraging the internal hidden states of LLMs, it promotes optimization in the vertical dimension, aligning with the core transformer architecture.
  • Potential in Model Training: This approach suggests potential integration into pretraining regimes, fostering models capable of both explicit and implicit reasoning.

However, the methodology is less transparent than explicit CoT, exemplifying the trade-off between interpretability and efficiency. Further studies could address this by developing methods that preserve the efficiency gains while still exposing insight into the model's internal reasoning processes.

Conclusion

The paper presents a compelling strategy to leverage transformer architectures efficiently via implicit CoT reasoning. While explicit CoT continues to outperform in accuracy, the research demonstrates promising advancements in reasoning efficiency. Future explorations might focus on expanding model layers to better capture complex reasoning tasks and integrating implicit reasoning into LLM pretraining. The findings offer a pathway toward enhancing the computational potential of LLMs while providing a foundation for furthering AI reasoning capabilities.

Authors (6)
  1. Yuntian Deng (44 papers)
  2. Kiran Prasad (1 paper)
  3. Roland Fernandez (14 papers)
  4. Paul Smolensky (31 papers)
  5. Vishrav Chaudhary (45 papers)
  6. Stuart Shieber (6 papers)
Citations (33)