PruneCD: Enhanced Contrastive Decoding
- PruneCD is a contrastive decoding framework for LLMs that uses targeted layer pruning to construct a more informative amateur model.
- It mitigates hallucinations by generating sharper token probability contrasts and reducing flat, uninformative output distributions.
- Empirical evaluations across benchmarks demonstrate that PruneCD enhances factuality with minimal inference overhead and seamless integration.
PruneCD is a contrastive decoding framework for LLMs that uses layer-pruned self-contrast to improve the factuality of generated text. Unlike prior approaches such as DoLa, which rely on early exit logits as a contrastive prior and consequently suffer from flat, uninformative output distributions, PruneCD constructs its “amateur” model by pruning selected intermediate layers from the full model. This layer-pruned design produces a sharper, more meaningful contrast in the decoded token probabilities, substantially improving hallucination mitigation while keeping inference overhead minimal (Yu et al., 20 Sep 2025).
1. Motivation and Conceptual Foundations
LLMs often generate hallucinated or factually inconsistent content due to overconfidence in poorly grounded outputs. Contrastive decoding (CD) addresses this by penalizing tokens favored by an “amateur” model that is presumed to be more uncertain or less reliable than the “expert” model. Traditional CD implementations use early network exits for the amateur model, but analysis shows that such logits tend to be high-entropy, flat, and uninformative, resulting in weak contrast and limited improvement in factuality.
PruneCD overcomes these limitations by constructing the amateur model via targeted layer pruning. By removing specific intermediate layers (rather than truncating the layer stack from the top), the pruned model’s output distribution is less uniform and better aligned with the expert, yielding stronger and more informative contrast signals for decoding.
2. Methodology and Formalism
In PruneCD, the core mechanism produces “expert” and “amateur” logits at each token generation step and combines them into a contrastive score. For a given input sequence $x_{<t}$ and candidate next token $x_t$, the score is evaluated as:

$$S(x_t \mid x_{<t}) = \log p_{\mathcal{L}}(x_t \mid x_{<t}) - \log p_{\mathcal{L} \setminus \mathcal{P}}(x_t \mid x_{<t}),$$

where $p_{\mathcal{L}}$ is the probability from the expert model (using the full layer set $\mathcal{L}$), and $p_{\mathcal{L} \setminus \mathcal{P}}$ is from the layer-pruned amateur model (using $\mathcal{L} \setminus \mathcal{P}$ for a pruned subset $\mathcal{P}$).
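The following sketch shows how such a score can be computed once the expert and amateur logits are in hand. It is a minimal illustration rather than the paper's implementation; in particular, the adaptive plausibility mask with threshold `alpha` is an assumption borrowed from common contrastive decoding practice, not something stated above.

```python
import torch
import torch.nn.functional as F

def contrastive_scores(expert_logits: torch.Tensor,
                       amateur_logits: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """Combine expert (full-model) and amateur (layer-pruned) logits into
    contrastive next-token scores for a single decoding step.

    expert_logits, amateur_logits: tensors of shape (vocab_size,).
    alpha: assumed plausibility threshold; tokens the expert deems
    implausible are masked out, as is common in contrastive decoding.
    """
    log_p_expert = F.log_softmax(expert_logits, dim=-1)
    log_p_amateur = F.log_softmax(amateur_logits, dim=-1)

    # Core contrast: log p_L(x_t) - log p_{L\P}(x_t)
    scores = log_p_expert - log_p_amateur

    # Optional adaptive plausibility mask: restrict the contrast to tokens
    # the expert itself assigns at least alpha * max probability.
    p_expert = log_p_expert.exp()
    scores = scores.masked_fill(p_expert < alpha * p_expert.max(), float("-inf"))
    return scores

# Greedy contrastive decoding then picks scores.argmax(dim=-1) as the next token.
```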
Layer selection for pruning is determined via a factual layer search, in which each candidate layer’s ablation is empirically assessed for its impact on a factuality metric over datasets such as TruthfulQA. The layers whose removal produces the greatest factuality drop are selected, and the top-$k$ such layers constitute the pruned set $\mathcal{P}$.
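A sketch of one plausible reading of this offline search is given below: each layer is ablated in turn, factuality is re-measured, and the $k$ layers whose removal causes the largest drop form $\mathcal{P}$. The `evaluate_factuality` callable is a hypothetical stand-in for the paper's evaluation pipeline (e.g., a TruthfulQA scorer).

```python
from typing import Callable, List, Sequence

def factual_layer_search(evaluate_factuality: Callable[[Sequence[int]], float],
                         num_layers: int,
                         k: int) -> List[int]:
    """One-at-a-time layer ablation to choose the pruned layer set P.

    evaluate_factuality(skip_layers) is a caller-supplied (hypothetical)
    function that runs the model with the given layers skipped and returns
    a factuality score, e.g. on TruthfulQA.
    """
    baseline = evaluate_factuality([])  # full model, no layers skipped
    drops = []
    for layer_idx in range(num_layers):
        score = evaluate_factuality([layer_idx])      # ablate a single layer
        drops.append((baseline - score, layer_idx))   # larger drop => more "factual" layer
    # The k layers whose removal degrades factuality most form the pruned set P.
    drops.sort(reverse=True)
    return sorted(idx for _, idx in drops[:k])
```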
3. Comparison with Early Exit and DoLa
Prior contrastive decoding (DoLa) employs early exit logits, relying on model outputs after halting forward propagation at an intermediate layer. Analysis reveals that such strategies yield distributions with high entropy (flatness),

$$H(p) = -\sum_{v \in \mathcal{V}} p(v) \log p(v),$$

and extremely low overlap with the expert’s top-$k$ predicted tokens,

$$\mathrm{Overlap}@k = \left| \mathrm{Top}\text{-}k\!\left(p_{\text{amateur}}\right) \cap \mathrm{Top}\text{-}k\!\left(p_{\text{expert}}\right) \right|.$$
Empirical measurements show that early exit logits have far higher entropy than those of the full or layer-pruned models (e.g., 11.75 vs. 1.37), and a much lower top-25 token overlap with the expert (0.43 vs. 15.50 tokens out of 25). Layer-pruned logits in PruneCD retain substantially more structure and informative variation, directly addressing the flatness and informativeness deficits of early exit approaches.
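For concreteness, the two diagnostics above can be computed per decoding step as in the sketch below; averaging over prompts (e.g., the 1,000 TriviaQA samples discussed in the next section) is left to the caller.

```python
import torch
import torch.nn.functional as F

def entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy H(p) = -sum_v p(v) log p(v) of a next-token distribution."""
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1)

def topk_overlap(amateur_logits: torch.Tensor,
                 expert_logits: torch.Tensor,
                 k: int = 25) -> int:
    """Count of tokens shared by the amateur's and expert's top-k predictions."""
    amateur_top = set(amateur_logits.topk(k).indices.tolist())
    expert_top = set(expert_logits.topk(k).indices.tolist())
    return len(amateur_top & expert_top)
```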
4. Empirical Results and Analysis
Qualitative and quantitative analyses corroborate the superiority of layer pruning for constructing informative contrasts:
- Visualization of logits and softmax outputs demonstrates that early exit logits are nearly uniform across the output space, whereas layer-pruned logits display differentiated structures more closely aligned with the expert model.
- Over 1,000 TriviaQA samples, average entropy and top-25 overlap metrics validate that PruneCD’s amateur logits are substantially more informative.
- Across benchmarks (TruthfulQA, TriviaQA, Natural Questions, GSM8K) and model scales (1B–8B parameters), PruneCD consistently improves factuality over greedy and DoLa decoding.
- The offline search and batched evaluation of both expert and pruned logits enable the method to incur only minimal inference overhead relative to standard greedy decoding.
5. Implementation and Practical Integration
PruneCD is designed for practical deployment:
- The factual layer search for the optimal pruning set is conducted in a brief offline phase, typically via one-at-a-time layer ablation and assessment of factuality loss.
- During inference, PruneCD requires only a single forward pass (with a slightly modified computational graph) to obtain both expert and pruned logits, enabling efficient batched computation; a rough illustration follows this list.
- No additional models, external data, or extensive hyperparameter tuning are required, fostering drop-in compatibility with existing LLM decoding pipelines.
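As an illustration of how expert and layer-pruned logits can be obtained together, the sketch below simply skips the pruned blocks during a second decoder pass. This is not the paper's exact single-pass batched formulation; the attribute names (`model.embed`, `model.layers`, `model.head`) are assumptions standing in for a generic decoder-only architecture.

```python
import torch

@torch.no_grad()
def expert_and_pruned_logits(model, input_ids, pruned_layers):
    """Return (expert_logits, amateur_logits) at the final sequence position.

    Assumes a generic decoder-only model exposing:
      model.embed(input_ids) -> hidden states of shape (batch, seq, dim)
      model.layers           -> ordered list of transformer blocks
      model.head(hidden)     -> vocabulary logits
    These attribute names are illustrative, not a specific library's API.
    """
    def forward(skip):
        h = model.embed(input_ids)
        for i, block in enumerate(model.layers):
            if i in skip:
                continue  # pruned block: hidden state passes through unchanged
            h = block(h)
        return model.head(h)[:, -1, :]

    expert_logits = forward(skip=set())                 # full layer set L
    amateur_logits = forward(skip=set(pruned_layers))   # pruned set L \ P
    return expert_logits, amateur_logits
```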
6. Implications and Applications
The PruneCD approach provides a robust, efficient method for reducing hallucination in LLMs. By leveraging a principled amateur construction via layer pruning, it achieves a better tradeoff between informativeness and uncertainty than early exit. This allows the contrastive penalty to more effectively suppress overconfident, factually unsupported outputs.
Applications include open-domain dialogue, factual QA, content generation, and any context where factual reliability is paramount. The minimal computational overhead and absence of extra training or calibration requirements make PruneCD particularly attractive for latency-sensitive or production-grade environments.
7. Summary
PruneCD advances the field of contrastive decoding for LLMs by replacing early exit amateur models with an empirically optimized, layer-pruned alternative. This design yields non-flat, informative contrast signals backed by rigorous entropy and overlap analysis, and demonstrates marked factuality improvement across benchmarks. Its practical efficiency and plug-and-play nature position it as a robust strategy for mitigating hallucinations in large-scale LLMs (Yu et al., 20 Sep 2025).