Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping (2407.10795v1)

Published 15 Jul 2024 in cs.CL

Abstract: Decoding by contrasting layers (DoLa), is designed to improve the generation quality of LLMs by contrasting the prediction probabilities between an early exit output (amateur logits) and the final output (expert logits). However, we find that this approach does not work well on non-English tasks. Inspired by previous interpretability work on language transition during the model's forward pass, we discover that this issue arises from a language mismatch between early exit output and final output. In this work, we propose an improved contrastive decoding algorithm that is effective for diverse languages beyond English. To obtain more helpful amateur logits, we devise two strategies to skip a set of bottom, language-agnostic layers based on our preliminary analysis. Experimental results on multilingual reasoning benchmarks demonstrate that our proposed method outperforms previous contrastive decoding baselines and substantially improves LLM's chain-of-thought reasoning accuracy across 11 languages. The project will be available at: https://github.com/NJUNLP/SkipLayerCD.

A Review of "Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping"

The paper "Multilingual Contrastive Decoding via Language-Agnostic Layers Skipping" by Wenhao Zhu, Sizhe Liu, Shujian Huang, Shuaijie She, Chris Wendler, and Jiajun Chen introduces an advanced approach to enhance the text generation capabilities of LLMs across multiple languages. This work addresses the limitations of the Decoding by Contrasting Layers (DoLa) method on non-English tasks, incorporating insights from recent interpretability studies on language transitions during model forward pass to propose a more effective decoding strategy.

Contrastive Decoding and its Challenges

Contrastive decoding improves text generation quality by contrasting, at each decoding step, the logits of a capable expert model against those of a weaker amateur model, penalizing tokens the amateur favors. The method requires a suitable amateur model whose logits can be contrasted against the expert's to reduce logical errors, which becomes a practical obstacle when no smaller model from the same family is available. DoLa sidesteps this requirement by contrasting the expert model's final-layer logits against early-exit logits obtained from its own intermediate layers.
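
To make the mechanism concrete, the scoring rule can be written as a small function. The sketch below is a minimal illustration in PyTorch, assuming per-step logit vectors from the expert and amateur passes; the adaptive plausibility cutoff `alpha` follows the standard contrastive-decoding formulation and is not code from this paper.

```python
import torch
import torch.nn.functional as F

def contrastive_scores(expert_logits: torch.Tensor,
                       amateur_logits: torch.Tensor,
                       alpha: float = 0.1) -> torch.Tensor:
    """Score next-token candidates by contrasting expert and amateur logits.

    Both inputs have shape (vocab_size,) for the current decoding step.
    Tokens whose expert probability falls below alpha times the maximum
    expert probability are masked out, so the contrast cannot promote
    tokens the expert itself considers implausible.
    """
    expert_logp = F.log_softmax(expert_logits, dim=-1)
    amateur_logp = F.log_softmax(amateur_logits, dim=-1)

    # Adaptive plausibility constraint: keep only reasonably likely tokens.
    cutoff = expert_logp.max() + torch.log(torch.tensor(alpha))
    plausible = expert_logp >= cutoff

    scores = expert_logp - amateur_logp           # contrastive objective
    return scores.masked_fill(~plausible, float("-inf"))

# Greedy contrastive decoding picks scores.argmax() at each step.
```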

However, DoLa's efficacy diminishes on non-English tasks because of a language mismatch. Using the logit lens, the authors show that predictions decoded from early and middle layers are dominated by English tokens even when the target language is different, so the early-exit amateur and the final expert distributions differ mainly in language rather than in the reasoning quality the contrast is meant to penalize. The mismatch therefore yields little useful contrastive signal, highlighting the need for a strategy that accommodates multilingual generation.
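
The language-mismatch diagnosis can be reproduced with a simple logit-lens probe: project each layer's hidden state through the model's final normalization and unembedding, and inspect which token each layer would predict. The sketch below assumes a Hugging Face LLaMA-style checkpoint whose submodules are named `model.model.norm` and `model.lm_head`; the checkpoint and the French prompt are illustrative choices, not taken from the paper's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # any LLaMA-style model works the same way
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

prompt = "Question : combien font 12 fois 7 ? Réponse :"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Logit lens: decode every layer's last-position hidden state with the
# final norm + unembedding and report the top token per layer.
for layer_idx, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.model.norm(h[:, -1, :]))
    top_token = tok.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d}: {top_token!r}")

# Early and middle layers tend to surface English tokens even for a
# non-English prompt; only the top layers settle into the target language.
```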

Proposed Method: Language-Agnostic Layer Skipping

To address the identified limitation, the authors propose an improved contrastive decoding mechanism that skips a set of language-agnostic bottom layers while preserving computations in the upper layers. This takes advantage of the model's three-phase working pattern during forward computation: understanding the context, generating concept-level representations, and converting these representations into the target language. Because the skipped span falls in the early, language-agnostic phase, the amateur pass still produces output in the target language; it is weakened by the loss of context understanding rather than by a change of language, which is exactly the kind of amateur the contrast needs.

Two strategies are devised for effective layer skipping:

  1. Heuristic Layer Skipping (SL-H): randomly skips some of the layers in the lower half of the model (excluding the lowest four), following predefined heuristics.
  2. Dynamic Layer Skipping (SL-D): automatically determines the span to skip from the change in entropy of intermediate predictions, targeting the transition from the context-understanding phase to the concept-generation phase (a sketch of how a skipped-layer amateur pass can be obtained follows this list).
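
Obtaining the amateur logits then amounts to a second forward pass in which the chosen span of bottom layers is bypassed. A minimal way to do this for a Hugging Face LLaMA-style model is to temporarily slice `model.model.layers`; the helper below is a hypothetical sketch of that idea rather than the authors' released implementation, and the skipped span is supplied by the caller (chosen heuristically for SL-H or via the entropy criterion for SL-D).

```python
import torch
from contextlib import contextmanager

@contextmanager
def skip_layers(model, start: int, end: int):
    """Temporarily drop decoder layers [start, end) from the forward pass.

    Assumes a LLaMA-style model whose decoder stack lives in
    `model.model.layers` (an nn.ModuleList). The original stack is
    restored on exit, so the expert pass is unaffected.
    """
    full = model.model.layers
    kept = torch.nn.ModuleList(
        layer for i, layer in enumerate(full) if not (start <= i < end)
    )
    model.model.layers = kept
    try:
        yield model
    finally:
        model.model.layers = full

def expert_and_amateur_logits(model, input_ids, span=(4, 16)):
    """Run one normal pass (expert) and one skipped pass (amateur).

    The span (4, 16) is an arbitrary illustrative choice; use_cache is
    disabled because the KV cache is not needed for a single pass.
    """
    with torch.no_grad():
        expert = model(input_ids, use_cache=False).logits[:, -1, :]
        with skip_layers(model, *span):
            amateur = model(input_ids, use_cache=False).logits[:, -1, :]
    return expert, amateur
```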

Experimental Results

Comprehensive experiments on multilingual reasoning benchmarks (mGSM) and English reasoning tasks (AQuA, GSM8K, GSM-Plus) demonstrate the superiority of the proposed method over DoLa and direct inference. Specifically, the paper presents results with Mistral-7B, Baichuan2-7B, Deepseek-7B, LLaMA3-8B, and LLaMA2-13B models, showing significant improvements in reasoning accuracy:

  • On the multilingual mGSM dataset, layer skipping strategies (SL-H and SL-D) consistently outperform direct inference and DoLa, with SL-D showing slightly better performance than SL-H in most cases.
  • On the English reasoning benchmarks, the method performs comparably to vanilla contrastive decoding while dispensing with the separate amateur model that vanilla contrastive decoding requires.

Performance Insights

The findings confirm that skipping the designated layer spans yields more useful amateur logits, and they provide empirical support for the language-transition account of the model's forward pass.

Implications and Future Directions

This research has significant implications for the development and deployment of LLMs in multilingual settings. By enhancing decoding strategies, LLMs can generate higher quality, more contextually relevant text across a broader range of languages. Future studies can expand upon this work by:

  • Exploring layer skipping in LLM architectures with alternative or mixed computation patterns, such as Mixture-of-Experts.
  • Investigating the impact of different types of perturbations in the context understanding phase beyond layer skipping.

Conclusion

The paper makes a notable contribution to multilingual natural language generation by proposing a more robust way of integrating contrastive decoding with LLMs. Its empirical validation through extensive experimentation underscores the importance and utility of maintaining computations in upper model layers to ensure coherent and logical text generation across diverse languages. This work opens avenues for further refinement of decoding strategies and deeper understanding of LLM internals.
