- The paper introduces a dual-attention mechanism that combines causal and bidirectional features for more effective instruction tuning.
- It leverages low-rank adaptation to enable parameter-efficient finetuning, yielding consistent improvements in commonsense and arithmetic reasoning tasks.
- Experiments demonstrate that blending causal and bidirectional attention features with learnable mixing coefficients significantly outperforms standard LoRA finetuning across various model architectures.
Bitune: A Hybrid Attention Approach for Instruction Tuning in Pretrained Decoder-Only LLMs
The paper introduces Bitune, a novel technique for instruction-tuning pretrained decoder-only LLMs. Traditional models employ either causal or bidirectional attention; Bitune integrates both to enhance performance on downstream tasks while leveraging parameter-efficient finetuning (PEFT) techniques.
Methodology
Bitune applies both causal and bidirectional attention to the input instruction during the prefilling phase, producing two sets of features: causal and bidirectional. These features are combined using trainable mixing coefficients and then used to generate new tokens autoregressively. This dual-attention strategy aims to provide a more informative and robust representation of the instruction.
Two Sets of Features
Bitune introduces dual passes over the input instruction:
- Causal Attention Features: These are obtained by passing the instruction through the model using causal attention, which processes tokens sequentially, considering only past inputs.
- Bidirectional Attention Features: Derived from a separate pass using bidirectional attention, which lets every token attend to both preceding and succeeding tokens for a richer contextual representation.
The two feature sets are combined through a learnable weighted-average mechanism, and the mixed features then serve as the keys and values (KV cache) for the autoregressive generation of new tokens.
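The snippet below is a minimal sketch of this mechanism under stated assumptions, not the authors' reference implementation: it assumes per-layer scalar mixing coefficients passed through a sigmoid, and the names (`KVMixer`, `attention_mask`) and tensor shapes are illustrative. It shows the two prefill masks (causal vs. full) and how the resulting KV caches could be blended before decoding.

```python
# Illustrative sketch of Bitune-style KV mixing (hypothetical names and shapes).
import torch
import torch.nn as nn

class KVMixer(nn.Module):
    """Blends causal and bidirectional keys/values with learnable per-layer coefficients."""
    def __init__(self, num_layers: int):
        super().__init__()
        # One mixing coefficient per layer; sigmoid keeps the weight in (0, 1).
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, kv_causal, kv_bidir):
        mixed = []
        for layer_idx, ((k_c, v_c), (k_b, v_b)) in enumerate(zip(kv_causal, kv_bidir)):
            w = torch.sigmoid(self.logits[layer_idx])
            mixed.append((w * k_b + (1 - w) * k_c,
                          w * v_b + (1 - w) * v_c))
        return mixed

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """Causal prefill sees only past tokens; bidirectional prefill sees the full instruction."""
    if causal:
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

# Usage with toy tensors: 2 layers, batch 1, 4 heads, 5 instruction tokens, head dim 8.
# In a real model, the two caches would come from two prefill passes using the masks above.
layers, heads, toks, dim = 2, 4, 5, 8
kv_causal = [(torch.randn(1, heads, toks, dim), torch.randn(1, heads, toks, dim))
             for _ in range(layers)]
kv_bidir = [(torch.randn(1, heads, toks, dim), torch.randn(1, heads, toks, dim))
            for _ in range(layers)]
mixed_cache = KVMixer(layers)(kv_causal, kv_bidir)  # feed this cache to autoregressive decoding
```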
Parameter Efficient Finetuning
To add the bidirectional pass without a significant increase in trainable parameters, Bitune adopts low-rank adaptation (LoRA) for parameter-efficient finetuning. Multiple sets of lightweight adapters are applied to the frozen pretrained model, integrating bidirectional attention while preserving the original model structure and efficiency.
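As a hedged illustration of how such dual adapters might look, the sketch below wraps a frozen linear projection with two independent LoRA adapter pairs, one selected for the causal pass and one for the bidirectional pass. The class name `DualLoRALinear`, the rank, and the scaling are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative sketch: one frozen projection, two LoRA adapter sets (hypothetical design).
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        in_f, out_f = base.in_features, base.out_features
        self.scale = alpha / rank
        # Separate low-rank adapter pairs for the two prefill passes.
        self.a_causal = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.b_causal = nn.Parameter(torch.zeros(out_f, rank))
        self.a_bidir = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.b_bidir = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor, pass_type: str = "causal") -> torch.Tensor:
        a, b = (self.a_causal, self.b_causal) if pass_type == "causal" \
               else (self.a_bidir, self.b_bidir)
        # Frozen base output plus the low-rank update for the selected pass.
        return self.base(x) + self.scale * (x @ a.T @ b.T)

# Usage: wrap a projection from the pretrained model and pick the adapter per pass.
proj = DualLoRALinear(nn.Linear(64, 64))
h = torch.randn(1, 5, 64)
out_causal = proj(h, pass_type="causal")
out_bidir = proj(h, pass_type="bidir")
```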
Experimental Validation
The paper evaluates Bitune across a range of tasks, demonstrating marked improvements in zero-shot performance on commonsense reasoning, arithmetic, and language understanding. Accuracy consistently exceeds that of baseline models finetuned with standard LoRA. Ablation studies further validate the necessity and impact of each component of Bitune, showing that combining both attention mechanisms with learnable mixing coefficients significantly enhances model effectiveness.
Results
The results are significant, showcasing improvements across multiple models and scales:
- Commonsense Reasoning: Tasks like PIQA, ARC, and CSQA showed marked performance improvements, with up to 4 percentage point gains over baseline models.
- Arithmetic and Language Understanding: On GSM8K, a dataset focused on arithmetic reasoning, the approach demonstrated consistent performance gains, suggesting improved reasoning capabilities in generative tasks.
Bitune's improvements are consistent across various LLM architectures, such as Gemma and Llama, ranging from 2 billion to 8 billion parameters. This versatility underscores the method's robustness and general applicability.
Implications
The introduction of Bitune has both practical and theoretical implications:
- Practical: By improving the instruction-following capabilities of LLMs in a parameter-efficient manner, Bitune enables more efficient and effective deployment of LLMs in real-world applications, potentially reducing computational costs and improving response quality.
- Theoretical: This approach reaffirms the value of bidirectional attention mechanisms in NLP tasks and opens avenues for future research on hybrid attention approaches within autoregressive models.
Future Directions
The paper suggests several directions for future research:
- Exploration of Additional PEFT Techniques: While Bitune successfully employs LoRA, further exploration and integration of other PEFT methods could yield additional performance gains.
- Dynamic Attention Mechanisms: Investigating adaptive or dynamic mechanisms for blending causal and bidirectional features could streamline the process and further enhance performance.
- Cross-Task Generalization: Further studies could examine Bitune's effectiveness across a broader set of downstream tasks, including cross-lingual and multimodal applications.
Conclusion
Bitune represents a significant advancement in instruction-tuning for pretrained decoder-only LLMs by integrating causal and bidirectional attention features. Its parameter-efficient design ensures broader applicability and scalability, making it a valuable contribution to NLP research. The empirical results affirm the method's effectiveness across various tasks, demonstrating consistent improvements and establishing a strong case for hybrid attention mechanisms in enhancing model performance.