Liger: Linearizing Large Language Models to Gated Recurrent Structures (2503.01496v2)

Published 3 Mar 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of LLMs transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93% of the Transformer-based LLM's performance using only 0.02% of the pre-training tokens during linearization, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.

Overview of "Liger: Linearizing LLMs to Gated Recurrent Structures"

The paper "Liger: Linearizing LLMs to Gated Recurrent Structures" presents a novel approach to transforming Transformer-based LLMs into linear gated recurrent models, enhancing computational efficiency without substantial performance loss. This method, termed Liger, capitalizes on existing pre-trained weights and introduces innovative strategies to construct efficient models suitable for long-sequence tasks.

LLMs have revolutionized natural language processing, but the softmax attention at the core of the Transformer scales quadratically with sequence length, making long-sequence training and inference expensive in both time and memory. Linear recurrent models provide a promising alternative thanks to their linear-time training and constant-memory inference, yet the cost of pretraining these non-standard architectures from scratch remains prohibitive.
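
For orientation, the recurrence underlying (gated) linear attention replaces the key-value cache, which grows with the sequence, by a fixed-size matrix state. The textbook form below is a standard formulation, not necessarily the exact parameterization used in the paper:

```latex
% Gated linear-attention recurrence: S_t is a fixed-size state,
% so inference memory does not grow with sequence length.
S_t = g_t \odot S_{t-1} + k_t v_t^{\top}, \qquad o_t = S_t^{\top} q_t
```

Because only the state S_t is carried between steps, per-token inference cost and memory stay constant, and training cost is linear in sequence length.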

Key Contributions

  1. Gated Recurrent Linearization: Liger reuses pretrained LLM weights, in particular the key projection, to construct the gating mechanisms of linear recurrent structures, avoiding additional parameter-heavy components. A lightweight fine-tuning pass with Low-Rank Adaptation (LoRA) then restores the linearized models to performance comparable with their original Transformer counterparts (a sketch of this construction follows this list).
  2. Liger Attention: An intra-layer hybrid attention mechanism is introduced, mixing sliding-window softmax attention with linear recurrent modeling. The design retains the softmax non-linearity locally and accelerates linearization, allowing Liger to recover 93% of the performance of standard Transformer-based LLMs using just 0.02% of the pre-training tokens (a sketch of this hybrid appears further below).
  3. Experimentation and Scalability: Extensive experiments on models ranging from 1 billion to 8 billion parameters validate Liger across several benchmarks, including PiQA, ARC, and MMLU. Liger outperforms existing linearization methods such as SUPRA and LoLCATs in both efficiency and preservation of the original model's performance, underscoring its practical applicability.
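
The following PyTorch sketch illustrates the gated linearization idea under stated assumptions: the gate is produced by reusing the pretrained key projection rather than a newly trained module, and the recurrence carries only a fixed-size state. The pooling, activation, feature map, and single-head layout here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLinearRecurrenceSketch(nn.Module):
    """Sketch of gated recurrent linearization: the gate reuses the
    pretrained key projection instead of adding new parameters."""

    def __init__(self, pretrained_Wq: nn.Linear, pretrained_Wk: nn.Linear, pretrained_Wv: nn.Linear):
        super().__init__()
        # Reuse the projections of a pretrained Transformer attention layer.
        self.Wq, self.Wk, self.Wv = pretrained_Wq, pretrained_Wk, pretrained_Wv

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        q, k, v = self.Wq(x), self.Wk(x), self.Wv(x)
        # Gate derived from the key projection (assumption: sigmoid of a mean-pooled key).
        g = torch.sigmoid(k.mean(dim=-1, keepdim=True))        # (batch, seq_len, 1)
        q, k = F.elu(q) + 1, F.elu(k) + 1                      # positive feature map (assumption)

        batch, seq_len, d = q.shape
        S = x.new_zeros(batch, d, d)                           # fixed-size recurrent state
        outputs = []
        for t in range(seq_len):                               # linear time, constant memory per step
            qt, kt, vt, gt = q[:, t], k[:, t], v[:, t], g[:, t]
            # Decay the old state with the gate, then add the new key-value outer product.
            S = gt.unsqueeze(-1) * S + kt.unsqueeze(-1) * vt.unsqueeze(-2)
            outputs.append(torch.einsum('bd,bde->be', qt, S))  # o_t = S_t^T q_t (normalizer omitted)
        return torch.stack(outputs, dim=1)
```

In Liger the linearized layers are then fine-tuned lightly with LoRA; in a sketch like this that would amount to wrapping Wq, Wk, and Wv with low-rank adapters rather than retraining them fully.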

Implications and Future Directions

This research has significant implications for the design of efficient neural architectures. By integrating gated recurrent structures into already pretrained LLMs without extensive retraining, Liger provides a feasible path for deploying capable LLMs on resource-constrained hardware.

Furthermore, the Liger Attention mechanism addresses a core difficulty of adapting linear attention models to settings traditionally dominated by Transformers, preserving the strengths of both approaches. This hybrid architecture could stimulate further exploration of combining architectural paradigms to tackle specific challenges in sequence processing tasks.
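
To make the hybrid concrete, the sketch below mixes causal sliding-window softmax attention with a linear-attention branch inside one layer. The window size, the fixed mixing coefficient, and the branch implementations are assumptions for illustration; the paper's exact combination rule may differ.

```python
import torch
import torch.nn.functional as F

def sliding_window_softmax_attention(q, k, v, window: int):
    """Causal softmax attention restricted to the last `window` positions."""
    batch, seq_len, d = q.shape
    scores = torch.einsum('bqd,bkd->bqk', q, k) / d ** 0.5
    idx = torch.arange(seq_len)
    blocked = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= window)
    scores = scores.masked_fill(blocked, float('-inf'))
    return torch.einsum('bqk,bkd->bqd', scores.softmax(dim=-1), v)

def linear_attention(q, k, v):
    """Ungated linear-attention branch (parallel form, normalizer omitted)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    seq_len = q.shape[1]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = torch.einsum('bqd,bkd->bqk', q, k).masked_fill(~causal, 0.0)
    return torch.einsum('bqk,bkd->bqd', scores, v)

def liger_attention_sketch(q, k, v, window: int = 64, mix: float = 0.5):
    """Intra-layer hybrid: local softmax attention plus a global linear branch.
    The fixed mixing coefficient `mix` is an illustrative assumption."""
    local = sliding_window_softmax_attention(q, k, v, window)
    global_branch = linear_attention(q, k, v)
    return mix * local + (1.0 - mix) * global_branch
```

This readability-first sketch builds dense score matrices in both branches; an efficient implementation computes only the windowed softmax scores and uses the recurrent form of the linear branch, so overall cost grows linearly with sequence length while the softmax non-linearity is retained locally.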

For future work, exploring different configurations of the Liger framework could reveal more efficient designs, and scaling the approach to larger LLMs could further push the boundaries of practical LLM deployment. Such frameworks may also inspire innovation in related domains, including multimodal AI applications where memory and processing time are critical concerns.

In sum, this work revitalizes the potential of linear recurrent models by integrating them with the strengths of established Transformer architectures, proposing a method that balances efficiency with the practicality of reusing pretrained models.

Authors (5)
  1. Disen Lan (7 papers)
  2. Weigao Sun (19 papers)
  3. Jiaxi Hu (12 papers)
  4. Jusen Du (3 papers)
  5. Yu Cheng (354 papers)