Sparse Augmented Tensor Networks for Compression of LLMs
The research paper "Saten: Sparse Augmented Tensor Networks for Post-Training Compression of LLMs" introduces a framework for compressing large language models (LLMs) so they can be deployed on resource-constrained devices. Existing compression methods such as pruning, distillation, and quantization, while beneficial, face two obstacles in the post-training setting: pre-trained weight matrices tend to be high-rank, which limits plain low-rank factorization, and the original pre-training data is typically unavailable. To address these challenges, the authors present the Sparse Augmented Tensor Networks (Saten) framework, which combines tensor-train (TT) decomposition with a sparse approximation of the decomposition error.
Contributions and Methodology
The primary contribution of this work is the Saten framework itself, which integrates sparsity with low-rank tensor networks to recover accuracy while preserving compression. The framework is evaluated in two variants (both sketched in code after this list):
- Saten(u): Uses unstructured sparsity, retaining the largest-magnitude entries of the error matrix.
- Saten(2:4): Uses 2:4 structured sparsity (two nonzeros in every group of four consecutive weights), a pattern that maps efficiently onto modern GPUs.
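To make the two sparsity schemes concrete, here is a minimal NumPy sketch of how each mask could be built from an error matrix `E`. This illustrates the masking logic only, not the paper's code; the `density` parameter and matrix shapes are assumptions for the example.

```python
import numpy as np

def topk_mask(E, density):
    """Saten(u)-style mask: keep the largest-magnitude fraction of E's entries."""
    k = max(1, int(density * E.size))
    thresh = np.partition(np.abs(E).ravel(), -k)[-k]   # k-th largest |entry|
    return np.abs(E) >= thresh

def mask_2to4(E):
    """Saten(2:4)-style mask: keep the 2 largest of every 4 consecutive entries."""
    groups = np.abs(E).reshape(-1, 4)                  # assumes E.size % 4 == 0
    keep = np.argsort(groups, axis=1)[:, 2:]           # indices of the top-2 per group
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(E.shape)

E = np.random.default_rng(0).standard_normal((8, 8))   # stand-in error matrix
S_u  = np.where(topk_mask(E, density=0.25), E, 0.0)    # unstructured correction
S_24 = np.where(mask_2to4(E), E, 0.0)                  # GPU-friendly 2:4 correction
```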
The framework compresses an LLM by folding each weight matrix into a higher-order tensor and applying TT decomposition; the low-rank TT cores capture the bulk of the weight, and the sparse matrix is then fit to the residual error. This combination allows Saten to maintain accuracy while significantly reducing both model size and computational cost.
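The paper does not include reference code for this pipeline; the following NumPy sketch shows the standard TT-SVD construction applied to a folded weight matrix, with illustrative shapes and a uniform maximum rank chosen for the example:

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a d-way tensor into tensor-train cores via sequential truncated SVDs."""
    dims = tensor.shape
    cores, r_prev = [], 1
    unfolding = tensor.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        U, s, Vt = np.linalg.svd(unfolding, full_matrices=False)
        r = min(max_rank, len(s))                       # truncate to the TT rank
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        unfolding = (s[:r, None] * Vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(unfolding.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into the full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

# Fold a stand-in weight matrix into a higher-order tensor, TT-decompose it,
# and keep the residual that the low-rank cores cannot capture.
W = np.random.default_rng(0).standard_normal((768, 768))
T = W.reshape(12, 8, 8, 12, 8, 8)     # illustrative folding, not the paper's exact shapes
cores = tt_svd(T, max_rank=16)
W_tt = tt_reconstruct(cores).reshape(W.shape)
E = W - W_tt                          # error matrix that Saten sparsifies
```

A sparse correction (for instance, one of the masks from the earlier sketch applied to `E`) is then stored alongside the TT cores, so the compressed layer keeps only the cores plus a sparse matrix.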
Experimental Results
The experimental analysis spans two LLM architectures, BERT-Base and LLaMA-3.2-1B, evaluated on benchmarks such as GLUE and SuperGLUE.
- BERT-Base Compression: Saten outperforms TT- and SVD-based baselines. Specifically, Saten(u) and Saten(2:4) reduce model parameters by nearly 50% while surpassing previous methods in accuracy.
- LLaMA-3.2-1B Compression: Saten retains model performance while reducing parameters by over 60%. The Saten(2:4) variant notably maintains accuracy across the evaluated datasets while benefiting from hardware-friendly structured sparsity.
Theoretical and Practical Implications
Theoretically, this work extends the reach of low-rank tensor models by compensating for the high-rank nature of pre-trained LLM parameter spaces: each weight matrix is approximated as the sum of a low-rank TT term and a sparse error term. This demonstrates that blending sparsity and low-rank structure can yield high compression rates without significantly sacrificing model performance.
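As a quick, self-contained illustration of this principle (not one of the paper's experiments), the snippet below shows that a low-rank approximation plus a sparse correction of its residual yields a lower reconstruction error than the low-rank part alone; the rank and density values are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))                 # stand-in high-rank weight
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_lr = (U[:, :32] * s[:32]) @ Vt[:32]               # rank-32 approximation
E = W - W_lr                                        # residual the low-rank part misses
k = int(0.05 * E.size)                              # keep the top 5% of |E|
thresh = np.partition(np.abs(E).ravel(), -k)[-k]
S = np.where(np.abs(E) >= thresh, E, 0.0)           # sparse augmentation
rel = lambda A: np.linalg.norm(A) / np.linalg.norm(W)
print(f"low-rank only: {rel(W - W_lr):.3f}, low-rank + sparse: {rel(W - (W_lr + S)):.3f}")
```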
Practically, Saten presents a viable path for deploying LLMs on devices with limited computational resources. The structured 2:4 sparsity, in particular, facilitates implementation on GPU platforms, suggesting that future hardware optimizations could further enhance runtime efficiency. As AI models continue to grow in size, Saten provides a promising solution for balancing performance with resource constraints.
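For example, recent PyTorch releases (2.1+) can pack a 2:4-sparse weight for NVIDIA's sparse tensor cores. The sketch below is a hedged illustration of how a Saten(2:4)-style weight might be executed this way; it assumes an Ampere-or-newer GPU, fp16 weights, and shapes that satisfy the kernel's alignment constraints, all of which depend on the torch/CUDA build.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# Build a 2:4 pattern (2 nonzeros per 4 consecutive entries) on a stand-in weight.
W = torch.randn(128, 128, dtype=torch.float16, device="cuda")
keep = W.abs().reshape(-1, 4).argsort(dim=1)[:, 2:]          # top-2 per group of 4
mask = torch.zeros(W.numel() // 4, 4, dtype=torch.bool, device="cuda")
mask.scatter_(1, keep, torch.ones_like(keep, dtype=torch.bool))
W_24 = W * mask.reshape(W.shape)

# Pack for the hardware; subsequent matmuls dispatch to sparse kernels.
W_sparse = to_sparse_semi_structured(W_24)
x = torch.randn(128, 128, dtype=torch.float16, device="cuda")
y = torch.mm(W_sparse, x)                                    # sparse x dense matmul
```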
Future Directions
Further work could explore adaptive sparsity patterns or integrate quantization into the Saten framework to push compression further. Additionally, custom hardware support for sparse tensor operations could unlock its potential in real-time applications. As ever-larger LLMs emerge, the Saten framework's architecture-agnostic design positions it for scalability and cross-architecture applicability, giving it a promising outlook across a wide range of AI deployments.