
Mini-GPTs: Efficient Large Language Models through Contextual Pruning

Published 20 Dec 2023 in cs.CL and cs.AI (arXiv:2312.12682v1)

Abstract: In AI research, the optimization of LLMs remains a significant challenge, crucial for advancing the field's practical applications and sustainability. Building upon the foundational work of Professor Song Han's lab at MIT, this paper introduces a novel approach in developing Mini-GPTs via contextual pruning. Our methodology strategically prunes the computational architecture of traditional LLMs, like Phi-1.5, focusing on retaining core functionalities while drastically reducing model sizes. We employ the technique across diverse and complex datasets, including US law, Medical Q&A, Skyrim dialogue, English-Taiwanese translation, and Economics articles. The results underscore the efficiency and effectiveness of contextual pruning, not merely as a theoretical concept but as a practical tool in developing domain-specific, resource-efficient LLMs. Contextual pruning is a promising method for building domain-specific LLMs, and this research is a building block towards future development with more hardware compute, refined fine-tuning, and quantization.


Summary

  • The paper proposes contextual pruning to shrink LLMs while retaining core, domain-specific functionalities.
  • It details targeted pruning strategies for linear, activation, and embedding layers using normalized L1-norm and token frequency metrics.
  • Evaluations with perplexity and MCQ tests demonstrate significant size reduction with maintained or improved task performance.

Mini-GPTs: Efficient LLMs through Contextual Pruning

The paper "Mini-GPTs: Efficient LLMs through Contextual Pruning" proposes a method to reduce the size of LLMs by applying contextual pruning. This technique focuses on retaining core functionalities of models like Phi-1.5 while significantly decreasing their computational demands, thus maintaining efficiency and domain-specific performance.

Introduction and Motivation

The paper addresses the significant computational and environmental costs associated with LLMs. Model pruning, a well-explored area in model optimization, serves as the foundation for this research. Contextual pruning is a targeted approach that removes non-essential weights from a network, focusing on domain-specific needs like law, healthcare, and finance. This contrasts with broader training datasets, which often include information that is redundant for specialized applications (Figure 1).

Figure 1: Comparison of neuron magnitudes between the Skyrim and healthcare domains.

Methodology

Data Collection and Model Selection

The research utilizes datasets from diverse domains to evaluate pruning impacts. Selected models include GPT-like architectures such as Phi-1.5, OPT-1.3B, and Llama-1.3B. These models are robust across various NLP tasks and come with pre-configured BPE tokenizers on Hugging Face, contributing to their widespread use.

Contextual Pruning Techniques

Contextual pruning was applied primarily to linear layers, activation layers, and embeddings, each targeting specific model components.

Linear Layer Pruning

The linear layers are pruned by analyzing neuron outputs and calculating the normalized L1-norm of each neuron for dataset-specific pruning (Figure 2). This method identifies neurons that are underutilized on the target domain and enables targeted pruning.

Figure 2: Linear Layer Pruning.
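The idea can be illustrated with a minimal NumPy sketch. The helper name, calibration shapes, and threshold value below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def prune_linear_layer(weight, activations, threshold=1e-3):
    """Illustrative contextual pruning of a linear layer.

    weight:       (out_features, in_features) weight matrix
    activations:  (n_samples, out_features) neuron outputs collected on a
                  domain-specific calibration set
    threshold:    neurons whose normalized L1-norm falls below this are removed
    """
    # Average absolute output per neuron over the calibration set (L1-norm)
    l1 = np.abs(activations).mean(axis=0)
    # Normalize so importance scores are comparable across layers
    score = l1 / l1.max()
    keep = score >= threshold
    # Dropping rows removes the output neurons deemed unimportant for this domain
    return weight[keep], keep
```

A neuron that is consistently near zero on the domain's calibration data contributes little to that domain's predictions, so removing its row shrinks the layer with minimal loss.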

Activation Layer Pruning

For activation layers, the method considers only the activation outputs to determine non-essential neurons. As with linear layers, if a neuron's normalized L1-norm falls below the threshold, it is pruned from the prior layer's transposed weight matrix (Figure 3).

Figure 3: Activation Layer Pruning.
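The same scoring can be applied on the transposed weight matrix of the preceding layer, where each column corresponds to one activation neuron. This is a hypothetical sketch, with names and threshold chosen for illustration:

```python
import numpy as np

def prune_activation_layer(weight_T, act_outputs, threshold=1e-3):
    """Illustrative activation-layer pruning.

    weight_T:    (in_features, out_features) transpose of the feeding layer's
                 weights; column j corresponds to activation neuron j
    act_outputs: (n_samples, out_features) post-activation outputs gathered
                 on a domain calibration set
    """
    # Normalized L1-norm per activation neuron
    score = np.abs(act_outputs).mean(axis=0)
    score = score / score.max()
    keep = score >= threshold
    # Dropping a column of the transposed matrix removes the neuron whose
    # activation was negligible on the target domain
    return weight_T[:, keep], keep
```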

Embedding Layer Pruning

Embedding layers undergo pruning based on token frequency within each dataset. This necessitates large calibration sets to ensure accurate pruning decisions, as smaller datasets might not reflect true token necessity (Figure 4).

Figure 4: Embedding Layer Pruning.
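Frequency-based embedding pruning amounts to counting token occurrences on the calibration corpus and dropping rows for tokens that never (or rarely) appear. A minimal sketch, with the helper name and `min_count` parameter as assumptions:

```python
from collections import Counter
import numpy as np

def prune_embeddings(embedding, token_ids, min_count=1):
    """Illustrative token-frequency embedding pruning.

    embedding: (vocab_size, dim) embedding matrix
    token_ids: flat sequence of token ids from a large domain calibration set
    min_count: tokens seen fewer than this many times are pruned
    """
    counts = Counter(token_ids)
    keep = np.array([counts[i] >= min_count for i in range(embedding.shape[0])])
    # Map surviving old vocabulary ids to compacted ids in the pruned table
    id_map = {old: new for new, old in enumerate(np.flatnonzero(keep))}
    return embedding[keep], id_map
```

The id map is needed because downstream lookups must be re-indexed into the smaller table; this is also why small calibration sets are risky, since a rare but necessary token may simply not occur in them.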

Evaluation and Results

Perplexity and MCQ Testing

Perplexity, a standard metric for evaluating LLMs, and MCQ testing were employed to assess performance post-pruning. The experiments showed improved or maintained perplexity and competitive MCQ performance, indicating an effective trade-off between model size and task performance.
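As a reminder of the metric, perplexity is the exponentiated average negative log-likelihood of the evaluation tokens, PPL = exp(-mean(log p)). A minimal stdlib sketch:

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token log-probabilities (natural log).

    Lower is better: a model assigning probability 1 to every token
    has perplexity 1; uniform guessing over k options gives k.
    """
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)
```

For example, a model that assigns probability 0.25 to each evaluation token has perplexity 4, equivalent to uniform uncertainty over four choices.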

Large Pruning Threshold Exploration

Exploration into larger pruning thresholds (10^{-1}) provided insight into the limits of pruning. Significant size reduction was achieved, but results suggested potential overfitting, necessitating further study to mitigate such effects.

Conclusion and Future Work

Contextual pruning as a methodology for developing Mini-GPTs showcases promising results in maintaining domain-specific performance with reduced resource usage. Future research will focus on several key areas:

  • Max Neuron Magnitude Pruning: Exploring pruning based on neuron magnitude thresholds.
  • Larger Dataset Fine-Tuning: Mitigating overfitting with larger datasets to enhance model robustness.
  • Integration with Optimization Techniques: Combining pruning with quantization for better performance.
  • Expansion to More Models: Applying the approach to newer models like Phi-2 to assess its generality.

This research lays a foundation for more sustainable and efficient LLM applications, particularly beneficial for industries requiring on-prem installations, such as healthcare and gaming.
