Super Tiny LLMs: Advancements in Parameter Efficiency
The paper under review addresses the escalating computational and energy demands of LLMs, introducing a paradigm shift towards Super Tiny Language Models (STLMs). The overarching objective is to create language models with significantly reduced parameter counts while retaining high performance. Through a range of techniques, the authors target a reduction in parameter counts of 90% to 95% compared to traditional models, aiming at models with 10M, 50M, and 100M parameters. The paper's contributions hold significant implications for both theoretical research and practical applications in NLP.
Parameter Reduction Techniques
The research explores multiple avenues for reducing model parameters without sacrificing performance:
- Weight Tying: This method shares weights between different components of the model, reducing its size while keeping the tied components consistent with one another. Variants include tying the embedding matrix to the output head, sharing FFN blocks across layers, and sharing both FFN and attention blocks. ALBERT, for instance, ties weights across all transformer layers, yielding a compact and efficient model; a minimal sketch of embedding/output-head tying appears after this list.
- Pruning: Inspired by the lottery ticket hypothesis, pruning removes the weights that contribute least to the model's performance, leaving a sparser network. This reduces the parameter count and, with sparse-aware kernels or hardware, the computational cost as well.
- Quantization: By reducing the precision of weights from 32-bit floating-point numbers to lower-bit representations, quantization substantially decreases model size and enhances training/inference speeds with minimal performance degradation.
- Low-Rank Factorization: This technique decomposes large weight matrices into products of smaller matrices, reducing both parameter count and computational cost. Such methods have been successfully employed to compress BERT-like models; a truncated-SVD sketch also follows this list.
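To make weight tying concrete, the following is a minimal PyTorch sketch of tying the token embedding to the output head. The module layout, vocabulary size, and dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy transformer language model illustrating embedding/output-head weight tying."""
    def __init__(self, vocab_size=32_000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: the output projection reuses the embedding matrix,
        # so the vocab_size x d_model block of parameters is stored only once.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids):
        hidden = self.backbone(self.embed(token_ids))
        return self.lm_head(hidden)  # logits over the vocabulary

logits = TiedLM()(torch.randint(0, 32_000, (2, 16)))  # (batch=2, seq=16, vocab)
```

With a 32k vocabulary and a hidden size of 512, the shared matrix alone accounts for roughly 16M parameters, so tying the embedding to the head is a substantial saving relative to a 10M-100M parameter budget.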
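In the same spirit, here is a hedged sketch of low-rank factorization of a trained linear layer via truncated SVD; the layer size, rank, and the helper name factorize_linear are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate W (out x in) as B @ A with A: (rank x in) and B: (out x rank)."""
    W = layer.weight.data                         # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = torch.diag(S[:rank]) @ Vh[:rank]          # (rank, in_features)
    B = U[:, :rank]                               # (out_features, rank)

    down = nn.Linear(layer.in_features, rank, bias=False)
    up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    down.weight.data.copy_(A)
    up.weight.data.copy_(B)
    if layer.bias is not None:
        up.bias.data.copy_(layer.bias.data)
    return nn.Sequential(down, up)                # x -> B @ (A @ x), approximating W @ x

# A 1024x1024 layer holds ~1.05M weights; at rank 64 the factorized pair
# stores 2 * 64 * 1024 = ~131k weights instead.
dense = nn.Linear(1024, 1024)
compact = factorize_linear(dense, rank=64)
```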
Data Quality and Training Efficiency
In addition to direct parameter reduction, the paper emphasizes improving the training signal quality through strategies like data selection and knowledge distillation:
- Data Selection: Improved data quality, such as training on textbooks and refined web data, has been shown to significantly enhance the performance of small LLMs. The Phi series of models is a case in point: a curriculum of carefully curated data allows smaller models to match the performance of much larger counterparts.
- Knowledge Distillation: Here, a smaller "student" model learns from a larger "teacher" model by mimicking the teacher's probability distribution over tokens. Models like DistilBERT have benefited from this approach, retaining most of the parent model's competencies at a fraction of the size; a sketch of the corresponding loss follows this list.
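As a concrete illustration of the distillation objective, the snippet below combines a softened KL term against the teacher's distribution with standard cross-entropy on the ground-truth tokens. The temperature and mixing weight are illustrative assumptions, not values from DistilBERT or the paper.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: the student matches the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                    # rescale gradients for the temperature
    # Hard targets: the usual cross-entropy against the ground-truth tokens.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```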
Proposed Approaches and Research Projects
Byte-Level/Tokenizer-Free Models
The paper proposes a byte-level tokenization scheme with pooling mechanisms, aimed at circumventing the large vocabularies typically required by traditional subword tokenizers. By drastically reducing the parameter count in the embedding and next-token head layers, this approach offers a promising avenue for efficient model architectures.
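The sketch below illustrates the general idea of a byte-level input layer with pooling: bytes are embedded from a 256-entry table and then compressed into patch representations before entering the transformer. The patch size, the convolutional pooling, and the class name ByteInputLayer are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ByteInputLayer(nn.Module):
    def __init__(self, d_model=512, patch_size=4):
        super().__init__()
        # 256 byte values plus one padding id: a tiny embedding table
        # compared with a 30k-100k subword vocabulary.
        self.byte_embed = nn.Embedding(257, d_model, padding_idx=256)
        # Pool every patch_size bytes into a single hidden vector.
        self.pool = nn.Conv1d(d_model, d_model, kernel_size=patch_size,
                              stride=patch_size)

    def forward(self, byte_ids):                  # (batch, seq_len)
        x = self.byte_embed(byte_ids)             # (batch, seq_len, d_model)
        x = self.pool(x.transpose(1, 2))          # (batch, d_model, seq_len / patch)
        return x.transpose(1, 2)                  # (batch, seq_len / patch, d_model)

text = "Super tiny models".encode("utf-8")
ids = torch.tensor([list(text) + [256] * (-len(text) % 4)])  # pad to a multiple of 4
hidden = ByteInputLayer()(ids)                    # shortened sequence of patch vectors
```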
Early Exit and Conditional Computation
Inspired by Mixture of Experts and similar methods, the authors propose techniques that allow different tokens to undergo different amounts of computation. Tokens that do not require deep-layer processing can exit early, saving computational resources.
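Below is a rough sketch of per-token early exit; the confidence-threshold criterion and the reuse of the output head for intermediate predictions are illustrative choices, not necessarily the authors' mechanism. This version only shows the control flow: real savings require routing only the still-active tokens through each block.

```python
import torch
import torch.nn as nn

def forward_with_early_exit(blocks, lm_head, hidden, threshold=0.9):
    # hidden: (batch, seq, d_model); `active` marks tokens still being refined.
    active = torch.ones(hidden.shape[:2], dtype=torch.bool, device=hidden.device)
    for block in blocks:
        updated = block(hidden)
        # Tokens that have exited keep their previous representation.
        hidden = torch.where(active.unsqueeze(-1), updated, hidden)
        # A token exits once its intermediate prediction is confident enough.
        confidence = lm_head(hidden).softmax(dim=-1).max(dim=-1).values
        active &= confidence < threshold
        if not active.any():                      # every token has exited early
            break
    return lm_head(hidden)

d_model, vocab = 64, 1000
blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
                        for _ in range(6)])
lm_head = nn.Linear(d_model, vocab)
logits = forward_with_early_exit(blocks, lm_head, torch.randn(2, 10, d_model))
```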
Next Thought Prediction
This technique decouples reasoning capabilities from language modeling by introducing latent sequences. Such strategies are expected to improve the model's reasoning ability without necessitating direct adaptations of the underlying language model.
Dropout and Learning Rate Scheduling
Exploring dropout scheduling and learning rate schedulers can provide insights into minimizing overfitting and improving training efficiency. Early results suggest these methods are particularly effective during the initial and final phases of training.
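One minimal way to realize such schedules is sketched below, pairing a dropout ramp with a linear-warmup-plus-cosine learning-rate schedule; the ramp shape, warmup length, and hyperparameters are assumptions rather than the schedules evaluated in the paper.

```python
import math
import torch

def dropout_at(step, total_steps, p_max=0.1):
    # Ramp dropout from 0 to p_max over the first 10% of training, then hold,
    # so regularization only kicks in once optimization has stabilized.
    return p_max * min(1.0, step / (0.1 * total_steps))

def lr_lambda(step, warmup=1_000, total_steps=100_000):
    if step < warmup:
        return step / warmup                                  # linear warmup
    progress = (step - warmup) / (total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))         # cosine decay

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Dropout(p=0.0))
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

for step in range(5):                                         # training-loop skeleton
    p = dropout_at(step, total_steps=100_000)
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.p = p                                           # apply the dropout schedule
    # ...forward pass, loss, and backward pass would go here...
    opt.step()
    sched.step()
```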
Curriculums, Data Mixes, and Multimodality
Lastly, the paper underscores the importance of diverse and high-quality training data. Utilizing well-rounded datasets such as the British National Corpus and refined versions of large web crawls can help overcome the limitations posed by smaller corpora.
Implications and Future Directions
The development of STLMs contributes significantly to making high-performance NLP accessible and practical, particularly for researchers with limited computational resources. The ability to train models with fewer than 50M parameters on consumer-grade GPUs in under 48 hours heralds a democratization of AI research. Furthermore, the techniques detailed above offer robust avenues for future exploration in model efficiency, conditional computation, and adaptive learning paradigms.
The potential applications of STLMs are vast, ranging from edge devices to scenarios requiring rapid model iteration and experimentation. The theoretical implications also open new research vistas in compressing and optimizing pre-existing architectures while maintaining or improving performance benchmarks.
In summary, this paper provides a comprehensive framework for developing and refining STLMs, marrying theoretical innovations with practical viability. It sets the stage for the next wave of advancements in NLP, steering the field towards more resource-efficient yet highly capable LLMs.