Overview of BTLM-3B-8K
The paper introduces BTLM-3B-8K, an LLM designed to perform at a level comparable to 7 billion parameter models while using only 3 billion parameters, a significant advance in parameter efficiency and model optimization. The model was trained on the SlimPajama dataset of 627 billion tokens and optimized across two context lengths, 2,048 and 8,192 tokens, to improve its ability to model long-range contextual dependencies.
Key Design and Training Strategies
- Architecture and Techniques:
- BTLM-3B-8K is based on the GPT-3 autoregressive transformer decoder architecture but includes specific modifications (short sketches of these components follow this list):
- SwiGLU Activation Function: This replaces GELU to enhance training dynamics.
- ALiBi Position Embeddings: Employed instead of learned position embeddings to enable better extrapolation over unseen sequence lengths during training.
- Maximal Update Parameterization (μP): Utilized to scale hyperparameters effectively and manage activation scaling with model width.
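To make the SwiGLU change concrete, below is a minimal PyTorch sketch of a SwiGLU feed-forward block; the module structure and the 8/3·d hidden width are common conventions assumed here, not details taken from the paper.

```python
# Minimal SwiGLU feed-forward block (PyTorch). The 8/3 * d hidden width is a
# common convention to keep parameter count comparable to a GELU FFN; the
# exact width used by BTLM-3B-8K is an assumption here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model: int, hidden_mult: float = 8 / 3):
        super().__init__()
        hidden = int(d_model * hidden_mult)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, hidden, bias=False)    # value branch
        self.w_down = nn.Linear(hidden, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (Swish(x W_gate) * (x W_up)) W_down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example: a batch of 2 sequences, length 16, model width 256
x = torch.randn(2, 16, 256)
print(SwiGLUFFN(256)(x).shape)  # torch.Size([2, 16, 256])
```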
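Likewise, a minimal sketch of the ALiBi bias, which is added to the attention logits in place of learned position embeddings; the slope schedule follows the original ALiBi formulation for power-of-two head counts, and where exactly BTLM applies it is an assumption shown only in the comment.

```python
# Sketch of ALiBi: a per-head linear bias added to attention scores so that
# more distant keys are penalized; no learned position embeddings are needed,
# which is what allows extrapolation beyond the training context length.
import torch

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric slopes 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper
    # (power-of-two head counts assumed for simplicity).
    start = 2 ** (-8.0 / n_heads)
    return torch.tensor([start ** (i + 1) for i in range(n_heads)])

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # bias[h, i, j] = slope[h] * (j - i) for j <= i (causal attention), i.e. <= 0
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).clamp(max=0)  # shape (T, T)
    return alibi_slopes(n_heads)[:, None, None] * distance  # (H, T, T)

# The bias is simply added to the raw attention logits before softmax, e.g.:
# scores = q @ k.transpose(-2, -1) / head_dim ** 0.5 + alibi_bias(H, T)
print(alibi_bias(n_heads=8, seq_len=4)[0])
```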
- Training and Data:
- The model was trained on the SlimPajama dataset, a deduplicated, quality-filtered subset of the RedPajama dataset, reduced from over a trillion tokens to 627 billion.
- Training proceeded in two phases: an initial phase with 2,048-token contexts followed by a phase with 8,192-token contexts, giving the model broader contextual understanding while keeping training and inference efficient (a schedule sketch follows this list).
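As a rough illustration of this two-phase curriculum, the sketch below expresses it as a simple schedule; the per-phase token split and batch size are placeholders, not figures from the paper.

```python
# Illustrative two-phase curriculum: train most tokens at a 2,048-token context,
# then continue on the remainder at 8,192 tokens. The per-phase token split and
# the batch size below are placeholders, not figures taken from the paper.
from dataclasses import dataclass

@dataclass
class ContextPhase:
    seq_len: int       # tokens per sequence in this phase
    token_budget: int  # total training tokens to consume in this phase

TOTAL_TOKENS = 627_000_000_000  # SlimPajama token count reported in the paper
PHASE1_TOKENS = int(TOTAL_TOKENS * 0.75)  # assumed split for illustration

schedule = [
    ContextPhase(seq_len=2_048, token_budget=PHASE1_TOKENS),
    ContextPhase(seq_len=8_192, token_budget=TOTAL_TOKENS - PHASE1_TOKENS),
]

for phase in schedule:
    steps = phase.token_budget // (phase.seq_len * 512)  # assume 512 sequences per batch
    print(f"context {phase.seq_len:>5}: ~{steps:,} steps")
```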
- Computational Resources and Strategy:
- Training ran on the Condor Galaxy 1 (CG-1) AI supercomputer built from Cerebras CS-2 systems, relying on data parallelism alone to scale training without complex model partitioning.
- Hyperparameters were tuned on smaller proxy models and transferred to the target model size, as sketched below.
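The proxy-model approach relies on μP's width-based scaling rules, sketched below in general form; the base width, learning rate, and target width are chosen purely for illustration and are not the paper's values.

```python
# Sketch of the μP idea: tune hyperparameters on a narrow proxy model, then
# rescale them with the width ratio when moving to the target model. The base
# width and tuned values below are illustrative, not the paper's numbers.
def mup_scaled_hparams(base_width: int, target_width: int,
                       base_lr: float, base_init_std: float):
    m = target_width / base_width  # width multiplier
    return {
        # Hidden-layer (matrix-like) parameters: Adam LR shrinks as 1/m under μP
        "hidden_lr": base_lr / m,
        # Initialization std of hidden weights shrinks as 1/sqrt(m)
        "hidden_init_std": base_init_std / m ** 0.5,
        # Output logits are scaled by 1/m to keep them O(1) as width grows
        "output_logit_scale": 1.0 / m,
    }

# Example: proxy model of width 256 tuned at lr=6e-3, scaled to width 2560
print(mup_scaled_hparams(256, 2560, base_lr=6e-3, base_init_std=0.02))
```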
Evaluation and Results
BTLM-3B-8K was subjected to rigorous evaluation across various benchmarks, covering domains such as common sense reasoning, world knowledge, reading comprehension, and more. The model outperformed existing 3 billion parameter models and demonstrated competitive results against some 7 billion parameter models across these tasks:
- Common Sense Reasoning: Demonstrated superior capabilities across tasks like PIQA and HellaSwag, with notable improvements over peer models.
- Reading Comprehension and World Knowledge: Achieved higher average accuracy than other 3 billion parameter models and remained competitive with larger models.
- Long Context Inference: Outperformed some 7 billion parameter counterparts on tasks requiring understanding and interpolation over long contexts.
Implications and Future Directions
The contributions highlighted in this paper underscore the potential of tailoring training strategies and architectural tweaks to yield models that balance performance with computational and memory efficiency. BTLM-3B-8K's capacity to function effectively on edge devices, requiring just 3GB of RAM with quantization, could enable novel applications and broaden accessibility to AI technologies. This model sets a precedent for further explorations into parameter-efficient architectures, suggesting fertile ground for future innovations in model training optimizations and deployment scenarios.
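A back-of-envelope estimate, assuming roughly 2.6 billion parameters and 4-bit weights, suggests why the 3GB figure is plausible; the parameter count and overhead term are rough assumptions for illustration only.

```python
# Back-of-envelope memory estimate for quantized inference; the parameter count
# and overhead are rough assumptions used only to illustrate why a ~3B-parameter
# model can fit in about 3GB of RAM when quantized.
params = 2.6e9          # approximate parameter count of a "3B" model (assumed)
bits_per_weight = 4     # assumed 4-bit quantization
weights_gb = params * bits_per_weight / 8 / 1e9
overhead_gb = 1.0       # assumed: KV cache, activations, runtime buffers
print(f"weights ≈ {weights_gb:.1f} GB, total ≈ {weights_gb + overhead_gb:.1f} GB")
```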
As AI models continue to scale in size and complexity, the integration of efficiency-focused strategies like those detailed for BTLM-3B-8K could herald new paradigms in sustainable AI development. The methodologies tested here, including mixed-precision training, whose impact shows up as reduced computational load and improved inference speed, could serve as a blueprint for subsequent research and for practical applications of LLMs in resource-constrained environments.