
Falcon-180B: 180B-Param Transformer LLM

Updated 4 June 2026
  • Falcon-180B is a 180 billion-parameter causal Transformer language model featuring 80 layers and multigroup attention for optimized performance.
  • It is pretrained on approximately 3.5 trillion tokens from diverse, high-quality sources, achieving competitive results on benchmarks like HellaSwag and HumanEval.
  • Its training leverages advanced 3D parallelism and memory optimizations, reducing pretraining costs and democratizing access to frontier LLM research.

Falcon-180B is a 180 billion-parameter causal decoder-only Transformer LLM and the largest member of the Falcon series of open LLMs. Trained on approximately 3.5 trillion tokens of predominantly web-sourced text, it represents the largest openly documented pretraining run to date. It nearly matches PaLM-2-Large and falls between GPT-3.5 and GPT-4 on reported benchmarks, while being available under a permissive “Responsible-Use” license. The model substantially outperforms earlier LLMs such as Chinchilla, Gopher, and LLaMA-2 and demonstrates competitive results on a broad suite of language understanding, commonsense reasoning, and coding benchmarks (Almazrouei et al., 2023).

1. Model Architecture

Falcon-180B implements a causal decoder-only Transformer architecture with a total parameter count $P = 180 \times 10^{9}$, distributed across $L = 80$ layers (Transformer blocks) with a hidden size $d_{\text{model}} = 14{,}848$. The model uses a multigroup attention mechanism (an extension of multiquery attention) with $n_{q} = 232$ query heads and $n_{kv} = 8$ key/value heads, which improves memory efficiency and performance at scale. Rotary positional embeddings (RoPE) encode position information.
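To make the effect of sharing key/value heads concrete, the following sketch compares the per-token key-value cache under the stated head counts with a hypothetical layout in which every query head had its own key/value head. The 2-byte (bfloat16) cache entries and the head dimension derived as $d_{\text{model}} / n_q$ are assumptions for illustration, not figures taken from the paper.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key/value cache that one token occupies across all layers."""
    # The factor 2 covers keys and values; each is n_kv_heads * head_dim wide per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


D_MODEL, N_Q, N_KV, N_LAYERS = 14_848, 232, 8, 80
HEAD_DIM = D_MODEL // N_Q  # assumption: query heads split d_model evenly

multigroup = kv_cache_bytes_per_token(N_LAYERS, N_KV, HEAD_DIM)
multihead = kv_cache_bytes_per_token(N_LAYERS, N_Q, HEAD_DIM)  # hypothetical baseline

print(f"multigroup: {multigroup:,} bytes/token")
print(f"multi-head: {multihead:,} bytes/token")
print(f"reduction:  {multihead / multigroup:.0f}x")
```

Under these assumptions the reduction factor is simply the ratio $n_q / n_{kv}$, which is why the saving grows with the number of query heads that share each key/value head.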

The feed-forward sublayers are two-layer MLPs with GeLU activation (no SwiGLU variant), and biases are removed from all linear layers. The attention and MLP blocks of each layer are computed in parallel to minimize synchronization overhead under tensor-parallel execution. Notably, the memory footprint of the key-value cache scales as $O(\log P)$, improving over the $O(\sqrt{P} \cdot \log n)$ of prior approaches, and the empirically optimal depth is observed to scale approximately in proportion to $\log P$.
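As a rough consistency check on the architecture figures, the parameter count can be estimated from the layer count, hidden size, and head counts. This is a sketch under stated assumptions: a 4x MLP expansion, a vocabulary of 65,024 tokens with tied input/output embeddings, and no layer-norm parameters. None of these three values appears in the text above, so treat them as illustrative.

```python
def estimate_params(n_layers, d_model, n_q, n_kv, vocab_size, mlp_ratio=4):
    """Approximate parameter count of a bias-free decoder with grouped K/V heads."""
    head_dim = d_model // n_q
    attn = d_model * (n_q * head_dim)         # query projection
    attn += 2 * d_model * (n_kv * head_dim)   # key and value projections (shared heads)
    attn += (n_q * head_dim) * d_model        # output projection
    mlp = 2 * d_model * (mlp_ratio * d_model) # two-layer MLP, up and down projections
    embed = vocab_size * d_model              # tied embeddings counted once (assumption)
    return n_layers * (attn + mlp) + embed


total = estimate_params(
    n_layers=80, d_model=14_848, n_q=232, n_kv=8,
    vocab_size=65_024,  # assumed, not stated in the text
)
print(f"{total / 1e9:.1f}B parameters")
```

The estimate lands within about one percent of the nominal 180 billion, which suggests the quoted layer count, hidden size, and head counts are mutually consistent.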

2. Pretraining Dataset and Data Refinement

Falcon-180B is pretrained on $T \approx 3.5 \times 10^{12}$ tokens drawn from a high-quality, diverse corpus. The primary data source is the RefinedWeb English web crawl, accounting for 76% of tokens (approximately 2.7T tokens), with additional coverage from RefinedWeb in European languages (8%), Project Gutenberg books (6%), conversation data such as Reddit and StackOverflow (5%), permissively licensed code from GitHub (3%), and technical domains including arXiv, PubMed, USPTO, and Wikipedia (2%).
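The mixture percentages translate into absolute token budgets as follows. The sketch only restates the shares quoted above; the dictionary keys are shorthand labels chosen here, not names from the paper.

```python
TOTAL_TOKENS = 3.5e12  # T, approximate total pretraining tokens

# Shares as quoted in the text; labels are informal shorthand.
MIXTURE = {
    "RefinedWeb (English)": 0.76,
    "RefinedWeb (European languages)": 0.08,
    "Books": 0.06,
    "Conversations": 0.05,
    "Code": 0.03,
    "Technical": 0.02,
}

tokens = {name: share * TOTAL_TOKENS for name, share in MIXTURE.items()}

for name, count in tokens.items():
    print(f"{name:<32} {count / 1e9:>8.0f}B tokens")
```

The shares sum to 100%, and the English web share works out to roughly 2.66T tokens, matching the "approximately 2.7T" figure.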

The RefinedWeb macrodata refinement pipeline consists of three stages: (1) URL-based blocklisting and language identification, after which 48% of the content remains as English text; (2) heuristic filtering on text quality, length, and symbol ratios, which halves the dataset; and (3) two-stage deduplication (MinHash followed by suffix-array methods), which halves the corpus again, so that only 12% of the original material is retained ($0.48 \times 0.5 \times 0.5 = 0.12$). Microdata mixes in curated subsets of the Pile and Reddit data, with conversation trees encoded via custom attention masks to cover all branches without repetition.
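The MinHash stage of the deduplication step can be illustrated with a small, self-contained sketch. This is not the RefinedWeb implementation: the word-level shingles, the number of hash functions, and the use of salted BLAKE2b as the hash family are all choices made here for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """Keep the minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt = distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
sim = estimated_jaccard(minhash_signature(shingles(doc_a)),
                        minhash_signature(shingles(doc_b)))
print(f"estimated similarity: {sim:.2f}")
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; exact substring matching, such as the suffix-array stage, then handles repeated spans that fuzzy matching misses.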

A 600B-token extract of RefinedWeb is released for reproducibility and community research.

3. Training Methodology and Compute Infrastructure

Training was conducted on a GPU cluster of up to 4,096 NVIDIA A100 40GB GPUs hosted on AWS, interconnected with 50 Gbps bandwidth. Parallelism is implemented in a 3D configuration: tensor parallelism (TP=8), pipeline parallelism (PP=8), and data parallelism (DP=64), augmented with ZeRO-1 optimizer sharding. The model processes input sequences of length 2,048 tokens.
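The 3D layout can be sketched as a mapping from a flat GPU rank to data-, pipeline-, and tensor-parallel coordinates. The rank ordering (tensor-parallel index varying fastest) and the even split of layers across pipeline stages are assumptions for illustration; the text above does not specify either.

```python
TP, PP, DP = 8, 8, 64
N_LAYERS = 80


def rank_to_coords(rank, tp=TP, pp=PP, dp=DP):
    """Map a flat rank to (dp_idx, pp_idx, tp_idx), with TP varying fastest (assumed)."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError(f"rank {rank} outside world size {tp * pp * dp}")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx


def layers_for_stage(pp_idx, n_layers=N_LAYERS, pp=PP):
    """Layer indices held by one pipeline stage, assuming an even split."""
    per_stage = n_layers // pp
    return range(pp_idx * per_stage, (pp_idx + 1) * per_stage)


world_size = TP * PP * DP
print(world_size)                   # total GPUs implied by the three degrees
print(rank_to_coords(4095))         # last rank
print(list(layers_for_stage(0)))    # layers on the first pipeline stage
```

The product of the three degrees reproduces the 4,096-GPU cluster size, and under an even split each pipeline stage holds 10 of the 80 layers.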

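The total compute budget is approximately 43,500 PF-days ($\approx 3.8 \times 10^{24}$ FLOPs), consistent with the standard dense-Transformer estimate

$$C \approx 6\,P\,T = 6 \times (180 \times 10^{9}) \times (3.5 \times 10^{12}) \approx 3.8 \times 10^{24}\ \text{FLOPs}.$$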
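The relationship between the parameter count, the token count, and the PF-day figure can be checked numerically. The sketch uses the common $6PT$ approximation for dense Transformer training compute, which is an assumption here rather than a formula confirmed by the text; one PF-day is $10^{15}$ FLOP/s sustained for 86,400 seconds.

```python
P = 180e9    # parameters
T = 3.5e12   # training tokens
FLOPS_PER_PF_DAY = 1e15 * 86_400  # one petaFLOP/s sustained for a day


def training_flops(params, tokens):
    """Common approximation: ~6 FLOPs per parameter per token (forward + backward)."""
    return 6 * params * tokens


flops = training_flops(P, T)
pf_days = flops / FLOPS_PER_PF_DAY
print(f"{flops:.2e} FLOPs, about {pf_days:,.0f} PF-days")
```

The result is close to the quoted 43,500 PF-days, which corresponds to a total on the order of $10^{24}$ FLOPs.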

Batch and learning-rate schedules involve micro-batches of 2,048 tokens per GPU, a warm-up over 4B tokens to the peak learning rate, a batch-size ramp-up over the first 100B tokens for data-parallel efficiency, a cosine decay schedule, weight decay of 0.1, z-loss regularization, and gradient clipping at 0.4.
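A warm-up followed by cosine decay, together with gradient clipping, can be sketched as below. The schedule returns a multiplier of the peak learning rate rather than an absolute value. The linear warm-up shape, the final ratio `min_ratio=0.1`, and clipping by global L2 norm are illustrative assumptions, not values or choices confirmed by the text.

```python
import math


def lr_multiplier(tokens_seen, warmup_tokens=4e9, total_tokens=3.5e12, min_ratio=0.1):
    """Fraction of the peak learning rate after `tokens_seen` tokens.

    Linear warm-up (assumed shape), then cosine decay down to `min_ratio`
    (a hypothetical placeholder value).
    """
    if tokens_seen < warmup_tokens:
        return tokens_seen / warmup_tokens
    progress = min(1.0, (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine


def clip_by_global_norm(grads, max_norm=0.4):
    """Rescale a flat gradient list so its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


for seen in (0, 2e9, 4e9, 1.75e12, 3.5e12):
    print(f"{seen:.2e} tokens -> {lr_multiplier(seen):.3f} x peak")
print(clip_by_global_norm([3.0, 4.0]))
```

Expressing the schedule in tokens rather than steps keeps it independent of the batch-size ramp-up, since the number of tokens per step changes early in training.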

Memory optimizations include FlashAttention (implemented in Triton) for exact attention, selective (“monolayer”) recomputation of layer norms and activations, which halves activation memory, and bfloat16 training without stochastic rounding.

4. Distributed Training Tooling and Software Engineering

The custom distributed codebase “Gigatron” is built on PyTorch and integrates 3D parallelism with ZeRO-1 optimizer sharding. Communication optimizations include column- and row-wise tensor-parallel splitting of the attention and MLP weights to keep all-reduce operations to a minimum, pipeline parallelism with 1F1B scheduling and scatter/gather improvements, and a data-parallel workflow in which the gradient all-reduce is carried out as a reduce_scatter, followed by the optimizer step and an all_gather of the parameters.
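The reduce_scatter, optimizer step, all_gather sequence can be simulated in a single process to show why it matches an ordinary data-parallel update. This is a toy model with plain Python lists standing in for collectives, and plain SGD on the averaged gradient standing in for AdamW; it is not Gigatron code, and the function names are hypothetical.

```python
def reduce_scatter(grads_per_rank):
    """Rank r receives the element-wise sum of shard r across all ranks."""
    dp = len(grads_per_rank)
    shard = len(grads_per_rank[0]) // dp  # assumes length divisible by dp
    return [
        [sum(g[i] for g in grads_per_rank) for i in range(r * shard, (r + 1) * shard)]
        for r in range(dp)
    ]


def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    return [x for shard in shards for x in shard]


def zero1_step(params, grads_per_rank, lr=0.5):
    """One sharded update: each rank updates only the parameter shard it owns."""
    dp = len(grads_per_rank)
    shard = len(params) // dp
    grad_shards = reduce_scatter(grads_per_rank)
    new_shards = []
    for r in range(dp):
        owned = params[r * shard:(r + 1) * shard]
        # SGD on the averaged gradient stands in for the real optimizer.
        new_shards.append([p - lr * g / dp for p, g in zip(owned, grad_shards[r])])
    return all_gather(new_shards)


params = [1.0, 2.0, 3.0, 4.0]
grads_per_rank = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]  # two DP ranks
print(zero1_step(params, grads_per_rank))
```

Because each rank only needs optimizer state for its own shard, the moment buffers shrink by the data-parallel degree while the communication volume stays comparable to a single all-reduce.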

Kernels for FlashAttention and fused RoPE/LayerNorm are implemented with Triton, and run management supports topology-agnostic checkpoints, automated node health testing, and bandwidth-based placement using Gromov-Wasserstein optimal transport. AdamW is employed as the optimizer, using bfloat16 for weights and gradients and fp32 for moments.

5. Empirical Performance and Evaluation

Falcon-180B achieves competitive performance on a range of established NLP benchmarks. On PaLM-style 1-shot GPT-3 tasks, it attains 77.1% average accuracy, nearly matching PaLM-2-Large (77.5%; Falcon-180B reaches 99.5% of that score). On the few-shot tasks reported for GPT-4, the model scores 89.0 on HellaSwag (GPT-3.5: 85.5; GPT-4: 95.3), 87.1 on Winogrande (GPT-3.5: 81.6; GPT-4: 87.5), 87.8 on ARC-Challenge (GPT-3.5: 85.2; GPT-4: 96.3), and 70.6 on 5-shot MMLU (GPT-3.5: 70.0; GPT-4: 86.5).

Falcon-180B outperforms Chinchilla, Gopher, and LLaMA-2 on commonsense and QA (e.g., HellaSwag 89.0, ARC-Ch 63.7), and matches leading models in code generation (HumanEval pass@1: 35.4%, comparable to Inflection-1 and PaLM-Coder at 35.9%). On the EleutherAI Harness aggregate (HellaSwag, LAMBADA, Winogrande, PIQA, ARC, OpenBookQA), Falcon models lead within each size class.

6. Licensing, Dataset Release, and Open-Science Commitment

Falcon-180B is distributed under a “Responsible-Use” license, which carries downstream restrictions, while Falcon-7B and Falcon-40B are released under Apache-2.0. The 600B-token extract of RefinedWeb is available under a permissive license. A principal aim is to democratize access to frontier LLM research, foster reproducibility, and enable community-driven model improvements.

7. Key Formulas and Scaling Properties

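Critical quantities in Falcon-180B’s system design include the total parameter count, $P = 180 \times 10^{9}$, and the total training compute in FLOPs, for which the standard dense-Transformer approximation is

$$C \approx 6\,P\,T,$$

with $T$ the number of training tokens.

The memory per parameter for AdamW covers the bfloat16 weights and gradients plus the fp32 moments, and ZeRO-1 sharding divides the optimizer-state portion across the data-parallel (DP) group.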
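A per-parameter memory budget can be worked out from the precision choices described earlier (bfloat16 weights and gradients, fp32 AdamW moments). The sketch assumes no separate fp32 master copy of the weights and that only the two moment buffers are sharded across the data-parallel group; both are assumptions, and the resulting numbers are illustrative rather than the paper's.

```python
BF16_BYTES, FP32_BYTES = 2, 4


def bytes_per_param(dp=1):
    """Model-state bytes per parameter with optimizer moments sharded over `dp` ranks."""
    weights = BF16_BYTES
    grads = BF16_BYTES
    moments = 2 * FP32_BYTES  # AdamW first and second moments
    return weights + grads + moments / dp


P, TP, PP, DP = 180e9, 8, 8, 64

unsharded = bytes_per_param(dp=1)
sharded = bytes_per_param(dp=DP)
# Weights are additionally split across tensor- and pipeline-parallel ranks.
per_gpu_gb = P * sharded / (TP * PP) / 1e9

print(f"unsharded: {unsharded} bytes/param")
print(f"ZeRO-1 with DP={DP}: {sharded} bytes/param")
print(f"model state per GPU: {per_gpu_gb:.1f} GB")
```

Under these assumptions the model state per GPU stays well below the 40 GB of an A100, leaving the remainder for activations and communication buffers.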

These scaling and efficiency properties drive Falcon-180B’s ability to provide high-resource LLM capabilities with reduced pretraining and inference cost relative to comparable proprietary models (Almazrouei et al., 2023).
