OLMoE: Open Mixture-of-Experts Language Models (2409.02060v2)
Published 3 Sep 2024 in cs.CL, cs.AI, and cs.LG
Abstract: We introduce OLMoE, a fully open, state-of-the-art LLM leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
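The abstract's headline property (7 billion total parameters, roughly 1B active per input token) comes from sparse expert routing: a learned router sends each token's feed-forward computation to only a small subset of expert networks. The snippet below is a minimal sketch of a generic top-k MoE feed-forward layer in PyTorch for illustration; the class name, hyperparameters, and activation choice are assumptions, not the released OLMoE implementation.

```python
# Illustrative top-k mixture-of-experts (MoE) feed-forward layer.
# Hypothetical sketch: names and hyperparameters do not reflect OLMoE's actual code or config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # learned token-to-expert router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)                    # (n_tokens, n_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)        # keep only top_k experts per token
        topk_probs = topk_probs / topk_probs.sum(-1, keepdim=True)   # renormalize routing weights

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Which tokens routed to expert e, and in which of their top_k slots.
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # this expert is inactive for the whole batch
            out[token_ids] += topk_probs[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

layer = SparseMoELayer()
tokens = torch.randn(16, 1024)   # 16 token embeddings
print(layer(tokens).shape)       # torch.Size([16, 1024])
```

Because only `top_k` of the `n_experts` expert FFNs run for each token, per-token compute scales with the active parameters rather than the total parameter count, which is the property that lets OLMoE-1B-7B hold 7B parameters while spending only about 1B per input token.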
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone.
- Yi: Open Foundation Models by 01.AI.
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.
- A Survey on Data Selection for Language Models.
- SantaCoder: don’t reach for the stars!
- SmolLM - blazingly fast and remarkably powerful.
- Zeyuan Allen-Zhu and Yuanzhi Li. 2024. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws.
- The Falcon Series of Open Language Models.
- PaLM 2 Technical Report.
- Efficient Large Scale Language Modeling with Mixtures of Experts.
- Llemma: An Open Language Model For Mathematics.
- Layer Normalization.
- Qwen Technical Report.
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond.
- Constitutional AI: Harmlessness from AI Feedback.
- Stable LM 2 1.6B Technical Report.
- Conditional Computation in Neural Networks for faster models.
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.
- Lessons from the Trenches on Reproducible Evaluation of Language Models.
- PIQA: Reasoning about Physical Commonsense in Natural Language.
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model.
- GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow.
- The Foundation Model Transparency Index.
- Language Models are Few-Shot Learners.
- Tianle Cai. 2023. Mixtral from Mistral.
- InternLM2 Technical Report.
- Generative pretraining from pixels.
- Evaluating Large Language Models Trained on Code.
- Soumith Chintala. 2024. GPT-4 MoE.
- PaLM: Scaling Language Modeling with Pathways.
- Deep reinforcement learning from human preferences.
- Unified Scaling Laws for Routed Language Models.
- BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.
- Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.
- Training Verifiers to Solve Math Word Problems.
- Together Computer. 2023. RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset.
- MoEUT: Mixture-of-Experts Universal Transformers.
- UltraFeedback: Boosting Language Models with High-quality Feedback.
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models.
- Databricks. 2024. DBRX.
- Language Modeling with Gated Convolutional Networks.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.
- Scaling Vision Transformers to 22 Billion Parameters.
- Universal Transformers.
- Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster.
- PaLM-E: An Embodied Multimodal Language Model.
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts.
- Tricks for Training Sparse Translation Models.
- The Llama 3 Herd of Models.
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.
- Learning Factored Representations in a Deep Mixture of Experts.
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding.
- KTO: Model Alignment as Prospect Theoretic Optimization.
- CroissantLLM: A Truly Bilingual French-English Language Model.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.
- Language models scale reliably with over-training and on downstream tasks.
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling.
- A framework for few-shot language model evaluation.
- ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools.
- SemEval-2012 Task 7: Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning.
- Catwalk: A Unified Language Model Evaluation Framework for Many Datasets.
- OLMo: Accelerating the Science of Language Models.
- Hard Mixtures of Experts for Large Scale Weakly Supervised Vision.
- OLMES: A Standard for Language Model Evaluations.
- Textbooks Are All You Need.
- Xu Owen He. 2024. Mixture of A Million Experts.
- Measuring Massive Multitask Language Understanding.
- Measuring Mathematical Problem Solving With the MATH Dataset.
- Training Compute-Optimal Large Language Models.
- ORPO: Monolithic Preference Optimization without Reference Model.
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies.
- Music Transformer.
- Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2.
- Sparse is Enough in Scaling Transformers.
- Mistral 7B.
- Mixtral of Experts.
- Scaling Laws for Neural Language Models.
- Andrej Karpathy. 2024. LLM model size competition is intensifying… backwards!
- The hateful memes challenge: Competition report.
- Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization.
- The Stack: 3 TB of permissively licensed source code.
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints.
- Scaling Laws for Fine-Grained Mixture of Experts.
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.
- BASE Layers: Simplifying Training of Large, Sparse Models.
- DataComp-LM: In search of the next generation of training sets for language models.
- Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models.
- StarCoder: may the source be with you!
- AlpacaEval: An Automatic Evaluator of Instruction-following Models.
- Textbooks Are All You Need II: phi-1.5 technical report.
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts.
- Holistic Evaluation of Language Models.
- Jamba: A Hybrid Transformer-Mamba Language Model.
- MoE-LLaVA: Mixture of Experts for Large Vision-Language Models.
- TruthfulQA: Measuring How Models Mimic Human Falsehoods.
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts.
- RegMix: Data Mixture as Regression for Language Model Pre-training.
- Routers in Vision Mixture of Experts: An Empirical Study.
- LLM360: Towards Fully Transparent Open-Source LLMs.
- The Flan Collection: Designing Data and Methods for Effective Instruction Tuning.
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI.
- Consent in Crisis: The Rapid Decline of the AI Data Commons.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization.
- SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages.
- StarCoder 2 and The Stack v2: The Next Generation.
- FinGPT: Large Generative Models for a Small Language.
- Paloma: A Benchmark for Evaluating Language Model Fit.
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.
- OpenELM: An Efficient Language Model Family with Open Training and Inference Framework.
- SimPO: Simple Preference Optimization with a Reference-Free Reward.
- Pointer Sentinel Mixture Models.
- Mixed Precision Training.
- Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.
- Cross-Task Generalization via Natural Language Crowdsourcing Instructions.
- Niklas Muennighoff. 2020. Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes.
- OctoPack: Instruction Tuning Code Large Language Models.
- Scaling Data-Constrained Language Models.
- Generative Representational Instruction Tuning.
- Crosslingual Generalization through Multitask Finetuning.
- Soft Merging of Experts with Adaptive Routing.
- Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts.
- Nemotron-4 340B Technical Report.
- GPT-4 Technical Report.
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models.
- Nemotron-4 15B Technical Report.
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text.
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only.
- RWKV: Reinventing RNNs for the Transformer Era.
- Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence.
- Ofir Press and Lior Wolf. 2017. Using the Output Embedding to Improve Language Models.
- Robust Speech Recognition via Large-Scale Weak Supervision.
- Language models are unsupervised multitask learners.
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
- No Robots.
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale.
- ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models.
- M2D2: A Massively Multi-domain Language Modeling Dataset.
- PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing.
- Hash Layers For Large Sparse Models.
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models.
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale.
- Multitask Prompted Training Enables Zero-Shot Task Generalization.
- SocialIQA: Commonsense Reasoning about Social Interactions.
- What Language Model to Train if You Have One Million GPU Hours?
- Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need.
- Noam Shazeer. 2020. GLU Variants Improve Transformer.
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.
- Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models.
- Scaling Vision-Language Models with Sparse Mixture of Experts.
- JetMoE: Reaching Llama2 Performance with 0.1M Dollars.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning.
- Snowflake. 2024a. Snowflake Arctic Cookbook Series: Exploring Mixture of Experts (MoE).
- Snowflake. 2024b. Snowflake Arctic: The Best LLM for Enterprise AI — Efficiently Intelligent, Truly Open.
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.
- Luca Soldaini and Kyle Lo. 2023. peS2o (Pretraining Efficiently on S2ORC) Dataset.
- KMMLU: Measuring Massive Multitask Language Understanding in Korean.
- RoFormer: Enhanced Transformer with Rotary Position Embedding.
- VL-BERT: Pre-training of Generic Visual-Linguistic Representations.
- Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM.
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them.
- CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge.
- Sparse Universal Transformer.
- Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies.
- Chameleon Team. 2024a. Chameleon: Mixed-Modal Early-Fusion Foundation Models.
- Gemini: A Family of Highly Capable Multimodal Models.
- Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.
- Gemma: Open Models Based on Gemini Research and Technology.
- Gemma 2: Improving Open Language Models at a Practical Size.
- Jamba-1.5: Hybrid Transformer-Mamba Models at Scale.
- MosaicML NLP Team. 2023. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs.
- Qwen Team. 2024b. Qwen1.5-MoE: Matching 7B Model Performance with 1/3 Activated Parameters.
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models.
- LLaMA: Open and Efficient Foundation Language Models.
- Llama 2: Open Foundation and Fine-Tuned Chat Models.
- Zephyr: Direct Distillation of LM Alignment.
- Attention Is All You Need.
- Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model.
- OpenDevin: An Open Platform for AI Software Developers as Generalist Agents.
- How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources.
- HelpSteer2: Open-source dataset for training top-performing reward models.
- Finetuned Language Models Are Zero-Shot Learners.
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models.
- Crowdsourcing Multiple Choice Science Questions.
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts.
- Yuan 2.0-M32: Mixture of Experts with Attention Router.
- xAI. 2024. Open Release of Grok-1.
- C-Pack: Packaged Resources To Advance General Chinese Embedding.
- Benchmark Data Contamination of Large Language Models: A Survey.
- OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models.
- Baichuan 2: Open Large-scale Language Models.
- Qwen2 Technical Report.
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.
- BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models.
- Toward Inference-optimal Mixture-of-Expert Large Language Models.
- Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning.
- HellaSwag: Can a Machine Really Finish Your Sentence?
- Biao Zhang and Rico Sennrich. 2019. Root Mean Square Layer Normalization.
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series.
- TinyLlama: An Open-Source Small Language Model.
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts.
- OPT: Open Pre-trained Transformer Language Models.
- PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.
- OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement.
- Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training.
- LIMA: Less Is More for Alignment.
- Instruction-Following Evaluation for Large Language Models.
- Brainformers: Trading Simplicity for Efficiency.
- Mixture-of-Experts with Expert Choice Routing.
- Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models.
- ST-MoE: Designing Stable and Transferable Sparse Expert Models.
- Taming Sparsely Activated Transformer with Stochastic Experts.
- Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model.