SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling (2312.15166v3)
Abstract: We introduce SOLAR 10.7B, an LLM with 10.7 billion parameters, demonstrating superior performance in various NLP tasks. Inspired by recent efforts to efficiently up-scale LLMs, we present a method for scaling LLMs called depth up-scaling (DUS), which encompasses depthwise scaling and continued pretraining. In contrast to other LLM up-scaling methods that use mixture-of-experts, DUS does not require complex changes for efficient training and inference. We show experimentally that DUS is simple yet effective in scaling up high-performance LLMs from small ones. Building on the DUS model, we additionally present SOLAR 10.7B-Instruct, a variant fine-tuned for instruction-following capabilities, surpassing Mixtral-8x7B-Instruct. SOLAR 10.7B is publicly available under the Apache 2.0 license, promoting broad access and application in the LLM field.
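The abstract summarizes depth up-scaling (DUS) without spelling out the splicing step. Below is a minimal sketch of the depthwise-scaling half of DUS, assuming the configuration reported in the paper (a 32-layer Mistral-7B-style base, with m = 8 layers trimmed from each copy to yield 48 layers); the model name, constants, and splicing code are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of depthwise scaling: duplicate a 32-layer base model, drop the last
# m layers from one copy and the first m layers from the other, then
# concatenate to obtain a deeper (48-layer) model. Illustrative only.
import copy

import torch.nn as nn
from transformers import AutoModelForCausalLM

BASE_MODEL = "mistralai/Mistral-7B-v0.1"  # assumed 32-layer base
M = 8                                     # layers trimmed from each copy

base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
layers = base.model.layers                # nn.ModuleList of decoder blocks
n = len(layers)                           # 32 for this base

# First copy keeps layers [0, n - M); the duplicate keeps layers [M, n).
top = [layers[i] for i in range(n - M)]
bottom = [copy.deepcopy(layers[i]) for i in range(M, n)]

# Concatenate the two trimmed copies: 2 * (n - M) = 48 layers in total.
base.model.layers = nn.ModuleList(top + bottom)
base.config.num_hidden_layers = len(base.model.layers)

# Per the DUS recipe, the spliced model is then continued-pretrained to
# recover the performance lost to the layer surgery.
```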