SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling (2312.15166v3)

Published 23 Dec 2023 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce SOLAR 10.7B, an LLM with 10.7 billion parameters, demonstrating superior performance in various NLP tasks. Inspired by recent efforts to efficiently up-scale LLMs, we present a method for scaling LLMs called depth up-scaling (DUS), which encompasses depthwise scaling and continued pretraining. In contrast to other LLM up-scaling methods that use mixture-of-experts, DUS does not require complex changes to train and inference efficiently. We show experimentally that DUS is simple yet effective in scaling up high-performance LLMs from small ones. Building on the DUS model, we additionally present SOLAR 10.7B-Instruct, a variant fine-tuned for instruction-following capabilities, surpassing Mixtral-8x7B-Instruct. SOLAR 10.7B is publicly available under the Apache 2.0 license, promoting broad access and application in the LLM field.

Summary

  • The paper presents a novel depth up-scaling method that structurally expands transformer architectures and recovers performance with continued pretraining.
  • It reports strong benchmark results, outperforming models such as Llama 2 and Mistral 7B on benchmarks including ARC, HellaSwag, and GSM8K.
  • The model is released under the Apache 2.0 license, encouraging broader NLP research and application by keeping scaling simple and computationally inexpensive.

Overview of SOLAR 10.7B: Scaling LLMs

The paper presents SOLAR 10.7B, an LLM built with a new methodology termed depth up-scaling (DUS). DUS scales a pretrained model to a larger size without the training- and inference-time complexity of earlier approaches such as the mixture-of-experts (MoE) framework.

Depth Up-Scaling Methodology

Depth up-scaling increases a model's depth rather than its width. The method starts from a pretrained transformer base model (in the paper, a 32-layer Llama 2 architecture initialized with Mistral 7B weights) and proceeds in two core steps:

  1. Depthwise Scaling: The base model with n layers is duplicated. The final m layers are removed from the original copy and the first m layers from the duplicate, and the two truncated stacks are concatenated, yielding a scaled model with s = 2(n − m) layers. SOLAR uses n = 32 and m = 8, giving s = 48; that is, the 32-layer base grows into a 48-layer model (a minimal sketch of this layer-slicing step follows this list).
  2. Continued Pretraining: Depthwise scaling alone degrades performance relative to the base model, so the scaled model undergoes further pretraining. This step recovers, and then surpasses, the base model's performance by exploiting the added capacity.
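
The layer arithmetic above can be illustrated with a short sketch. The following Python snippet is not the authors' code; it only shows the slicing-and-concatenation step on a generic ordered list of transformer blocks, with n, m, and the 48-layer result matching the paper's configuration.

```python
import copy

def depth_up_scale(layers, m=8):
    """Depthwise scaling sketch: duplicate the base model's layer stack,
    drop the final m layers from the original copy and the first m layers
    from the duplicate, then concatenate the two truncated stacks."""
    n = len(layers)
    duplicate = copy.deepcopy(layers)
    scaled = layers[: n - m] + duplicate[m:]  # s = 2 * (n - m) layers
    assert len(scaled) == 2 * (n - m)
    return scaled

# With n = 32 and m = 8 this yields the 48-layer stack used by SOLAR 10.7B.
toy_layers = [f"block_{i}" for i in range(32)]  # stand-ins for real transformer blocks
print(len(depth_up_scale(toy_layers)))  # 48
```

In a real implementation the elements of `layers` would be transformer decoder blocks (e.g., torch.nn.Module instances), and the concatenated stack would then go through the continued pretraining described in step 2.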

Results and Performance

SOLAR 10.7B and its fine-tuned variant SOLAR 10.7B-Instruct outperform competitive LLMs such as Llama 2 and Mistral 7B across multiple natural language processing benchmarks. Notably, SOLAR 10.7B-Instruct surpasses Mixtral-8x7B-Instruct on instruction-following evaluations while remaining a standard dense transformer, underscoring the efficiency gains that DUS provides.

Evaluation and Implications

The model's efficacy is substantiated through evaluations on ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K, the benchmarks used by the Open LLM Leaderboard. The results indicate that DUS reuses pretrained weights effectively and scales the architecture without additional gating modules, custom kernels, or other framework changes (a hedged evaluation sketch follows).
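
As an illustration of how such an evaluation could be reproduced, the sketch below uses EleutherAI's lm-evaluation-harness (v0.4-style Python API) on the ARC-Challenge task with the 25-shot setting used by the Open LLM Leaderboard. The harness, the task name, and the Hugging Face model ID upstage/SOLAR-10.7B-v1.0 are assumptions of this sketch rather than details from the paper, and task names or arguments may differ across harness versions.

```python
# pip install lm-eval  (EleutherAI lm-evaluation-harness, v0.4.x assumed)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                        # Hugging Face backend
    model_args="pretrained=upstage/SOLAR-10.7B-v1.0,dtype=bfloat16",   # assumed Hub ID
    tasks=["arc_challenge"],                                           # one of the six benchmarks above
    num_fewshot=25,                                                    # Open LLM Leaderboard ARC setting
    batch_size=4,
)
print(results["results"]["arc_challenge"])                             # accuracy metrics for the task
```

The remaining benchmarks (HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K) use different few-shot counts on the leaderboard, so they would be run as separate calls with the appropriate settings.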

SOLAR 10.7B is released under the Apache 2.0 license, which makes it directly usable in NLP research and applications. Because DUS yields a standard dense transformer, the model runs on existing training and inference stacks and accommodates diverse computational resources without special hardware or software support (a brief usage sketch follows).
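
Since the weights are public, the model can be loaded with standard Hugging Face tooling. The snippet below is a minimal usage sketch, assuming the instruct variant is published on the Hub as upstage/SOLAR-10.7B-Instruct-v1.0 (an assumption of this sketch); it requires transformers, accelerate, and roughly 22 GB of GPU memory for 16-bit inference.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "upstage/SOLAR-10.7B-Instruct-v1.0"  # assumed Hugging Face Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 10.7B parameters -> roughly 21-22 GB in bf16
    device_map="auto",           # requires the `accelerate` package
)

prompt = "Explain depth up-scaling in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

No custom modeling code should be needed: because DUS produces a plain dense decoder, the checkpoint is expected to load through the stock Llama-style model classes in transformers.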

Future Directions

The work opens avenues for further research into depth-scaling methods. Future work may refine how layers are selected and initialized during scaling, or combine depth up-scaling with complementary techniques such as MoE as hardware and tooling evolve.

Conclusion

SOLAR 10.7B demonstrates that depth up-scaling is a simple and effective way to scale LLMs: it grows a strong small model into a larger one while preserving a standard dense architecture, and continued pretraining lets the scaled model surpass its base. The results make a case for computational efficiency and architectural simplicity as practical paths to scaling LLMs, and they invite further work on optimizing LLM architectures.
