MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT (2402.16840v1)

Published 26 Feb 2024 in cs.CL

Abstract: "Bigger the better" has been the predominant trend in recent LLMs development. However, LLMs do not suit well for scenarios that require on-device processing, energy efficiency, low memory footprint, and response efficiency. These requisites are crucial for privacy, security, and sustainable deployment. This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small LLMs (SLMs) for resource constrained devices. Our primary contribution is the introduction of an accurate and fully transparent open-source 0.5 billion (0.5B) parameter SLM, named MobiLlama, catering to the specific needs of resource-constrained computing with an emphasis on enhanced performance with reduced resource demands. MobiLlama is a SLM design that initiates from a larger model and applies a careful parameter sharing scheme to reduce both the pre-training and the deployment cost. Our work strives to not only bridge the gap in open-source SLMs but also ensures full transparency, where complete training data pipeline, training code, model weights, and over 300 checkpoints along with evaluation codes is available at : https://github.com/mbzuai-oryx/MobiLlama.

Efficient and Transparent Small LLMs: Introducing MobiLlama

Context and Motivation

The field of NLP has seen remarkable advances with the development of LLMs, characterized by vast parameter counts and strong performance on complex language tasks. Despite these capabilities, LLM deployment is hindered by substantial computational and memory requirements, which makes such models impractical for resource-constrained settings such as on-device processing and deployments with stringent privacy, security, and energy-efficiency requirements. Addressing these concerns, this paper introduces MobiLlama, a fully transparent, efficient, and open-source Small LLM (SLM) with 0.5 billion parameters, designed specifically for resource-constrained environments.

Related Work

Historically, the tendency has been to construct ever-larger models to achieve better performance on NLP tasks. Although effective, this trend comes with high computational costs and limited model transparency. Recent work on SLMs has begun to explore downsizing without significantly sacrificing capability, focusing on model efficiency and the viability of deployment on less capable hardware. However, a significant gap remains in the open-source availability of SLMs, limiting the scope for broader research and application in diverse environments.

Proposed Methodology

Focusing on reducing redundancy and computational demand without compromising performance, MobiLlama employs a shared Feed Forward Network (FFN) configuration across transformer blocks: a single FFN is reused by every block. This design significantly reduces the parameter count while retaining effectiveness across a wide range of NLP tasks. The training data, architecture details, and comprehensive evaluation metrics are made fully accessible to ensure transparency and reproducibility, in line with the need for open research in this domain.
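
To make the parameter-sharing scheme concrete, below is a minimal sketch of a decoder stack in which every block reuses one FFN module instance, so the FFN weights are stored only once regardless of depth. This is an illustration under assumed dimensions and a simplified attention layer, not the authors' implementation; positional embeddings, causal masking, and other details are omitted.

    # Minimal sketch of FFN sharing across transformer blocks (illustrative only).
    # Dimensions and the simplified attention layer are assumptions, not the
    # paper's configuration.
    import torch.nn as nn

    class FeedForward(nn.Module):
        def __init__(self, dim, hidden_dim):
            super().__init__()
            self.up = nn.Linear(dim, hidden_dim)
            self.down = nn.Linear(hidden_dim, dim)
            self.act = nn.SiLU()

        def forward(self, x):
            return self.down(self.act(self.up(x)))

    class Block(nn.Module):
        def __init__(self, dim, n_heads, shared_ffn):
            super().__init__()
            self.attn_norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
            self.ffn_norm = nn.LayerNorm(dim)
            self.ffn = shared_ffn  # the same module object in every block

        def forward(self, x):
            h = self.attn_norm(x)
            attn_out, _ = self.attn(h, h, h)  # causal mask omitted for brevity
            x = x + attn_out
            return x + self.ffn(self.ffn_norm(x))

    class SharedFFNDecoder(nn.Module):
        def __init__(self, dim=512, hidden_dim=2048, n_heads=8, n_layers=12):
            super().__init__()
            shared_ffn = FeedForward(dim, hidden_dim)  # created once
            self.blocks = nn.ModuleList(
                [Block(dim, n_heads, shared_ffn) for _ in range(n_layers)]
            )

        def forward(self, x):
            for block in self.blocks:
                x = block(x)
            return x

    model = SharedFFNDecoder()
    n_params = sum(p.numel() for p in model.parameters())
    print(f"total parameters: {n_params / 1e6:.1f}M")  # shared FFN counted once

Because each Block holds a reference to the same FeedForward object, its weights are counted and stored only once, which is where the reduction in total parameters comes from.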

Key Contributions

  1. Design Efficiency: MobiLlama shares a single FFN across all transformer blocks, substantially reducing the parameter count while maintaining competitive performance across various benchmarks (see the parameter-count sketch after this list).
  2. Transparency and Accessibility: The entire training pipeline, including code, data, and checkpoints, is made available, fostering an open research environment.
  3. Benchmarking Performance: MobiLlama outperforms existing SLMs in its parameter class across nine distinct benchmarks, showcasing the effectiveness of the model in diverse NLP tasks.
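
As a rough illustration of the reduction claimed in point 1, the following back-of-the-envelope calculation compares the FFN parameter cost of independent per-block FFNs with a single shared FFN, using the same assumed dimensions as the sketch above rather than the paper's exact configuration (biases ignored).

    # Back-of-the-envelope comparison under assumed dimensions.
    dim, hidden_dim, n_layers = 512, 2048, 12

    ffn_params = 2 * dim * hidden_dim        # up- and down-projection weight matrices
    per_block_total = n_layers * ffn_params  # an independent FFN in every block
    shared_total = ffn_params                # one FFN reused by all blocks

    print(f"per-block FFNs: {per_block_total / 1e6:.1f}M FFN parameters")  # ~25.2M
    print(f"shared FFN:     {shared_total / 1e6:.1f}M FFN parameters")     # ~2.1M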

Implementation Details

Underpinning MobiLlama is an architecture configuration that balances model depth and width, preserving accuracy without an excessive increase in parameters or computational demand. The model is pre-trained on the Amber dataset, which draws on a broad spectrum of linguistic sources, to provide comprehensive coverage of language use.
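
For readers who want to try the released weights, a hedged usage sketch with Hugging Face Transformers is shown below. The model identifier is an assumption; consult the linked GitHub repository (https://github.com/mbzuai-oryx/MobiLlama) for the exact hub name and any custom-code requirements.

    # Usage sketch: loading the released weights with Hugging Face Transformers.
    # The model identifier below is an assumption, not confirmed by the paper text.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "MBZUAI/MobiLlama-05B"  # assumed identifier; check the repository
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    inputs = tokenizer("Small language models are useful because", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))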

Evaluation and Results

Evaluating MobiLlama against existing SLM baselines shows stronger performance, particularly on tasks requiring complex language comprehension and generation. The model is also efficient to deploy, with lower energy consumption and memory requirements on resource-constrained devices than larger counterparts.

Future Directions

While MobiLlama represents a leap towards more practical and deployable SLMs, future work may explore further optimization of the shared FFN design, expansion into more diverse tasks, and continued efforts to enhance the model's understanding and generation capabilities. Additionally, addressing potential biases and improving the model's fairness and robustness are vital areas for ongoing research.

Conclusion

MobiLlama stands as a testament to the feasibility of developing efficient, effective, and fully transparent SLMs. By making strides towards models that are not only computationally economical but also accessible and open for extensive research, MobiLlama contributes to the democratization and advancement of the field of NLP, inviting further exploration and innovation in the development of SLMs suited for a broader range of applications.

Acknowledgements

The development and evaluation of MobiLlama were facilitated by significant computational resources and collaborative efforts, highlighting the collective progress toward more sustainable and inclusive AI research.

Authors (9)
  1. Omkar Thawakar
  2. Ashmal Vayani
  3. Salman Khan
  4. Hisham Cholakkal
  5. Rao M. Anwer
  6. Michael Felsberg
  7. Tim Baldwin
  8. Eric P. Xing
  9. Fahad Shahbaz Khan