
Tele-FLM Technical Report (2404.16645v1)

Published 25 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual LLM that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.

Efficient Scaling of Multilingual LLMs: Introducing Tele-FLM (FLM-2)

Introduction

This paper introduces Tele-FLM, a 52-billion-parameter, open-sourced multilingual LLM that demonstrates efficient scaling and superior multilingual capabilities. A streamlined model-production pipeline and a principled hyperparameter-search methodology keep the trial-and-error cost and computational resources typically associated with training models at this scale to a minimum.

Pre-training Details

Data Processing and Model Configurations:

  • The training dataset comprises text from diverse domains, processed with a robust pipeline to ensure high quality and a uniform distribution, with particular focus on English and Chinese text.
  • Relative to its predecessor, FLM-101B, the architecture adopts optimized normalization and activation functions, which contribute to stable training dynamics (a sketch of such a block follows this list).
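
The summary above does not name the exact normalization and activation choices. The sketch below assumes RMSNorm and a SwiGLU feed-forward inside a pre-norm transformer block, which are common choices in this model family; the dimensions, module names, and the use of PyTorch's built-in attention (without causal masking or positional encoding) are illustrative assumptions, not Tele-FLM's actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        """Root-mean-square normalization: rescales by the RMS of the
        features, with no mean subtraction and no bias term."""
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))

        def forward(self, x):
            rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
            return x * rms * self.weight

    class SwiGLU(nn.Module):
        """Gated feed-forward: SiLU(x W_gate) * (x W_up), projected back down."""
        def __init__(self, dim: int, hidden: int):
            super().__init__()
            self.w_gate = nn.Linear(dim, hidden, bias=False)
            self.w_up = nn.Linear(dim, hidden, bias=False)
            self.w_down = nn.Linear(hidden, dim, bias=False)

        def forward(self, x):
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    class Block(nn.Module):
        """Pre-norm block: normalization is applied before each sub-layer,
        which tends to stabilize training at scale. Causal masking and
        positional encoding are omitted for brevity."""
        def __init__(self, dim: int = 1024, heads: int = 8):
            super().__init__()
            self.norm1 = RMSNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = RMSNorm(dim)
            self.ffn = SwiGLU(dim, hidden=4 * dim)

        def forward(self, x):
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            return x + self.ffn(self.norm2(x))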

Parallelism and Training Infrastructure:

  • Tele-FLM employs 3D parallel training, combining data, tensor, and pipeline parallelism to distribute computation across a cluster of 896 Nvidia A800 GPUs (an illustrative layout sketch follows this list).
  • These parallel training techniques enable efficient scaling and robust training dynamics, allowing the model to train with minimal restarts and little wasted computation.
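
To make the 3D layout concrete, the back-of-the-envelope sketch below factorizes the 896 GPUs into tensor-, pipeline-, and data-parallel degrees. The specific degrees, micro-batch size, and gradient-accumulation steps are hypothetical assumptions for illustration; the report's actual configuration may differ.

    # Hypothetical 3D-parallel layout over the 896-GPU A800 cluster.
    # The product of the three degrees must equal the world size; the exact
    # degrees used for Tele-FLM are not restated here, so these are assumptions.
    WORLD_SIZE = 896

    tensor_parallel = 4    # each weight matrix is sharded across 4 GPUs
    pipeline_parallel = 7  # the layer stack is split into 7 sequential stages
    data_parallel = WORLD_SIZE // (tensor_parallel * pipeline_parallel)  # = 32

    assert tensor_parallel * pipeline_parallel * data_parallel == WORLD_SIZE

    # Sequences processed per optimizer step under this layout:
    micro_batch = 1   # per-replica micro-batch size (assumption)
    grad_accum = 16   # micro-batches accumulated per pipeline flush (assumption)
    global_batch = micro_batch * grad_accum * data_parallel
    print(f"DP={data_parallel}, TP={tensor_parallel}, PP={pipeline_parallel}, "
          f"global batch = {global_batch} sequences per step")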

Performance and Evaluation

Benchmark Performance:

  • Tele-FLM achieves strong scores on both English and Chinese language modeling benchmarks, demonstrating strong compression ability in the form of low Bits-Per-Byte (BPB), a key performance indicator for LLMs (a minimal sketch of the BPB computation follows this list).
  • The model performs on par with or better than larger models such as Llama2-70B and Qwen1.5-72B on a range of datasets, substantiating its robust multilingual capabilities.
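
BPB normalizes a model's total negative log-likelihood by the byte length of the raw text, which makes scores comparable across tokenizers and vocabularies; lower is better. A minimal sketch of the computation follows (the corpus size and loss value in the example are made up):

    import math

    def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
        """Convert a corpus-level negative log-likelihood, summed over tokens
        and measured in nats, into Bits-Per-Byte of the underlying UTF-8 text."""
        return total_nll_nats / (math.log(2) * n_bytes)

    # Toy example: a 1,000,000-byte corpus on which the summed cross-entropy
    # loss over all predicted tokens is 450,000 nats.
    print(bits_per_byte(450_000, 1_000_000))  # ~0.649 BPB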

Evaluation Insights:

  • Detailed evaluation results highlight Tele-FLM's consistent performance across English and Chinese benchmarks.
  • It shows particular strength in tasks requiring in-depth language understanding, reasoning, and code generation, as evidenced by its performance on specialized benchmarks such as HumanEval and BIG-Bench Hard.

Discussion and Implications

General Observations:

  • High-quality, diversified pre-training data contributes significantly to the model's comprehensive language understanding capabilities.
  • Effective hyperparameter tuning, in particular μP (maximal update parametrization)-based search, plays a crucial role in enhancing model performance and ensuring efficient scaling (a transfer-rule sketch follows this list).
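
μP lets hyperparameters tuned on a narrow proxy model transfer to the full-width model by rescaling a few quantities with the width ratio. The sketch below shows commonly used Adam-style transfer rules; the widths and base learning rate are illustrative placeholders, not the values searched for Tele-FLM.

    # Illustrative μP-style hyperparameter transfer from a small proxy model
    # to a wide target model. All concrete numbers here are assumptions.
    base_width = 256      # hidden size of the small proxy model
    target_width = 8192   # hidden size of the full-scale model
    width_mult = target_width / base_width

    base_lr = 3e-3        # learning rate found by searching on the proxy model

    # Common μP transfer rules with an Adam-style optimizer:
    #  - hidden (matrix-like) weights: learning rate scales as 1 / width_mult
    #  - output logits: multiplied by 1 / width_mult
    #  - input embeddings and biases: keep the base learning rate
    hidden_lr = base_lr / width_mult
    embedding_lr = base_lr
    output_logit_scale = 1.0 / width_mult

    print(f"hidden LR = {hidden_lr:.2e}, embedding LR = {embedding_lr:.2e}, "
          f"logit scale = {output_logit_scale:.4f}")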

Technical Insights:

  • Tele-FLM inherits and improves upon the low-carbon training techniques and advanced pre-training objectives of the FLM family, offering an eco-friendly yet powerful modeling approach.
  • The provided documentation of model architecture, pre-training details, and training dynamics offers valuable insights for both academic research and practical applications in the AI community.

Future Directions

The authors plan to continue refining Tele-FLM's capabilities to broaden its application spectrum and improve its efficiency. Future developments may include exploring larger model scales and enhancing the model's adaptability across more diverse languages and tasks.

Conclusions

The introduction of Tele-FLM marks significant progress in the development of scalable and efficient LLMs. By offering detailed insights and open-sourcing the model, the paper contributes valuably to the ongoing research and development in the field of AI and LLMs. Furthermore, the strategic improvements in model training and resource utilization present a promising direction for future large-scale AI model development.

Authors (20)
  1. Xiang Li
  2. Yiqun Yao
  3. Xin Jiang
  4. Xuezhi Fang
  5. Chao Wang
  6. Xinzhang Liu
  7. Zihan Wang
  8. Yu Zhao
  9. Xin Wang
  10. Yuyao Huang
  11. Shuangyong Song
  12. Yongxiang Li
  13. Zheng Zhang
  14. Bo Zhao
  15. Aixin Sun
  16. Yequan Wang
  17. Zhongjiang He
  18. Zhongyuan Wang
  19. Xuelong Li
  20. Tiejun Huang