TeleChat Technical Report (2401.03804v2)

Published 8 Jan 2024 in cs.CL and cs.AI

Abstract: In this technical report, we present TeleChat, a collection of LLMs with parameters of 3 billion, 7 billion and 12 billion. It includes pretrained LLMs as well as fine-tuned chat models that are aligned with human preferences. TeleChat is initially pretrained on an extensive corpus containing a diverse collection of texts from both English and Chinese languages, including trillions of tokens. Subsequently, the model undergoes fine-tuning to align with human preferences, following a detailed methodology that we describe. We evaluate the performance of TeleChat on various tasks, including language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves comparable performance to other open-source models of similar size across a wide range of public benchmarks. To support future research and applications utilizing LLMs, we release the fine-tuned model checkpoints of TeleChat's 7B and 12B variants, along with code and a portion of our pretraining data, to the public community.

An Overview of TeleChat: Technical Report

The paper "TeleChat Technical Report" presents TeleChat, a suite of LLMs designed with parameter scales of 3 billion, 7 billion, and 12 billion. The paper elaborates on the comprehensive steps involved in pretraining and fine-tuning these models to align with human preferences. This essay offers a detailed overview of TeleChat's design, pretraining, supervised fine-tuning, and specialized techniques, with an emphasis on the empirical outcomes and implications of these LLMs.

Context and Motivation

The proliferation of LLMs in natural language processing and understanding has dramatically advanced since the launch of models like ChatGPT. However, many high-profile models remain proprietary with restricted data sharing policies, creating a barrier for widespread research and development. In this context, TeleChat aims to fill the gap by offering a set of open-source LLMs, focusing on reproducibility and responsible AI development, particularly in the field of chat-based applications.

Design and Architecture

TeleChat employs an autoregressive transformer model architecture inspired by GPT-3 but incorporates several modifications from models like LLaMA and BLOOM. Key features include:

  • Rotary Position Embeddings (RoPE): To efficiently encode positional information and extend the context window to 96k tokens. The model also utilizes Flash Attention v2 for computational efficiency.
  • Normalizations: RMSNorm and pre-normalization methods are employed to enhance training stability.
  • Activations: The model uses SwiGLU activation functions, following recent trends in transformer optimization.

Detailed architectural parameters for TeleChat's configurations are summarized in the report, covering the number of layers, hidden sizes, feed-forward network sizes, and attention heads.
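
To make these components concrete, the sketch below assembles a single pre-norm decoder block with RoPE, RMSNorm, and SwiGLU in PyTorch. It is a minimal illustration under assumed dimensions and naming, not TeleChat's released implementation; the call to scaled_dot_product_attention stands in for Flash Attention, to which PyTorch dispatches when a suitable kernel is available.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescales by the RMS of each vector, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to a (batch, seq, heads, head_dim) tensor."""
    _, seq, _, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(seq, dtype=torch.float32)[:, None] * freqs[None]   # (seq, half)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward network: silu(x W_gate) * (x W_up), projected back down."""
    def __init__(self, dim: int, ffn_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, ffn_dim, bias=False)
        self.up = nn.Linear(dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: norm -> causal attention with RoPE -> residual,
    then norm -> SwiGLU feed-forward -> residual."""
    def __init__(self, dim: int, n_heads: int, ffn_dim: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.ffn = SwiGLU(dim, ffn_dim)

    def forward(self, x):
        b, s, d = x.shape
        q, k, v = self.qkv(self.attn_norm(x)).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, s, self.n_heads, self.head_dim) for t in (q, k, v))
        q, k = apply_rope(q), apply_rope(k)
        # scaled_dot_product_attention uses a FlashAttention kernel when one is available
        out = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True
        )
        x = x + self.proj(out.transpose(1, 2).reshape(b, s, d))
        return x + self.ffn(self.ffn_norm(x))

# Example: one block with placeholder sizes (not TeleChat's actual configuration)
block = DecoderBlock(dim=512, n_heads=8, ffn_dim=1536)
hidden = block(torch.randn(2, 16, 512))
```

Pre-normalization places RMSNorm before the attention and feed-forward sublayers, which keeps residual activations well scaled and is the main reason it is favored for training stability.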

Pretraining

Data collection and preprocessing form the bedrock of TeleChat's pretraining. The model is pretrained on a vast bilingual corpus comprising trillions of tokens in English and Chinese. The preprocessing pipeline entails meticulous data cleaning steps such as rule-based filtering, deduplication, and categorization. Tokenization uses a byte-level BPE (BBPE) algorithm, yielding a tokenizer adept at handling diverse data, including code and mathematical text.
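
As a rough illustration of how such a tokenizer can be built, the sketch below trains a byte-level BPE vocabulary with the Hugging Face tokenizers library; the corpus file, vocabulary size, and special tokens are placeholders rather than TeleChat's actual settings.

```python
# Minimal BBPE training sketch with the Hugging Face `tokenizers` library.
# The corpus file, vocabulary size, and special tokens below are placeholders.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_sample.txt"],                 # placeholder: mixed English/Chinese/code text
    vocab_size=100_000,                          # illustrative size only
    min_frequency=2,
    special_tokens=["<pad>", "<bos>", "<eos>"],  # illustrative special tokens
)

# Byte-level BPE never produces out-of-vocabulary tokens: any string, including
# Chinese text, code, or formulas, decomposes into byte-level pieces.
encoding = tokenizer.encode("def add(a, b): return a + b  # 返回两数之和")
print(encoding.tokens)
```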

Training uses a cosine learning rate schedule with a batch size that ramps up early in training to ensure stable and efficient convergence. Batch generation strategies are implemented to ensure diversity and coherence across different contexts.
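
The following sketch shows one common way to implement such a schedule; the warmup length, peak and minimum learning rates, and batch-size ramp are illustrative values, not the hyperparameters reported for TeleChat.

```python
import math

def lr_at_step(step: int, max_steps: int, peak_lr: float, min_lr: float, warmup_steps: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def batch_size_at_step(step: int, start_bsz: int, final_bsz: int, ramp_steps: int) -> int:
    """Ramp the global batch size from start_bsz to final_bsz, in multiples of start_bsz."""
    if step >= ramp_steps:
        return final_bsz
    grown = start_bsz + (final_bsz - start_bsz) * step / ramp_steps
    return int(grown // start_bsz) * start_bsz

# Illustrative values only (not the report's hyperparameters)
for step in (0, 1_000, 50_000, 100_000):
    print(step,
          lr_at_step(step, 100_000, 3e-4, 3e-5, 2_000),
          batch_size_at_step(step, 128, 2_048, 10_000))
```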

Supervised Fine-Tuning and Alignment

Supervised fine-tuning (SFT) is employed to enhance the model's interaction quality and utility in practical applications:

  • Data Annotation: An extensive annotation process utilizing human annotators ensures high-quality, domain-specific annotated data.
  • Training Methodology: Techniques such as noisy embedding fine-tuning (NEFTune) and multi-stage long-context training are introduced. NEFTune helps prevent overfitting by adding noise to input embeddings, which is particularly effective when training data is limited (a minimal sketch of the noise injection follows this list).
  • Alignment with Human Preferences: Reinforcement learning (RL) through reward models and Proximal Policy Optimization (PPO) is utilized to fine-tune the model's outputs to be safe, useful, and aligned with human expectations.
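
A minimal sketch of the NEFTune noise injection is given below, assuming a Hugging Face-style model interface; the noise magnitude alpha is a tunable hyperparameter and the value shown is only an illustrative default.

```python
import torch

def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """NEFTune: add uniform noise scaled by alpha / sqrt(L * d) to the input
    embeddings during supervised fine-tuning (training only, never at inference).
    alpha is a tunable magnitude; 5.0 is just an illustrative default."""
    _, seq_len, dim = embeds.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeds).uniform_(-1.0, 1.0) * scale
    return embeds + noise

# Typical use inside a training step (model and batch names are placeholders):
#   embeds = model.get_input_embeddings()(batch["input_ids"])
#   embeds = neftune_noise(embeds, alpha=5.0)
#   loss = model(inputs_embeds=embeds, labels=batch["labels"]).loss
```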

Empirical Evaluation

TeleChat's performance is rigorously evaluated against a suite of benchmarks, demonstrating competitive results:

  • Examination Test Performance: TeleChat achieves superior rankings in various examination datasets such as MMLU, CMMLU, C-Eval, GAOKAO-Bench, and AGIEVAL when compared to other models of similar sizes.
  • Understanding and Reasoning: The model exhibits robust performance on traditional NLP, mathematics, and code tasks, including CSL, EPRSTMT, CHID, GSM8K, MATH, and HumanEval.
  • Mitigating Hallucinations: Integration with Knowledge Graphs (KG) significantly improves the model’s accuracy in factual questioning tasks, as demonstrated on the CCKS-2020 Knowledge Graph based Q&A task (an illustrative prompt-grounding sketch follows this list).
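
The report's exact KG pipeline is not reproduced here, but the general pattern of grounding answers in retrieved knowledge-graph facts can be illustrated with a hypothetical prompt builder; the retrieval step, prompt wording, and example triples below are assumptions for illustration only.

```python
def build_kg_grounded_prompt(question: str, triples: list[tuple[str, str, str]]) -> str:
    """Hypothetical illustration: prepend retrieved knowledge-graph triples to the
    question so the model answers from explicit facts rather than parametric memory.
    The retrieval step and prompt wording are assumptions, not TeleChat's pipeline."""
    facts = "\n".join(f"- {head} {relation} {tail}" for head, relation, tail in triples)
    return (
        "Answer the question using only the facts listed below.\n"
        f"Facts:\n{facts}\n"
        f"Question: {question}\nAnswer:"
    )

# Example with made-up triples; a real system would retrieve them from the KG.
prompt = build_kg_grounded_prompt(
    "Which company owns GitHub?",
    [("Microsoft", "owns", "GitHub"), ("Microsoft", "was co-founded by", "Bill Gates")],
)
print(prompt)
```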

Engineering and Practical Contributions

TeleChat leverages advanced parallel computing techniques, employing the Megatron-DeepSpeed framework to achieve efficient model training across large-scale distributed systems. This includes tensor parallelism, pipeline parallelism, and data parallelism, optimized by Zero Redundancy Optimizer (ZeRO).
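
As a hedged sketch of how ZeRO-style sharding under data parallelism is typically configured with DeepSpeed, the snippet below shows an illustrative configuration dictionary; the stage, batch sizes, and precision settings are placeholders rather than TeleChat's values, and the tensor- and pipeline-parallel degrees would be set on the Megatron side rather than here.

```python
# Illustrative DeepSpeed configuration for ZeRO optimizer-state sharding
# under data parallelism; all values below are placeholders.
import deepspeed  # imported for deepspeed.initialize, sketched below

ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # placeholder
    "gradient_accumulation_steps": 16,     # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,                        # shard optimizer states across data-parallel ranks
        "reduce_bucket_size": 5e8,
    },
}

# Typical wiring, assuming `model` is the Megatron-style model definition:
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```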

Implications and Future Directions

The release of TeleChat's fine-tuned model checkpoints, codebase, and portions of the pretraining data stands to bolster future research and foster innovations in AI-driven conversational agents. The model's robust performance across multiple benchmarks and its open-source nature significantly contribute to the democratization of advanced LLM technologies.

Future developments could explore improved context handling and further mitigation of hallucinations using enhanced retrieval-augmented generation approaches.

Conclusion

In summary, the TeleChat technical report provides a comprehensive account of developing, training, and evaluating a series of competitive LLMs aligned with human preferences. By prioritizing transparency and reproducibility, TeleChat contributes meaningful advancements to the field of large-scale language modeling and conversational AI.

Authors (36)
  1. Zihan Wang (181 papers)
  2. Xinzhang Liu (3 papers)
  3. Shixuan Liu (12 papers)
  4. Yitong Yao (2 papers)
  5. Yuyao Huang (9 papers)
  6. Zhongjiang He (11 papers)
  7. Xuelong Li (268 papers)
  8. Yongxiang Li (22 papers)
  9. Zhonghao Che (1 paper)
  10. Zhaoxi Zhang (19 papers)
  11. Yan Wang (733 papers)
  12. Xin Wang (1306 papers)
  13. Luwen Pu (1 paper)
  14. Ruiyu Fang (2 papers)
  15. Yu Zhao (207 papers)
  16. Jie Zhang (846 papers)
  17. Xiaomeng Huang (31 papers)
  18. Zhilong Lu (1 paper)
  19. Jiaxin Peng (4 papers)
  20. Wenjun Zheng (8 papers)
Citations (2)