
TeleChat Technical Report

Published 8 Jan 2024 in cs.CL and cs.AI | (2401.03804v2)

Abstract: In this technical report, we present TeleChat, a collection of LLMs with parameters of 3 billion, 7 billion and 12 billion. It includes pretrained LLMs as well as fine-tuned chat models that are aligned with human preferences. TeleChat is initially pretrained on an extensive corpus containing a diverse collection of texts from both English and Chinese languages, including trillions of tokens. Subsequently, the model undergoes fine-tuning to align with human preferences, following a detailed methodology that we describe. We evaluate the performance of TeleChat on various tasks, including language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves comparable performance to other open-source models of similar size across a wide range of public benchmarks. To support future research and applications utilizing LLMs, we release the fine-tuned model checkpoints of TeleChat's 7B and 12B variants, along with code and a portion of our pretraining data, to the public community.

Summary

  • The paper presents open-source LLMs with 3B, 7B, and 12B parameters, detailing comprehensive pretraining and alignment with human preferences.
  • It employs advanced techniques such as rotary position embeddings, Flash Attention v2, and noisy embedding fine-tuning to enhance model stability and efficiency.
  • Empirical evaluations demonstrate competitive performance across various benchmarks, supporting TeleChat's goals of reproducibility and responsible AI development.

An Overview of TeleChat: Technical Report

The paper "TeleChat Technical Report" presents TeleChat, a suite of LLMs designed with parameter scales of 3 billion, 7 billion, and 12 billion. The paper elaborates on the comprehensive steps involved in pretraining and fine-tuning these models to align with human preferences. This essay offers a detailed overview of TeleChat's design, pretraining, supervised fine-tuning, and specialized techniques, with an emphasis on the empirical outcomes and implications of these LLMs.

Context and Motivation

LLMs for natural language processing and understanding have proliferated and advanced dramatically since the launch of models like ChatGPT. However, many high-profile models remain proprietary with restricted data-sharing policies, creating a barrier to widespread research and development. In this context, TeleChat aims to fill the gap by offering a set of open-source LLMs focused on reproducibility and responsible AI development, particularly for chat-based applications.

Design and Architecture

TeleChat employs an autoregressive transformer model architecture inspired by GPT-3 but incorporates several modifications from models like LLaMA and BLOOM. Key features include:

  • Rotary Position Embeddings (RoPE): encode positional information efficiently and allow the context window to be extended to 96k tokens; the model also uses Flash Attention v2 for computational efficiency (a minimal RoPE sketch follows this list).
  • Normalizations: RMSNorm and pre-normalization methods are employed to enhance training stability.
  • Activations: The model uses SwiGLU activation functions, following recent trends in transformer optimization.
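
For context, here is a minimal PyTorch sketch of how rotary position embeddings rotate query/key vectors by position-dependent angles before attention. The head dimension, base frequency, and sequence length below are illustrative assumptions, not TeleChat's exact configuration.

```python
# Minimal rotary position embedding (RoPE) sketch; values are illustrative only.
import torch

def rope_angles(head_dim: int, max_len: int, base: float = 10000.0) -> torch.Tensor:
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_len).float()
    angles = torch.outer(positions, inv_freq)      # (max_len, head_dim / 2)
    return torch.cat([angles, angles], dim=-1)     # (max_len, head_dim)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim). Rotating queries and keys by a
    # position-dependent angle makes attention scores depend on relative offsets.
    seq_len = x.shape[1]
    cos = angles[:seq_len].cos()[None, :, None, :]
    sin = angles[:seq_len].sin()[None, :, None, :]
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat([-x2, x1], dim=-1)
    return x * cos + rotated * sin

# Usage: rotate queries and keys before computing attention scores.
angles = rope_angles(head_dim=128, max_len=4096)
q = torch.randn(1, 4096, 8, 128)
q_rot = apply_rope(q, angles)
```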

Detailed architectural parameters for TeleChat's configurations are summarized in the report, covering the number of layers, hidden sizes, feed-forward network sizes, and attention heads.

Pretraining

Data collection and preprocessing form the bedrock of TeleChat's pretraining. The model is pretrained on a corpus of trillions of tokens spanning English and Chinese. The preprocessing pipeline entails meticulous data-cleaning steps such as rule-based filtering, deduplication, and categorization. Tokenization uses a byte-level BPE (BBPE) algorithm, yielding a tokenizer adept at handling diverse data, including code and mathematical text.
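
As a rough illustration of this kind of cleaning pass, the following sketch rejects short or symbol-heavy documents and drops exact (normalized) duplicates. The specific rules and thresholds are assumptions for illustration, not the paper's actual pipeline.

```python
# Sketch of rule-based filtering plus exact deduplication; rules and thresholds
# are illustrative assumptions, not TeleChat's actual preprocessing pipeline.
import hashlib

def passes_rules(doc: str, min_chars: int = 200, max_symbol_ratio: float = 0.3) -> bool:
    """Reject documents that are too short or dominated by non-alphanumeric symbols."""
    if len(doc) < min_chars:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(doc), 1) <= max_symbol_ratio

def dedup_key(doc: str) -> str:
    """Normalize whitespace and case, then hash, so near-identical copies collide."""
    normalized = " ".join(doc.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def clean_corpus(docs):
    seen = set()
    for doc in docs:
        if not passes_rules(doc):
            continue
        key = dedup_key(doc)
        if key in seen:
            continue
        seen.add(key)
        yield doc
```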

The training is executed using a cosine learning rate schedule with a ramp-up batch size to ensure stable and efficient convergence. Batch generation strategies are implemented to ensure diversity and coherence across different contexts.
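
A minimal sketch of a cosine learning-rate schedule with linear warmup and a linearly ramped global batch size is shown below; all hyperparameter values are illustrative assumptions rather than TeleChat's published settings.

```python
# Cosine LR decay with warmup, plus a ramped-up global batch size.
# All values below are illustrative assumptions.
import math

def cosine_lr(step: int, max_steps: int, peak_lr: float, min_lr: float, warmup: int) -> float:
    if step < warmup:
        return peak_lr * step / max(warmup, 1)              # linear warmup
    progress = (step - warmup) / max(max_steps - warmup, 1)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def ramped_batch_size(step: int, start: int = 256, target: int = 4096, ramp_steps: int = 10000) -> int:
    """Grow the global batch size linearly from `start` to `target` over `ramp_steps`."""
    if step >= ramp_steps:
        return target
    return start + (target - start) * step // ramp_steps
```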

Supervised Fine-Tuning and Alignment

Supervised fine-tuning (SFT) is employed to enhance the model's interaction quality and utility in practical applications:

  • Data Annotation: An extensive annotation process utilizing human annotators ensures high-quality, domain-specific annotated data.
  • Training Methodology: Techniques such as noisy embedding fine-tuning (NEFTune) and multi-stage long-context training are introduced. NEFTune helps prevent overfitting by adding noise to input embeddings, which is particularly effective when training data is limited (see the sketch after this list).
  • Alignment with Human Preferences: Reinforcement learning (RL) through reward models and Proximal Policy Optimization (PPO) is utilized to fine-tune the model's outputs to be safe, useful, and aligned with human expectations.
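
Below is a minimal sketch of NEFTune-style noisy embedding fine-tuning, which adds uniform noise to token embeddings scaled by alpha / sqrt(seq_len * dim) during training only; the alpha value is an illustrative assumption.

```python
# NEFTune-style noise injection on token embeddings during fine-tuning.
# The alpha value is an illustrative assumption, not TeleChat's setting.
import torch

def neftune_embeddings(embeds: torch.Tensor, alpha: float = 5.0, training: bool = True) -> torch.Tensor:
    # embeds: (batch, seq_len, dim). Noise magnitude follows the NEFTune scaling
    # of alpha / sqrt(seq_len * dim); no noise is added at inference time.
    if not training:
        return embeds
    batch, seq_len, dim = embeds.shape
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeds).uniform_(-1.0, 1.0) * scale
    return embeds + noise
```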

Empirical Evaluation

TeleChat's performance is rigorously evaluated against a suite of benchmarks, demonstrating competitive results:

  • Examination Test Performance: TeleChat achieves superior rankings in various examination datasets such as MMLU, CMMLU, C-Eval, GAOKAO-Bench, and AGIEVAL when compared to other models of similar sizes.
  • Understanding and Reasoning: The model exhibits robust performance on language understanding datasets such as CSL, EPRSTMT, and CHID, as well as on mathematics and code generation benchmarks including GSM8K, MATH, and HumanEval.
  • Mitigating Hallucinations: Integration with knowledge graphs (KG) significantly improves the model's accuracy on factual question answering, as demonstrated on the CCKS-2020 knowledge-graph-based Q&A task (a simple grounding sketch follows this list).
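
The sketch below illustrates one simple way knowledge-graph grounding can work: retrieve triples about entities mentioned in the question and prepend them to the prompt. The triple store, entity matching, and prompt format are assumptions for illustration, not the paper's exact KG integration.

```python
# Toy knowledge-graph grounding: look up triples for entities in the question
# and prepend them as facts. Data structures and prompt format are assumptions.
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def retrieve_triples(question: str, kg: Dict[str, List[Triple]]) -> List[Triple]:
    """Naive entity linking: return triples whose subject string appears in the question."""
    return [t for subject, triples in kg.items() if subject in question for t in triples]

def build_grounded_prompt(question: str, kg: Dict[str, List[Triple]]) -> str:
    facts = retrieve_triples(question, kg)
    fact_lines = "\n".join(f"- {s} {r} {o}" for s, r, o in facts)
    return f"Known facts:\n{fact_lines}\n\nQuestion: {question}\nAnswer:"
```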

Engineering and Practical Contributions

TeleChat leverages advanced parallel computing techniques, employing the Megatron-DeepSpeed framework to achieve efficient model training across large-scale distributed systems. This includes tensor parallelism, pipeline parallelism, and data parallelism, optimized by Zero Redundancy Optimizer (ZeRO).
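
The toy sketch below illustrates the core idea behind ZeRO-style partitioning: each data-parallel rank stores and updates only its contiguous shard of the parameters and optimizer state, and the updated shards are then gathered by all ranks. This is a conceptual illustration only, not the Megatron-DeepSpeed implementation.

```python
# Conceptual ZeRO-style sharding: each rank updates only its slice of the
# flattened parameter vector, then shards are all-gathered (not shown).
from typing import List, Tuple

def shard_bounds(num_params: int, rank: int, world_size: int) -> Tuple[int, int]:
    """Contiguous slice of the flattened parameter vector owned by `rank`."""
    per_rank = (num_params + world_size - 1) // world_size
    start = rank * per_rank
    return start, min(start + per_rank, num_params)

def local_sgd_update(params: List[float], grads: List[float], rank: int,
                     world_size: int, lr: float = 1e-2) -> List[float]:
    """Each rank applies the optimizer update only to the parameters in its shard."""
    start, end = shard_bounds(len(params), rank, world_size)
    updated = list(params)
    for i in range(start, end):
        updated[i] = params[i] - lr * grads[i]
    return updated  # in practice, updated shards are then all-gathered across ranks
```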

Implications and Future Directions

The release of TeleChat's fine-tuned model checkpoints, codebase, and portions of the pretraining data stands to bolster future research and foster innovations in AI-driven conversational agents. The model's robust performance across multiple benchmarks and its open-source nature significantly contribute to the democratization of advanced LLM technologies.

Future developments could explore improved context handling and further mitigation of hallucinations using enhanced retrieval-augmented generation approaches.

Conclusion

In summary, the TeleChat technical report provides a comprehensive account of developing, training, and evaluating a series of competitive LLMs aligned with human preferences. By prioritizing transparency and reproducibility, TeleChat contributes meaningful advancements to the field of large-scale language modeling and conversational AI.
