
360Zhinao Technical Report

Published 22 May 2024 in cs.CL and cs.AI (arXiv:2405.13386v1)

Abstract: We present 360Zhinao models with 7B parameter size and context lengths spanning 4K, 32K and 360K, all available at https://github.com/Qihoo360/360zhinao. For rapid development in pretraining, we establish a stable and sensitive ablation environment to evaluate and compare experiment runs with minimal model size. Under such guidance, we perfect our data cleaning and composition strategies to pretrain $\texttt{360Zhinao-7B-Base}$ on 3.4T tokens. We also mainly emphasize data during alignment, where we strive to balance quantity and quality with filtering and reformatting. With tailored data, 360Zhinao-7B's context window is easily extended to 32K and 360K. RMs and RLHF are trained following SFT and credibly applied to specific tasks. All together these contributions lead to 360Zhinao-7B's competitive performance among models of similar size.

Summary

  • The paper introduces the 360Zhinao models with 7B parameters and context lengths of up to 360K tokens for long-context language understanding.
  • The paper details an innovative pretraining pipeline that processes 3.4 trillion tokens through robust cleaning, deduplication, and data mixture strategies.
  • The paper demonstrates effective alignment through supervised fine-tuning and reinforcement learning from human feedback, improving multi-turn dialogue and long-context performance.

Technical Evaluation of 360Zhinao: An Open-Source LLM

The paper "360Zhinao Technical Report" delineates the development and evaluation of the 360Zhinao LLMs, focusing on a 7-billion-parameter model with context lengths extended up to 360K tokens. The report provides a detailed exposition of both the pretraining and alignment stages, showcasing the methodologies and results achieved by the 360Zhinao Team.

Pretraining: Data Strategy and Model Architecture

In the pretraining stage, a robust data processing pipeline was established to handle and refine a massive corpus of 3.4 trillion tokens. The data pipeline was segmented into preparation, cleaning, deduplication, and data mixture phases, ensuring high data integrity and efficiency.

  1. Data Preparation and Cleaning: Preprocessing involved URL filtering, language recognition, junk text removal, and PII filtering. Notably, deduplication was applied systematically at the document, paragraph, and sentence levels, improving the effective diversity of the corpus and the model's generalization.
  2. Data Mixture: The corpus for pretraining was composed of diverse sources including web pages, books, and technical documents. A balance was achieved between data diversity, quality, and efficiency through strategic oversampling and deduplication methods, as corroborated by detailed ablation studies.
  3. Model Architecture: The model adopts the transformer-based Pre-Norm architecture with RMSNorm, SwiGLU activation, and Rotary Position Embedding (RoPE) for superior performance. The use of Flash Attention for training efficiency and the AdamW optimizer with a cosine decay learning rate further support stable and effective pretraining.
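
As a toy illustration of the deduplication step above, the sketch below drops exact-duplicate paragraphs across a corpus by hashing normalized text. The function name and document format are assumptions for illustration; the actual 360Zhinao pipeline is not released in this form and also applies fuzzy, multi-level matching.

```python
import hashlib

def dedup_paragraphs(documents):
    """Drop exact-duplicate paragraphs corpus-wide via content hashing.

    A minimal sketch of paragraph-level exact deduplication; the real
    pipeline described in the paper also works at the document and
    sentence levels and uses approximate matching.
    """
    seen = set()
    cleaned_docs = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            # Normalize whitespace and case before hashing so trivially
            # different copies of the same paragraph collide.
            key = hashlib.sha256(
                " ".join(para.split()).lower().encode("utf-8")
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(para)
        cleaned_docs.append("\n\n".join(kept))
    return cleaned_docs
```

A first-seen-wins policy like this keeps one copy of boilerplate (e.g. a shared footer) and strips the repeats from later documents.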

The model's performance was benchmarked across several widely recognized challenges, affirming its competitive edge in knowledge comprehension and reasoning tasks, particularly within multi-lingual contexts.
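
The architectural components listed above can be sketched in a few lines of NumPy. This is a minimal illustration of RMSNorm and the SwiGLU feed-forward, not the released implementation; weight shapes and function names are assumptions.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm rescales by the root-mean-square of the activations;
    # unlike LayerNorm, it does not subtract the mean.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * weight

def swiglu_ffn(x, W_gate, W_up, W_down):
    # SwiGLU feed-forward: (SiLU(x @ W_gate) * (x @ W_up)) @ W_down
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(z) = z * sigmoid(z)
    return (silu * (x @ W_up)) @ W_down
```

In a Pre-Norm transformer block, `rms_norm` would be applied to the residual stream before the attention and feed-forward sublayers.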

Alignment: Supervised Fine-Tuning and RLHF

The alignment phase involved refining the pretrained base model using Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from Human Feedback (RLHF).

  1. Supervised Fine-Tuning (SFT): Initial SFT approaches emphasized data quantity, but later shifted towards ensuring high quality data, particularly by integrating PoT data for enhanced performance on mathematical and reasoning tasks. The resulting models demonstrated top-tier results on multi-turn dialogue benchmarks like MT-Bench, underscoring the efficacy of curated datasets.
  2. Long Context Extension: The context length was extended from 4K to 32K and further to 360K by modifying the RoPE base, supported by curated long-format data. The extended models performed notably well on LongBench and NIAH (needle-in-a-haystack) tasks, validating their utility for handling extensive context in practical scenarios.
  3. Reinforcement Learning from Human Feedback: The RLHF infrastructure was refined to accommodate large-scale PPO on expansive models. While general-task improvements through PPO were moderate, task-specific gains, notably in translation and code-switch scenarios, were substantive. Furthermore, using reward models (RMs) as evaluators and filters added substantial utility to ongoing model refinement.
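
The RoPE base modification mentioned above can be illustrated as follows. The specific base values here are hypothetical, chosen only to show the mechanism: enlarging the base stretches the rotation wavelengths, so positions far beyond the original training window still fall within one period.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Per-dimension-pair rotation frequencies theta_i = base^(-2i/d),
    # as in standard Rotary Position Embedding.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

# Wavelength of each rotary pair is 2*pi / theta_i. A larger base
# lengthens every wavelength (base values below are illustrative,
# not the ones used for 360Zhinao's 32K/360K extensions).
short_wavelengths = 2 * np.pi / rope_frequencies(128, base=10_000.0)
long_wavelengths = 2 * np.pi / rope_frequencies(128, base=1_000_000.0)
```

Fine-tuning on long-format data with the enlarged base then teaches the model to use the newly reachable positions.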

Contributions and Future Implications

The 360Zhinao models advance open-source LLM capabilities in both sophistication and accessibility. The comprehensive ablation studies effectively guide data-strategy optimization, while the release of models and datasets fosters transparency and community collaboration.

The paper outlines potential extensions and the iterative nature of LLM development, emphasizing ongoing improvements in larger models and more refined alignment techniques. As such efforts progress, the practical and theoretical expansions of LLMs like 360Zhinao will continue to impact AI applications in various domains, from natural language understanding to practical deployment in end-user products like AI browsers and search engines. This research provides a blueprint for future endeavors aiming to elevate the precision and applicability of LLMs in addressing complex real-world tasks.
