An Expert Review of "Qwen2 Technical Report"
The "Qwen2 Technical Report" details the development and evaluation of the Qwen2 series, extensive LLMs and multi-modal models from the Qwen Team at Alibaba Group. This suite includes both foundational and instruction-tuned models, and covers a wide range of model sizes, from 0.5 billion to 72 billion parameters. These models are distinguished by dense and Mixture-of-Experts (MoE) architectures. In this essay, we will provide an expert overview of the key aspects of the Qwen2 series, including the architecture, training regimens, evaluations, and notable achievements.
Model Architecture and Design
The Qwen2 series builds on the Transformer architecture and integrates several techniques aimed at performance and scalability. Across the dense models, key improvements include Grouped Query Attention (GQA) to reduce KV-cache usage, Dual Chunk Attention (DCA) for managing long context windows, and YaRN for better length extrapolation.
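To make the GQA idea concrete, below is a minimal sketch in PyTorch in which several query heads share one key/value head, shrinking the KV cache; the head counts and dimensions are illustrative and not the actual Qwen2 configuration.

```python
# Minimal grouped query attention (GQA) sketch: query heads share KV heads.
# Shapes and head counts are illustrative, not Qwen2's real configuration.
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    # q: (batch, seq, n_q_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)
    group_size = q.shape[2] // k.shape[2]
    # Repeat each KV head so it is shared by `group_size` query heads.
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)
    # Move to (batch, heads, seq, head_dim) for scaled dot-product attention.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2)  # back to (batch, seq, n_q_heads, head_dim)

# 16 query heads attend over only 4 cached KV heads.
b, s, d = 1, 128, 64
q, k, v = torch.randn(b, s, 16, d), torch.randn(b, s, 4, d), torch.randn(b, s, 4, d)
print(gqa_attention(q, k, v).shape)  # torch.Size([1, 128, 16, 64])
```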
For the MoE model, Qwen2-57B-A14B, a fine-grained expert design is employed: the feed-forward capacity is split across many smaller experts rather than a few large FFNs, allowing more dynamic and diverse expert utilization. Design choices around expert granularity, routing, and initialization are geared toward enhancing performance and adaptability.
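As a rough illustration of fine-grained routing, the sketch below implements a tiny top-k MoE layer with many small experts; the expert count, expert width, and top-k value are placeholders rather than the Qwen2-57B-A14B configuration, and the per-expert loop favors clarity over efficiency.

```python
# Tiny fine-grained MoE sketch: many small experts, top-k softmax gating per token.
# All sizes are illustrative; this is not the Qwen2-57B-A14B routing implementation.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=256, d_expert=128, n_experts=32, top_k=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small FFN instead of one large one.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):  # each token is processed by its top_k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```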
Training and Post-Training
Qwen2 models were pre-trained on a dataset exceeding 7 trillion tokens, curated with an emphasis on quality and diversity, particularly in coding and mathematics. Architectural choices such as rotary position embedding (RoPE) were central to extending long-context processing to 32,768 tokens during pre-training. The models were subsequently aligned with human preferences in a post-training phase.
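For reference, the sketch below shows the core of RoPE: each query/key feature pair is rotated by a position-dependent angle, which is what length-extrapolation techniques such as YaRN later rescale. The base frequency and tensor shapes are illustrative and not Qwen2's exact hyperparameters.

```python
# Minimal rotary position embedding (RoPE) sketch; base and shapes are illustrative.
import torch

def rope(x, base=10000.0):
    # x: (batch, seq, heads, head_dim) with an even head_dim
    _, s, _, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(s, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq, d/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each (x1, x2) feature pair by its position-dependent angle.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 1024, 16, 64)
print(rope(q).shape)  # torch.Size([1, 1024, 16, 64])
```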
In the post-training phase, both supervised fine-tuning and direct preference optimization (DPO) were employed, leveraging large-scale multi-turn instruction-following datasets. Automated data synthesis techniques, such as rejection sampling and execution feedback, significantly improved the quality of the training data without heavy reliance on human annotation. This process enhanced the coding, mathematical, and multilingual performance of the models.
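As a rough illustration of the DPO objective, the following sketch computes the loss for a single preference pair from summed sequence log-probabilities under the policy and a frozen reference model; the beta value and log-probabilities are assumed placeholders, not figures from the report.

```python
# Minimal DPO loss sketch for one chosen/rejected pair; all values are placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are log-prob ratios of the policy against the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward))

# Placeholder sequence log-probs (summed over tokens).
loss = dpo_loss(torch.tensor(-42.0), torch.tensor(-55.0),
                torch.tensor(-45.0), torch.tensor(-50.0))
print(loss.item())
```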
Evaluation and Performance
Evaluations were performed comprehensively across multiple benchmarks. Qwen2 models demonstrated strong performance in natural language understanding, coding, mathematics, and multilingual capabilities. The flagship model, Qwen2-72B, achieved notably strong results, exemplified by scores of 84.2 on MMLU and 64.6 on HumanEval.
In-Depth Model Comparisons
- Qwen2-72B: Outperformed numerous competing models, including Mixtral-8x22B and Llama-3-70B, with substantial improvements noted across both coding and mathematics tasks.
- Qwen2-57B-A14B (MoE): Comparable to dense models with around 30 billion parameters, demonstrating competitive performance overall, and superior capabilities in coding and mathematics tasks.
- Qwen2-7B: Demonstrated significant advantages especially in coding and multilingual evaluations, outperforming strong baselines like Llama-3-8B.
Long Context and Multilingual Capabilities
Qwen2 models integrate enhancements, notably Dual Chunk Attention and YaRN, to handle extended contexts effectively. Evaluations such as Needle in a Haystack and NeedleBench confirmed their proficiency with context lengths of up to 128K tokens.
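To show how such retrieval probes are typically constructed, here is a minimal needle-in-a-haystack style sketch; the filler text, needle, and the `generate` callable are hypothetical and not part of the Qwen2 evaluation suite.

```python
# Hypothetical needle-in-a-haystack probe: bury one fact in long filler text and
# check whether the model retrieves it. Not the Qwen2 team's evaluation code.
FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passphrase is 'amber-falcon-42'. "
QUESTION = "What is the secret passphrase mentioned in the text above?"

def build_prompt(total_sentences: int, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    haystack = [FILLER] * total_sentences
    haystack.insert(int(depth * total_sentences), NEEDLE)
    return "".join(haystack) + "\n\n" + QUESTION

def passed(answer: str) -> bool:
    return "amber-falcon-42" in answer

# Example sweep; `generate` is an assumed wrapper around a long-context model.
# for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
#     print(depth, passed(generate(build_prompt(total_sentences=20_000, depth=depth))))
```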
Multilingual performance was rigorously evaluated across several languages, where Qwen2-72B-Instruct demonstrated competitive scores against prominent proprietary models such as GPT-4-Turbo and Claude-3-Opus. Evaluations confirmed improvements in handling diverse languages including Arabic, French, Russian, and Indonesian.
Safety and Responsibility
Safety evaluations addressed critical issues such as illegal behavior and content moderation. The Qwen2-72B-Instruct model exhibited robust performance, outperforming both GPT-4 and Mixtral-8x22B-Instruct on safety metrics, though room for improvement remains, particularly around sensitive content such as pornography.
Conclusion and Future Directions
The Qwen2 series represents a significant advancement in open-weight LLMs, exhibiting impressive capabilities across diverse evaluation metrics. Their strong performance in language understanding, multi-modality, multilingual support, and long context processing indicates their potential for broad applications.
Future directions might include continued pre-training to further enhance the MoE models, broadening their applicability to complex instruction-following tasks, and refining safety mechanisms to further mitigate potential risks. The open release of Qwen2 model weights on platforms such as Hugging Face, together with supplementary materials, underscores the team's commitment to fostering community innovation and research in AI.
Overall, Qwen2 models exemplify the next phase of development in LLMs, reflecting continuous improvements in architecture, training, evaluation, and safety—ushering in broader and more accessible AI advancements.