- The paper introduces the EXAONE Deep series of language models, fine-tuned using specialized reasoning data and techniques like SFT, DPO, and Online RL to excel in reasoning tasks.
- The EXAONE Deep models were evaluated on various benchmarks, with the 32B variant achieving competitive performance and a 95.7 pass@1 score on the MATH-500 dataset.
- The research demonstrates that established fine-tuning techniques can significantly enhance LLM reasoning capabilities, with implications for improving automated reasoning tasks in diverse domains.
EXAONE Deep: A Comprehensive Evaluation of Reasoning Enhanced LLMs
The paper "EXAONE Deep: Reasoning Enhanced LLMs" from LG AI Research introduces a series of LLMs specifically fine-tuned to tackle reasoning tasks with notable efficacy. The primary lineup consists of three model variants: EXAONE Deep 2.4B, 7.8B, and 32B. These models are derived from the EXAONE 3.5 series and optimally tailored for superior performance in reasoning tasks such as mathematics, coding, and general knowledge benchmarks.
Model Training and Dataset
The EXAONE Deep series was trained on a reasoning-specialized dataset using Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (Online RL). The training data comprised 1.6 million instances for SFT, 20,000 instances for DPO, and an additional 10,000 instances for Online RL, collectively covering approximately 12 billion tokens. The dataset emphasizes a chain-of-thought (CoT) process, aiming to strengthen logical progression and self-correction in model outputs.
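To make the preference-optimization stage concrete, the sketch below shows the standard DPO objective as published by Rafailov et al., not the authors' implementation; the function name, the choice of beta, and the toy log-probabilities are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen response
    over the rejected one by a wider margin than a frozen reference model.
    beta controls how far the policy may drift from the reference."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(margin)) is minimized when chosen outscores rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of summed sequence log-probabilities (hypothetical values)
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -11.2])
ref_chosen = torch.tensor([-12.9, -10.4])
ref_rejected = torch.tensor([-13.8, -11.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In a full pipeline, these log-probabilities would be computed per preference pair by the policy being trained and a frozen copy of the SFT model.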
Evaluation and Results
The models were evaluated across a range of benchmarks, including MATH-500, AIME 2024 and 2025, CSAT 2025, GPQA Diamond, LiveCodeBench, MMLU, and MMLU-Pro. The EXAONE Deep 32B model performed competitively with leading open-weight models such as QwQ-32B and DeepSeek-R1, and surpassed DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Llama-70B. The models achieved particularly strong accuracy on mathematical problem-solving benchmarks, with the 32B variant reaching a pass@1 score of 95.7 on MATH-500, indicative of its advanced reasoning capabilities.
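For readers unfamiliar with the metric, pass@1 is commonly computed with the unbiased pass@k estimator from Chen et al. (2021). The snippet below is a minimal sketch of that estimator, not the paper's evaluation harness, and the sample counts are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples is correct, given n generated samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 16 samples generated, 15 judged correct
print(pass_at_k(16, 15, 1))  # 0.9375 -- for k=1 this reduces to c / n
```

A benchmark-level score such as 95.7 is then the average of these per-problem estimates across the dataset.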
Implications and Future Directions
The research highlights how well-established fine-tuning techniques can elevate reasoning performance in LLMs. In practice, the EXAONE Deep models are poised to improve automated reasoning in diverse domains, potentially facilitating advances in algorithmic trading, automated theorem proving, and other areas requiring intricate logical deduction.
On the theoretical side, the paper demonstrates the value of incorporating extensive chain-of-thought training data to refine the reasoning abilities of large-scale LLMs. This approach opens avenues for future work on model transparency and interpretability, particularly in contexts demanding complex multi-step reasoning.
Limitations
While the EXAONE Deep models are optimized for reasoning, the paper acknowledges their limitations in broader applications that demand comprehensive instruction-following capabilities. For more general usage scenarios, the authors recommend the general-purpose EXAONE 3.5 Instruct models instead.
Conclusion
The "EXAONE Deep: Reasoning Enhanced LLMs" paper underscores the effective application of conventional fine-tuning strategies to boost reasoning capacities in LLMs. LG AI Research provides a substantial contribution to the landscape of AI research by openly offering their models for further paper and development. The successful deployment of these models across various benchmarks illustrates the potential depth of reasoning that can be achieved, paving the way for future exploration and refinement of reasoning-enhanced LLMs.