H2O-Danube3 Technical Report (2407.09276v1)

Published 12 Jul 2024 in cs.CL and cs.LG

Abstract: We present H2O-Danube3, a series of small LLMs consisting of H2O-Danube3-4B, trained on 6T tokens and H2O-Danube3-500M, trained on 4T tokens. Our models are pre-trained on high quality Web data consisting of primarily English tokens in three stages with different data mixes before final supervised tuning for chat version. The models exhibit highly competitive metrics across a multitude of academic, chat, and fine-tuning benchmarks. Thanks to its compact architecture, H2O-Danube3 can be efficiently run on a modern smartphone, enabling local inference and rapid processing capabilities even on mobile devices. We make all models openly available under Apache 2.0 license further democratizing LLMs to a wider audience economically.

Summary

  • The paper introduces a series of small language models optimized for consumer hardware and offline environments.
  • It details a staged training process using high-quality data and a decoder-only architecture with Grouped Query Attention.
  • Benchmark results demonstrate competitive performance on chat, fine-tuning, and academic tasks across 4B and 500M parameter models.

Technical Overview of H2O-Danube3: A Series of Small LLMs

The H2O-Danube3 paper introduces a series of small LLMs designed for efficient inference on consumer hardware, including mobile devices. Authored by Pascal Pfeiffer, Philipp Singer, Yauhen Babakhin, Gabor Fodor, Nischay Dhankhar, and Sri Satish Ambati from H2O.ai, the paper provides a comprehensive overview of the models' architecture, training procedures, and benchmarks.

Introduction

H2O-Danube3 addresses the increasing demand for compact, efficient LLMs that can be deployed on edge devices and operate in offline environments. These models aim to balance the trade-off between computational efficiency and performance. The authors build on their previous work in this area, continuing a line of open-source small LLMs tuned for diverse applications such as chatbots, retrieval-augmented generation (RAG), and various fine-tuning tasks.

Model Architecture

The report details two primary models within the H2O-Danube3 family:

  • A 4 billion parameter model trained on 6 trillion tokens
  • A 500 million parameter model trained on 4 trillion tokens

Both models employ a decoder-only architecture inspired by the Llama and Mistral models. Key architectural components include the Mistral tokenizer with a vocabulary of 32,000 tokens, a maximum context length of 8,192 tokens, and Grouped Query Attention, which shares key/value heads across groups of query heads to improve parameter and inference-memory efficiency. The 4 billion parameter model contains 24 layers with a hidden size of 3840, while the 500 million parameter model uses 16 layers with a hidden size of 1536.
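
To make these hyperparameters concrete, the following is a minimal configuration sketch of the 4B variant using a Llama-style config from Hugging Face Transformers. Only the vocabulary size, context length, layer count, and hidden size come from the report; the head counts and MLP width are illustrative placeholders, and the result is a randomly initialized skeleton rather than the released weights.

```python
# Sketch only: reproduces the reported shape of H2O-Danube3-4B; head counts
# and intermediate_size are assumptions, not values from the report.
import torch
from transformers import LlamaConfig, LlamaForCausalLM

danube3_4b_cfg = LlamaConfig(
    vocab_size=32_000,              # Mistral tokenizer vocabulary (reported)
    max_position_embeddings=8_192,  # maximum context length (reported)
    num_hidden_layers=24,           # 4B variant; the 500M variant uses 16
    hidden_size=3840,               # 4B variant; the 500M variant uses 1536
    num_attention_heads=32,         # placeholder, not stated in this summary
    num_key_value_heads=8,          # Grouped Query Attention: fewer KV heads than query heads
    intermediate_size=10_240,       # placeholder MLP width, not stated in this summary
)

# Instantiate on the meta device so no real memory is allocated.
with torch.device("meta"):
    skeleton = LlamaForCausalLM(danube3_4b_cfg)
print(f"~{skeleton.num_parameters() / 1e9:.2f}B parameters")
```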

Training Process

Pre-training proceeds in three stages that gradually increase the proportion of high-quality data while reducing noisy web data: the share of web data drops from 90.6% in the first stage to 81.7% in the second and 51.6% in the third. Additional data sources include Wikipedia, academic texts, and synthetic texts. After the third stage, the models are supervised fine-tuned to produce the chat versions.
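
As a rough illustration of this schedule, the sketch below encodes the three stages with their reported web-data fractions and samples a data source per training document. The set of non-web sources and their uniform split are assumptions for illustration, not the paper's actual mixture weights.

```python
# Sketch only: web-data fractions (90.6%, 81.7%, 51.6%) are from the report;
# the non-web categories and their uniform split are illustrative assumptions.
import random
from collections import Counter

STAGES = [
    {"name": "stage_1", "web_fraction": 0.906},
    {"name": "stage_2", "web_fraction": 0.817},
    {"name": "stage_3", "web_fraction": 0.516},
]
NON_WEB_SOURCES = ["wikipedia", "academic", "synthetic"]

def sample_source(stage: dict) -> str:
    """Pick the data source of the next training document in a given stage."""
    if random.random() < stage["web_fraction"]:
        return "web"
    return random.choice(NON_WEB_SOURCES)

for stage in STAGES:
    counts = Counter(sample_source(stage) for _ in range(100_000))
    mix = {src: round(n / 100_000, 3) for src, n in sorted(counts.items())}
    print(stage["name"], mix)
```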

Evaluation and Results

Academic Benchmarks

The models are evaluated against a variety of benchmarks, demonstrating competitive performance. Specific highlights include:

  • An accuracy of 50.14% on GSM8K, a math-centric benchmark
  • Strong performance on CommonsenseQA and PhysicsQA benchmarks
  • An average score of 68.98 across multiple academic benchmarks

The smaller 500 million parameter model also shows high efficacy, outperforming comparable models such as Qwen2-0.5B-Instruct in eight out of twelve benchmarks.
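
For readers who want to reproduce numbers of this kind, the snippet below is a minimal sketch using EleutherAI's lm-evaluation-harness Python API. The harness, few-shot setting, and the h2oai/h2o-danube3-4b-base model id are assumptions and may not match the paper's exact evaluation protocol, so scores can differ from those reported.

```python
# Sketch only: the harness, task settings, and model id are assumptions and
# may not match the paper's evaluation setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=h2oai/h2o-danube3-4b-base,dtype=bfloat16",  # assumed HF id
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["gsm8k"])
```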

Chat and Fine-Tuning Benchmarks

Chat performance is assessed using MT-Bench and WildBench-v2 benchmarks, with the 4 billion parameter model surpassing models of similar sizes. The models also undergo internal evaluations, including blind voting and RAG benchmarks, to further establish their practical applicability.
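
As a quick way to exercise the chat model locally, the following is a minimal generation sketch with Hugging Face Transformers; the h2oai/h2o-danube3-4b-chat model id, prompt, and decoding settings are assumptions rather than the paper's benchmark configuration.

```python
# Sketch only: model id, prompt, and decoding settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "h2oai/h2o-danube3-4b-chat"  # assumed Hugging Face id of the chat model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user",
             "content": "Explain retrieval-augmented generation in two sentences."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```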

For fine-tuning tasks, H2O-Danube3 demonstrates remarkable flexibility, achieving high accuracy across a range of classification datasets. This underscores the value of these models for use cases requiring specific task tuning.
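
To illustrate the kind of task-specific tuning referred to here, the sketch below adapts the smaller base model into a binary text classifier with the Transformers Trainer. The model id, dataset, and hyperparameters are illustrative assumptions and do not reproduce the paper's fine-tuning setup.

```python
# Sketch only: model id, dataset, and hyperparameters are illustrative assumptions.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "h2oai/h2o-danube3-500m-base"      # assumed Hugging Face id
tok = AutoTokenizer.from_pretrained(model_id)
tok.pad_token = tok.eos_token                 # Llama-style tokenizers ship no pad token

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.config.pad_token_id = tok.pad_token_id

ds = load_dataset("imdb")                     # illustrative classification dataset
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512), batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="danube3-cls", per_device_train_batch_size=8,
                           num_train_epochs=1, learning_rate=2e-5),
    train_dataset=ds["train"].shuffle(seed=0).select(range(2_000)),
    eval_dataset=ds["test"].select(range(1_000)),
    tokenizer=tok,                            # enables dynamic padding of batches
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())
```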

Model Quantization

To facilitate deployment on edge devices, H2O-Danube3 models are also provided in quantized formats. The paper's quantization results show that 4-bit quantization substantially reduces model size with minimal performance loss, whereas 3-bit quantization results in more significant performance degradation.
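
The released quantized artifacts target on-device runtimes; as a rough server-side analogue, the sketch below loads the chat model with generic 4-bit weight quantization through bitsandbytes in Transformers. This is not the quantization pipeline evaluated in the paper, and the model id is an assumption.

```python
# Sketch only: generic 4-bit loading via bitsandbytes, not the paper's
# quantized release; the model id is an assumed Hugging Face id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "h2oai/h2o-danube3-4b-chat"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights; the report finds 4-bit near-lossless
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Name three uses of on-device LLMs."}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(prompt, max_new_tokens=64)[0], skip_special_tokens=True))
```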

Conclusion and Implications

H2O-Danube3 contributes meaningfully to the ecosystem of small LLMs by offering highly efficient models suited to a wide range of applications, from on-device usage to specialized task fine-tuning. The open-source release under the Apache 2.0 license further democratizes access to these models and may spur additional research and development on compact LLMs. Future work could include more extensive hyperparameter sweeps for fine-tuning and exploration of additional quantization methods to broaden deployment across platforms.

Overall, the H2O-Danube3 models represent a robust solution for deploying LLMs in resource-constrained environments, marking a substantial progression in the accessibility and functionality of open-source LLMs.