Introduction
Large language models (LLMs) have been evolving rapidly, with significant strides in text generation, translation, and other language understanding tasks. Their progression has largely followed scaling laws suggesting that increases in parameters, dataset size, and computational resources deliver improved performance. At the same time, practical applications increasingly emphasize smaller, finely tuned models that operate within specified data and compute budgets. Against this backdrop, H2O-Danube-1.8B balances model size with computational efficiency, achieving strong results despite being trained on less data than its counterparts and without dedicated coding data.
Model Architecture and Training
The design of H2O-Danube-1.8B builds on established LLM architecture techniques, leveraging a sliding-window approach for efficient local attention and Rotary Positional Embedding (RoPE) for modeling sequence dependencies. Grouped-query attention is employed to decrease memory bandwidth overhead, yielding a model of approximately 1.8 billion parameters. The training regimen increases the sequence length in stages, culminating in a final context length of 16,384 tokens. Notably, training made use of FP8 calculations and Advanced Vector Extensions, a precision strategy chosen to maintain training stability and throughput, and sustained an average of 292.7k tokens/s, underscoring both the efficiency and potential scalability of the model.
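To make these architectural ingredients concrete, the following Python sketch (not the authors' code; all sizes are illustrative toy values) shows how a sliding-window causal mask restricts each token to a local neighborhood and how grouped-query attention lets several query heads share one key/value head to cut memory bandwidth. RoPE and other details are omitted for brevity.

```python
# Minimal sketch of sliding-window causal attention combined with
# grouped-query attention (GQA). Illustrative only; sizes are not the
# model's actual configuration.
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: query i may attend to key j iff i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

def grouped_query_attention(q, k, v, mask):
    """q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d), n_q_heads % n_kv_heads == 0."""
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv
    # Each group of query heads shares one key/value head, reducing KV traffic.
    k = k.repeat_interleave(group, dim=0)
    v = v.repeat_interleave(group, dim=0)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage with illustrative sizes.
T, d, n_q_heads, n_kv_heads, window = 16, 8, 4, 2, 6
q = torch.randn(n_q_heads, T, d)
k = torch.randn(n_kv_heads, T, d)
v = torch.randn(n_kv_heads, T, d)
out = grouped_query_attention(q, k, v, sliding_window_causal_mask(T, window))
print(out.shape)  # (4, 16, 8)
```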
Empirical Results
Empirically, H2O-Danube-1.8B is validated across several benchmarks against similarly sized models such as TinyLlama, Falcon, OPT, Qwen, and Stable LM 2. The results are commendable: H2O-Danube-1.8B outperforms Qwen on all benchmarks except BoolQ, despite being trained on 2.2 times fewer tokens. Compared with Stable LM 2, which was trained on a significantly larger corpus, H2O-Danube-1.8B still holds its ground, though it shows slightly lower performance on most benchmarks. Importantly, the model is released under the Apache 2.0 license, offering flexibility for commercial use and enabling further research and fine-tuning within the open-source community.
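For readers who want to try the released checkpoint, the snippet below is a minimal, hedged example of loading it with the Hugging Face transformers library. The repository id is assumed from the public release naming and should be verified against the official model card.

```python
# Hedged usage sketch: loading the base model with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "h2oai/h2o-danube-1.8b-base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("The Danube is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```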
Chat Fine-Tuning
Beyond the base model, a specialized chat variant named H2O-Danube-1.8B-Chat is provided, trained with supervised fine-tuning followed by direct preference optimization (DPO) to improve chatbot performance. When assessed on MT-Bench, which focuses predominantly on natural language tasks, H2O-Danube-1.8B-Chat shows superior results for single-turn conversations and robust performance across the benchmark's categories. The fine-tuned variant keeps pace with similar models on commonsense reasoning, world knowledge, and reading comprehension benchmarks, though it falls short on specialized math reasoning benchmarks, possibly due to the lack of targeted training data in those domains.
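As a reference for the preference-optimization step, the sketch below gives a generic form of the DPO objective in PyTorch. It is not the authors' training code; the β value and the assumption that log-probabilities are summed over response tokens are illustrative choices.

```python
# Hedged sketch of the direct preference optimization (DPO) objective.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Each argument: tensor of shape (batch,) holding sequence log-probs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Preferred responses should score higher than rejected ones under the policy.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In this formulation the frozen reference model anchors the policy, so the chat model learns to prefer the chosen responses without drifting far from the supervised fine-tuned starting point.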
Conclusion
The release of H2O-Danube-1.8B represents a significant contribution to the field of LLMs, particularly among permissively licensed open-source models suitable for commercial use and private adaptation. Its efficiency despite a smaller training corpus signals the potential for LLMs to be democratized and run economically on consumer hardware. It upholds the promise of capable, scalable LLMs that perform effectively across diverse tasks, setting a precedent for future efforts to provide accessible, high-quality foundation models for widespread use.