Technical Analysis of xGen-small: A Compact Transformer Family Optimized for Long-Context Applications
The research paper introduces xGen-small, a family of compact LLMs designed to address the challenges of deploying large-scale models in enterprise environments while optimizing for long-context applications. The models take a pragmatic approach, balancing performance, context comprehension, and deployment efficiency without resorting to excessively large architectures.
Model Architecture and Training Pipeline
The xGen-small series comprises two variants, with 4 billion and 9 billion parameters. Both are Transformer decoders trained through a vertically integrated pipeline with several critical components:
- Domain-Balanced, Frequency-Aware Data Curation: This phase applies rigorous filtering, deduplication, and targeted up-sampling to produce a well-rounded, high-quality dataset, so the models retain breadth and depth while learning to handle extended documents (a toy up-sampling sketch follows this list).
- Multistage Pre-Training: The pre-training regime applies distributional sharpening and quality annealing: early stages emphasize broad content coverage, later stages home in on high-quality subsets, and a stable learning-rate schedule supports the transition. This staging improves performance across a wide range of tasks (an illustrative schedule follows this list).
- Context Length Extension: A two-stage extension recipe, built on Rotary Position Embeddings and sequence parallelism, raises the supported context length to 128k tokens. The models remain efficient at very long input lengths, where comparable models often degrade (a rotary-base rescaling sketch follows this list).
- Targeted Post-Training: Comprising supervised fine-tuning, preference learning, and online reinforcement learning, the post-training stage refines alignment, reasoning, and domain expertise so that outputs stay helpful, harmless, and well reasoned (a preference-learning sketch follows this list).
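As a concrete illustration of frequency-aware curation, the toy sketch below reweights a data mixture so that under-represented domains are up-sampled. The temperature-based scheme and the domain/token counts are assumptions for illustration; the paper's actual curation pipeline is not reproduced here.

```python
# Toy sketch of temperature-based, domain-balanced up-sampling. The weighting
# scheme and token counts are illustrative assumptions, not the paper's pipeline.
def domain_sampling_weights(token_counts: dict, temperature: float = 0.7) -> dict:
    """Flatten the raw domain distribution so under-represented domains are
    sampled more often and dominant domains are damped (temperature < 1)."""
    total = sum(token_counts.values())
    raw = {d: c / total for d, c in token_counts.items()}     # empirical frequencies
    tempered = {d: p ** temperature for d, p in raw.items()}  # temperature flattens the distribution
    norm = sum(tempered.values())
    return {d: w / norm for d, w in tempered.items()}         # renormalize to a probability distribution

if __name__ == "__main__":
    counts = {"web": 8_000, "code": 1_500, "math": 300, "long_documents": 200}  # toy token counts
    for domain, weight in domain_sampling_weights(counts).items():
        print(f"{domain:>15}: {weight:.3f}")
```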
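The next sketch shows one way a multistage regime can be expressed: a warmup/stable/decay learning-rate shape combined with a data mixture that anneals toward high-quality subsets in the final stage. The shape, step counts, and mixture fractions are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a multistage schedule: the learning rate warms up, holds,
# then anneals, while the data mixture shifts toward high-quality subsets.
# All constants here are assumptions for illustration.
import math

def learning_rate(step: int, total: int, peak: float = 3e-4,
                  warmup: int = 2_000, decay_frac: float = 0.2) -> float:
    decay_start = int(total * (1.0 - decay_frac))
    if step < warmup:                              # linear warmup
        return peak * step / warmup
    if step < decay_start:                         # stable phase at peak LR
        return peak
    # cosine anneal over the final decay_frac of training
    progress = (step - decay_start) / max(1, total - decay_start)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

def high_quality_fraction(step: int, total: int) -> float:
    """Share of batches drawn from curated high-quality subsets,
    ramping up in the final stage (illustrative numbers only)."""
    return 0.2 if step < int(total * 0.8) else 0.8

if __name__ == "__main__":
    total_steps = 100_000
    for s in (0, 1_000, 50_000, 85_000, 99_999):
        print(s, f"lr={learning_rate(s, total_steps):.2e}",
              f"hq_frac={high_quality_fraction(s, total_steps):.1f}")
```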
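A common ingredient of RoPE-based context extension is rescaling the rotary base so that positional wavelengths cover the longer target window. The sketch below shows that mechanism in isolation; the specific base values and the details of xGen-small's two-stage recipe are assumptions here.

```python
# Sketch of RoPE base-frequency ("theta") rescaling for context extension.
# A larger base stretches rotary wavelengths so positions toward 128k tokens
# remain distinguishable; the base values below are illustrative only.
import numpy as np

def rope_angles(positions: np.ndarray, head_dim: int, base: float) -> np.ndarray:
    """Per-position, per-frequency rotation angles used by rotary embeddings."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))  # (head_dim/2,)
    return np.outer(positions, inv_freq)                               # (seq, head_dim/2)

def apply_rope(x: np.ndarray, base: float) -> np.ndarray:
    """Rotate channel pairs of x (seq, head_dim) by position-dependent angles."""
    seq, head_dim = x.shape
    angles = rope_angles(np.arange(seq), head_dim, base)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

if __name__ == "__main__":
    q = np.random.randn(8, 64)
    short_ctx = apply_rope(q, base=10_000.0)      # typical pre-training base
    long_ctx = apply_rope(q, base=1_000_000.0)    # larger base for long-context extension (illustrative)
    print(short_ctx.shape, long_ctx.shape)
```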
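The summary names preference learning without specifying the algorithm; Direct Preference Optimization (DPO) is one widely used instantiation, sketched below as a minimal loss over per-sequence log-probabilities. Treat it as an illustrative stand-in rather than the paper's exact method.

```python
# Minimal DPO-style preference loss, shown as one common form of preference
# learning; not necessarily the algorithm used for xGen-small post-training.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from sequence log-probs under the policy and a frozen reference."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Encourage the policy to prefer chosen responses more strongly than the reference does.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    batch = 4  # batch of preference pairs with random log-probs for demonstration
    loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                    torch.randn(batch), torch.randn(batch))
    print(float(loss))
```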
Benchmark Performance
The xGen-small models are evaluated across mathematics (GSM8K, MATH), coding (HumanEval, MBPP), and general-reasoning suites (MMLU, ARC-Challenge). They perform competitively with larger counterparts while keeping deployment cost and energy consumption lower. Specific highlights include:
- Strong results on mathematics and coding benchmarks, with clear score improvements over similarly sized models.
- Robust performance across context lengths, with minimal degradation at 128k tokens, reflecting the effectiveness of the context-extension strategy (a toy long-context retrieval check is sketched below).
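To make the long-context claim concrete, the toy check below inserts a "needle" fact at several depths of a synthetic haystack and tests whether a model retrieves it. This is not the paper's evaluation suite; `generate` is a placeholder for any inference call, shown here with a trivial stub.

```python
# Toy needle-in-a-haystack check for long-context degradation. Everything here
# is an illustrative assumption; "generate" stands in for a real model call.
def build_haystack(needle: str, filler_sentences: int, insert_at: float) -> str:
    filler = "The sky was clear and the market opened without incident. "
    parts = [filler] * filler_sentences
    parts.insert(int(filler_sentences * insert_at), needle + " ")
    return "".join(parts) + "\nQuestion: What is the secret code? Answer:"

def needle_recall(generate, depths=(0.1, 0.5, 0.9), filler_sentences=5_000) -> float:
    """Fraction of insertion depths at which the model surfaces the needle."""
    needle = "The secret code is 7421."
    hits = 0
    for depth in depths:
        prompt = build_haystack(needle, filler_sentences, depth)
        if "7421" in generate(prompt):
            hits += 1
    return hits / len(depths)

if __name__ == "__main__":
    # Stub "model" that just searches the prompt; replace with a real inference call.
    stub = lambda prompt: "7421" if "7421" in prompt else "unknown"
    print(needle_recall(stub))
```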
Implications and Future Directions
This paper highlights the practicality of smaller-scale models engineered with close attention to data quality and processing efficiency. Because of their compact deployment footprint, xGen-small models are well suited to enterprise applications where latency, cost, and privacy are paramount.
The work paves the way for further exploration into shrinking model size while preserving long-context capability. Future developments could include adaptive data-curation techniques and improved position-embedding methods to refine long-context understanding further, along with broader multi-domain and real-time applications that would make such compact LLMs even more versatile.
In conclusion, xGen-small is a compelling model family whose careful data curation, staged training, and targeted post-training overcome the usual trade-off between model size and context length, making it a versatile option for fields that demand high context comprehension and efficient resource utilization.