Overview of SPHINX-X: A Leap in Multi-modal LLMs
In the burgeoning field of AI, the integration of multi-modality into LLMs presents a remarkable frontier. The SPHINX-X paper introduces a significant advancement in this space: a family of Multi-modal LLMs (MLLMs) designed to handle diverse visual and textual inputs while also improving training efficiency and scaling across model sizes.
Innovations in SPHINX-X Architecture
At the core of SPHINX-X's design are several modifications aimed at improving the model's performance and efficiency. Firstly, the elimination of redundant visual encoders streamlines the architecture: of the four visual encoders used in the original SPHINX, only two are retained, a deliberate trade-off that cuts visual encoding cost while preserving complementary visual semantics.
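To make the two-encoder design concrete, here is a minimal PyTorch sketch of how per-patch features from two visual encoders can be concatenated and projected into the LLM's embedding space. The encoder modules, feature dimensions, and projection are illustrative placeholders rather than the paper's exact implementation, and the sketch assumes both encoders emit the same number of patch features.

```python
import torch
import torch.nn as nn

class TwoEncoderVisualMixer(nn.Module):
    """Illustrative fusion of two visual encoders into LLM-ready visual tokens."""

    def __init__(self, enc_a: nn.Module, enc_b: nn.Module,
                 dim_a: int, dim_b: int, llm_dim: int):
        super().__init__()
        self.enc_a = enc_a  # first visual encoder (typically frozen)
        self.enc_b = enc_b  # second visual encoder
        # Project the concatenated per-patch features into the LLM token space.
        self.proj = nn.Linear(dim_a + dim_b, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Each encoder is assumed to return (batch, n_patches, dim_*) features.
        feats_a = self.enc_a(image)
        feats_b = self.enc_b(image)
        fused = torch.cat([feats_a, feats_b], dim=-1)  # (batch, n_patches, dim_a + dim_b)
        return self.proj(fused)                         # visual tokens for the LLM

if __name__ == "__main__":
    class DummyEncoder(nn.Module):
        """Stand-in encoder that emits fake (batch, 16, dim) patch features."""
        def __init__(self, dim: int):
            super().__init__()
            self.dim = dim
        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.randn(x.shape[0], 16, self.dim)

    mixer = TwoEncoderVisualMixer(DummyEncoder(1024), DummyEncoder(768), 1024, 768, 4096)
    tokens = mixer(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 16, 4096])
```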
Moreover, the introduction of learnable skip tokens addresses a common efficiency bottleneck in handling high-resolution images. When an image is padded and split into sub-images, some sub-images contain nothing but padding; instead of encoding them into a full grid of visual tokens, SPHINX-X replaces each fully-padded sub-image with a single learnable skip token, shortening the sequence the LLM must process and avoiding wasted computation on zero-value pixels.
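The sketch below illustrates the skip-token idea under simplified assumptions: sub-images arrive as a fixed grid per sample, a fully-padded sub-image is detected by an all-zeros check, and the encoder is a placeholder module. The actual detection logic and token layout in SPHINX-X may differ.

```python
import torch
import torch.nn as nn

class SubImageTokenizer(nn.Module):
    """Replace the visual tokens of fully-padded sub-images with one learnable skip token.

    Assumes sub_images has shape (batch, n_sub, C, H, W) after padding a high-resolution
    image to a fixed grid, and that encoder(x) returns (1, n_tokens, dim) for one sub-image.
    """

    def __init__(self, encoder: nn.Module, dim: int):
        super().__init__()
        self.encoder = encoder
        self.skip_token = nn.Parameter(torch.zeros(1, dim))  # learnable skip token

    def forward(self, sub_images: torch.Tensor) -> list[torch.Tensor]:
        batch, n_sub = sub_images.shape[:2]
        out = []
        for b in range(batch):
            tokens = []
            for s in range(n_sub):
                sub = sub_images[b, s]
                if torch.all(sub == 0):
                    # Fully-padded sub-image: emit a single skip token instead of a
                    # full grid of visual tokens, shortening the LLM input sequence.
                    tokens.append(self.skip_token)
                else:
                    tokens.append(self.encoder(sub.unsqueeze(0)).squeeze(0))  # (n_tokens, dim)
            out.append(torch.cat(tokens, dim=0))  # variable-length token sequence per sample
        return out
```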
Perhaps the most significant change to the training recipe is the shift from a multi-stage pipeline to a unified single-stage paradigm. Rather than freezing and unfreezing different components across stages, all trainable parameters are optimized jointly over the full data mixture, which simplifies the training process and removes the bookkeeping of stage-specific schedules.
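A short training-loop sketch can make the single-stage idea concrete: all data sources are mixed into one loader and a single optimizer updates every trainable parameter jointly. The datasets, model, and optimizer settings below are toy stand-ins, not the paper's configuration.

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for the multi-domain mixture (captioning, VQA, OCR-intensive, Set-of-Mark, ...).
# Each dummy source yields (visual_tokens, target) pairs of identical shape for simplicity.
def dummy_source(n: int) -> TensorDataset:
    return TensorDataset(torch.randn(n, 8, 32), torch.randint(0, 100, (n, 8)))

mixed_data = ConcatDataset([dummy_source(64), dummy_source(64), dummy_source(64)])
loader = DataLoader(mixed_data, batch_size=4, shuffle=True)

# Toy "model"; in SPHINX-X this would be the visual projector plus the LLM.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 100))

# Single-stage: one optimizer over *all* trainable parameters and one pass over the
# full mixture, instead of separate stages that freeze and unfreeze components.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for visual_tokens, target in loader:
    logits = model(visual_tokens)                                  # (batch, 8, 100)
    loss = loss_fn(logits.reshape(-1, 100), target.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```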
Datasets and Model Scaling
A remarkable contribution of SPHINX-X is its comprehensive multi-domain, multi-modal dataset. Beyond spanning a broad spectrum of tasks, it includes specially curated OCR-intensive and Set-of-Mark subsets, which strengthen the model's ability to read dense text embedded in images and to reason about regions that are explicitly marked in the input.
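To illustrate what such samples might look like, here is a purely hypothetical sketch of how an OCR-intensive example and a Set-of-Mark example could be expressed as instruction-following conversations. The field names, file names, and wording are invented for illustration and are not the dataset's actual schema.

```python
# Hypothetical training samples; keys and phrasing are illustrative only.
ocr_sample = {
    "image": "document_page.png",
    "conversation": [
        {"role": "user", "content": "Transcribe all text visible in this page."},
        {"role": "assistant", "content": "Quarterly Report\n1. Revenue grew 12%..."},
    ],
}

set_of_mark_sample = {
    "image": "street_scene_with_numbered_marks.png",  # objects overlaid with numeric marks
    "conversation": [
        {"role": "user", "content": "Which marked object is closest to mark [3]?"},
        {"role": "assistant", "content": "The bicycle at mark [5] is closest to mark [3]."},
    ],
}
```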
Scaling the parameters of the underlying LLM presents both challenges and opportunities. SPHINX-X addresses this by applying the same recipe to base LLMs of different sizes, ranging from TinyLlama-1.1B, suited to fast, resource-constrained deployment, to Mixtral-8×7B, aimed at complex reasoning tasks. This range makes SPHINX-X less a one-size-fits-all model than a family that can be matched to the demands of a given application.
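As a small configuration sketch, one might pick a SPHINX-X variant by parameter budget roughly as follows. Only the two endpoints named above come from the text; the selection rule and threshold are invented for illustration.

```python
# Illustrative mapping from a parameter budget to a SPHINX-X base LLM.
# Only TinyLlama-1.1B and Mixtral-8x7B are named in the text; the threshold is invented.
def pick_backbone(max_params_billion: float) -> str:
    if max_params_billion < 2:
        return "TinyLlama-1.1B"  # fast, mobile-friendly deployment
    return "Mixtral-8x7B"        # sparse mixture-of-experts backbone for complex reasoning

print(pick_backbone(1.5))   # TinyLlama-1.1B
print(pick_backbone(50.0))  # Mixtral-8x7B
```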
Benchmarking Excellence
The evaluation of SPHINX-X across a variety of benchmark tests reveals its superior performance in areas such as detailed captioning, visual question answering, and document layout detection. Particularly noteworthy is its capability to outperform existing video-based models despite being fundamentally an image-based MLLM. This underscores SPHINX-X's remarkable ability to generalize and apply its understanding across different modalities.
Conclusion
SPHINX-X represents a significant stride forward in the domain of multi-modal LLMs. Its architectural refinements, comprehensive dataset, and strategic parameter scaling collectively propel it to the forefront of AI research. As we continue to unravel the complexities of multi-modality in AI, SPHINX-X serves as a beacon of innovation, efficiency, and versatility.