SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models (2402.05935v2)

Published 8 Feb 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: We propose SPHINX-X, an extensive Multimodality LLM (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory

Overview of SPHINX-X: A Leap in Multi-modal LLMs

In the burgeoning field of AI, the integration of multi-modality into LLMs presents a remarkable frontier. The SPHINX-X paper introduces a significant advancement in this space: a series of Multi-modal LLMs (MLLMs) that embraces the complexity of multi-modality while also improving model efficiency and scaling.

Innovations in SPHINX-X Architecture

At the core of SPHINX-X's design are several modifications aimed at improving performance and efficiency. First, redundant visual encoders are removed: only two of the original four are retained, a decision that balances computational efficiency with the richness of visual semantics.
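
A minimal PyTorch sketch of this two-encoder frontend follows. Module names, dimensions, and the frozen-backbone assumption are illustrative, not the paper's code: the two backbones produce patch features that are concatenated channel-wise and projected into the LLM's token space.

```python
import torch
import torch.nn as nn

class TwoEncoderVisualFrontend(nn.Module):
    """Illustrative sketch: fuse two frozen visual backbones by channel-wise
    concatenation, then project the result into the LLM embedding space."""

    def __init__(self, encoder_a: nn.Module, encoder_b: nn.Module,
                 dim_a: int, dim_b: int, llm_dim: int):
        super().__init__()
        self.encoder_a = encoder_a          # e.g. a DINOv2-style backbone
        self.encoder_b = encoder_b          # e.g. a CLIP-ConvNeXt-style backbone
        self.proj = nn.Linear(dim_a + dim_b, llm_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():               # backbones kept frozen in this sketch
            feats_a = self.encoder_a(images)   # (B, N, dim_a) patch tokens
            feats_b = self.encoder_b(images)   # (B, N, dim_b) patch tokens
        fused = torch.cat([feats_a, feats_b], dim=-1)
        return self.proj(fused)             # (B, N, llm_dim) visual tokens for the LLM
```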

Moreover, the introduction of learnable skip tokens to bypass fully-padded sub-images addresses a common efficiency bottleneck in handling high-resolution images. This approach significantly reduces unnecessary computations for sub-images filled with zero-value pixels, thus optimizing processing time.
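
A hedged sketch of the skip-token idea (names and shapes are assumptions for illustration): when a high-resolution image is tiled into sub-images, any tile that is entirely padding contributes a single learnable token instead of a full grid of visual tokens.

```python
import torch
import torch.nn as nn

class SkipTokenPacker(nn.Module):
    """Illustrative sketch: replace the token grid of fully-padded sub-images
    with one learnable skip token before feeding the sequence to the LLM."""

    def __init__(self, llm_dim: int):
        super().__init__()
        self.skip_token = nn.Parameter(torch.randn(1, llm_dim))

    def forward(self, sub_image_tokens, fully_padded_mask):
        # sub_image_tokens: list of (N, llm_dim) tensors, one per sub-image
        # fully_padded_mask: list of bools, True when a sub-image is all padding
        packed = [
            self.skip_token if padded else tokens
            for tokens, padded in zip(sub_image_tokens, fully_padded_mask)
        ]
        return torch.cat(packed, dim=0)      # compact visual sequence for the LLM
```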

Perhaps the most significant architectural shift is the transition from a multi-stage training approach to a unified single-stage paradigm. This simplification not only streamlines the training process but also ensures that the model's parameters are uniformly optimized across the entire dataset.
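
The difference can be summarized with a short training-loop sketch, assuming a HuggingFace-style model interface (the loader and loss access are illustrative): instead of freezing and unfreezing modules across stages, all parameters are optimized jointly over the full data mixture.

```python
import torch

def train_one_stage(model: torch.nn.Module, mixed_loader, lr: float = 2e-5):
    """Illustrative one-stage training loop: visual encoders, projections, and
    the base LLM are all optimized together on the full data mixture, with no
    stage-wise freezing or unfreezing."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # every parameter is trainable
    model.train()
    for batch in mixed_loader:                 # one mixture spanning all data domains
        loss = model(**batch).loss             # assumes an output object exposing .loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```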

Datasets and Model Scaling

A notable contribution of SPHINX-X is its comprehensive multi-domain, multi-modal dataset. Beyond spanning a broad spectrum of publicly available language, vision, and vision-language tasks, it includes specially curated OCR-intensive and Set-of-Mark datasets, which extend the diversity and generality of the training data.

Scaling the parameters of LLMs presents a unique set of challenges and opportunities. SPHINX-X addresses this by offering models across a spectrum of parameter sizes, from TinyLlama-1.1B, suited to fast mobile deployment, to Mixtral-8×7B, aimed at complex reasoning tasks. This range of model sizes makes SPHINX-X not a one-size-fits-all solution but a versatile toolkit capable of tackling a wide range of applications.
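
For orientation, the base LLMs named in the abstract can be summarized as below; the deployment notes simply paraphrase the discussion above and are not an official naming or API.

```python
# Illustrative summary of the SPHINX-X model spectrum (base LLMs from the abstract;
# the usage notes are this overview's paraphrase, not the paper's terminology).
SPHINX_X_BASE_LLMS = {
    "TinyLlama-1.1B": "lightweight variant for fast, mobile-friendly deployment",
    "InternLM2-7B":   "mid-size, multilingual variant",
    "LLaMA2-13B":     "larger general-purpose variant",
    "Mixtral-8x7B":   "mixture-of-experts variant aimed at complex reasoning",
}
```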

Benchmarking Excellence

The evaluation of SPHINX-X across a variety of benchmark tests reveals its superior performance in areas such as detailed captioning, visual question answering, and document layout detection. Particularly noteworthy is its capability to outperform existing video-based models despite being fundamentally an image-based MLLM. This underscores SPHINX-X's remarkable ability to generalize and apply its understanding across different modalities.

Conclusion

SPHINX-X represents a significant stride forward in the domain of multi-modal LLMs. Its architectural refinements, comprehensive dataset, and strategic parameter scaling collectively propel it to the forefront of AI research. As we continue to unravel the complexities of multi-modality in AI, SPHINX-X serves as a beacon of innovation, efficiency, and versatility.

Authors (19)
  1. Peng Gao (401 papers)
  2. Renrui Zhang (100 papers)
  3. Longtian Qiu (9 papers)
  4. Siyuan Huang (123 papers)
  5. Weifeng Lin (15 papers)
  6. Shitian Zhao (12 papers)
  7. Shijie Geng (33 papers)
  8. Ziyi Lin (12 papers)
  9. Peng Jin (91 papers)
  10. Kaipeng Zhang (73 papers)
  11. Wenqi Shao (89 papers)
  12. Chao Xu (283 papers)
  13. Conghui He (114 papers)
  14. Junjun He (77 papers)
  15. Hao Shao (25 papers)
  16. Pan Lu (42 papers)
  17. Hongsheng Li (340 papers)
  18. Yu Qiao (563 papers)
  19. Dongyang Liu (14 papers)
Citations (84)