SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Published 8 Feb 2024 in cs.CV, cs.AI, cs.CL, and cs.LG | arXiv:2402.05935v3

Abstract: We propose SPHINX-X, an extensive Multi-modal LLM (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs, including TinyLlama-1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral-8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.

Summary

  • The paper achieves significant efficiency gains by removing redundant visual encoders and employing learnable skip tokens to speed up high-resolution image processing.
  • It streamlines training by replacing the multi-stage pipeline with a unified single-stage paradigm, optimizing all parameters uniformly across extensive multi-domain datasets.
  • Benchmark tests show that SPHINX-X excels in detailed captioning, visual question answering, and document layout detection across various model scales.

Overview of SPHINX-X: A Leap in Multi-modal LLMs

In the burgeoning field of AI, integrating multi-modality into LLMs is a key frontier. The paper on SPHINX-X introduces a significant advancement in this space: a series of Multi-modal LLMs (MLLMs) that embrace the complexity of multi-modality while also optimizing model efficiency and scaling.

Innovations in SPHINX-X Architecture

At the core of SPHINX-X's design are several noteworthy modifications aimed at improving the model's performance and efficiency. First, redundant visual encoders are eliminated, streamlining the architecture: of the four visual encoders in the original SPHINX, only two are retained, a deliberate trade-off between computational efficiency and the richness of visual semantics.
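As a concrete illustration, here is a minimal sketch of fusing two visual encoders by channel-wise concatenation and projecting the result into the LLM's embedding space. The class, the stand-in encoders, and all dimensions are illustrative assumptions, not the released SPHINX-X code.

```python
import torch
import torch.nn as nn

class TwoEncoderVisualBackbone(nn.Module):
    """Sketch: fuse two visual encoders by channel-wise concatenation,
    then project the fused patch tokens into the LLM embedding space.
    Encoder choices and dimensions below are assumptions."""

    def __init__(self, enc_a: nn.Module, enc_b: nn.Module,
                 dim_a: int, dim_b: int, llm_dim: int):
        super().__init__()
        self.enc_a = enc_a  # e.g. a DINOv2-style encoder (assumption)
        self.enc_b = enc_b  # e.g. a CLIP-ConvNeXt-style encoder (assumption)
        self.proj = nn.Linear(dim_a + dim_b, llm_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats_a = self.enc_a(images)                   # (B, N, dim_a)
        feats_b = self.enc_b(images)                   # (B, N, dim_b)
        fused = torch.cat([feats_a, feats_b], dim=-1)  # (B, N, dim_a+dim_b)
        return self.proj(fused)                        # (B, N, llm_dim)

# Toy usage with fake encoders that emit random patch tokens.
class FakeEnc(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.randn(x.shape[0], 16, self.dim)

backbone = TwoEncoderVisualBackbone(FakeEnc(1024), FakeEnc(768), 1024, 768, 4096)
print(backbone(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 16, 4096])
```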

Moreover, the introduction of learnable skip tokens to bypass fully-padded sub-images addresses a common efficiency bottleneck in handling high-resolution images. Because high-resolution inputs are padded and split into sub-images, some sub-images contain only zero-value pixels; replacing their token sequences with a single skip token avoids unnecessary computation and reduces processing time.
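To make the mechanism concrete, the sketch below detects fully zero-padded crops and emits one learnable token for each instead of a full patch sequence. The class name, shapes, and the all-zeros padding test are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SkipTokenPacker(nn.Module):
    """Sketch of the skip-token idea: a sub-image that is pure padding
    contributes a single learnable token rather than N patch tokens."""

    def __init__(self, llm_dim: int):
        super().__init__()
        self.skip_token = nn.Parameter(torch.randn(1, llm_dim))

    def forward(self, sub_images: torch.Tensor, encode) -> list[torch.Tensor]:
        # sub_images: (S, C, H, W), one entry per crop of the padded canvas
        out = []
        for crop in sub_images:
            if torch.all(crop == 0):         # fully zero-padded crop (assumption)
                out.append(self.skip_token)  # (1, llm_dim): one token
            else:
                out.append(encode(crop.unsqueeze(0)).squeeze(0))  # (N, llm_dim)
        return out  # concatenated along the sequence dimension downstream

# Toy usage: four crops, the last two fully padded.
packer = SkipTokenPacker(llm_dim=32)
crops = torch.zeros(4, 3, 8, 8)
crops[:2] = torch.randn(2, 3, 8, 8)
fake_encode = lambda x: torch.randn(x.shape[0], 16, 32)  # stand-in encoder
print([seq.shape[0] for seq in packer(crops, fake_encode)])  # [16, 16, 1, 1]
```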

Perhaps the most significant architectural shift is the transition from a multi-stage training approach to a unified single-stage paradigm. This simplification not only streamlines the training process but also ensures that the model's parameters are uniformly optimized across the entire dataset.
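In code, the single-stage idea amounts to marking every parameter trainable from the first step and iterating over one unified data mix, rather than freezing and unfreezing modules across stages. The toy model and random batches below are placeholders standing in for the real MLLM and dataset.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for visual encoders + projection + LLM.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
for p in model.parameters():
    p.requires_grad = True  # all-in-one: no stage-wise freezing schedule

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for step in range(3):  # one unified pass over the multi-domain mix
    x = torch.randn(8, 64)               # placeholder batch
    target = torch.randint(0, 10, (8,))  # placeholder labels
    loss = nn.functional.cross_entropy(model(x), target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.3f}")
```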

Datasets and Model Scaling

A notable contribution of SPHINX-X is its comprehensive multi-domain and multi-modal dataset. The collection spans a broad spectrum of publicly available language, vision, and vision-language tasks, and is augmented with specially curated OCR-intensive and Set-of-Mark datasets. These additions broaden SPHINX-X's coverage of text-rich documents and fine-grained, region-level visual understanding.
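One common way to train on such a collection is weighted sampling across sources; the sketch below illustrates that pattern. The source names mirror the categories described above, but the weights and samples are invented for illustration and do not reflect the paper's actual mixture ratios.

```python
import random

# Hypothetical multi-domain mix; weights are illustrative, not the paper's.
sources = {
    "vision_language": ["<caption sample>", "<vqa sample>"],
    "ocr_intensive":   ["<document-text sample>"],
    "set_of_mark":     ["<marked-region sample>"],
    "language_only":   ["<text sample>"],
}
weights = {"vision_language": 0.5, "ocr_intensive": 0.2,
           "set_of_mark": 0.1, "language_only": 0.2}

def sample_batch(batch_size: int) -> list[str]:
    names = list(sources)
    probs = [weights[n] for n in names]
    picks = random.choices(names, weights=probs, k=batch_size)
    return [random.choice(sources[name]) for name in picks]

print(sample_batch(4))
```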

Scaling the parameters of LLMs presents a unique set of challenges and opportunities. SPHINX-X addresses this by offering models across a spectrum of parameter sizes, from TinyLlama-1.1B, suitable for fast mobile deployment, to Mixtral-8×7B, designed for complex reasoning tasks. This range of sizes makes SPHINX-X not a one-size-fits-all solution but a versatile toolkit for a wide range of applications.
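For reference, the family can be summarized as a small configuration table. The base models and sizes come from the abstract; the deployment notes paraphrase the framing above and are editorial shorthand, not fields from the released code.

```python
# Base LLMs named in the paper; "role" notes are editorial shorthand.
SPHINX_X_VARIANTS = {
    "TinyLlama-1.1B": {"params": "1.1B",     "role": "fast mobile deployment"},
    "InternLM2-7B":   {"params": "7B",       "role": "multilingual use"},
    "LLaMA2-13B":     {"params": "13B",      "role": "general-purpose MLLM"},
    "Mixtral-8x7B":   {"params": "8x7B MoE", "role": "complex reasoning tasks"},
}
print(SPHINX_X_VARIANTS["Mixtral-8x7B"])
```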

Benchmarking Excellence

The evaluation of SPHINX-X across a variety of benchmarks reveals superior performance in areas such as detailed captioning, visual question answering, and document layout detection. Particularly noteworthy is that it outperforms existing video-based models despite being fundamentally an image-based MLLM, underscoring its ability to generalize across modalities.

Conclusion

SPHINX-X represents a significant stride forward in the domain of multi-modal LLMs. Its architectural refinements, comprehensive dataset, and strategic parameter scaling collectively place it at the forefront of AI research. As the complexities of multi-modality in AI continue to be explored, SPHINX-X stands out as a model family built for innovation, efficiency, and versatility.
