Introduction to Large Vision Models
Large Vision Models (LVMs) represent a conceptual leap in computer vision, drawing inspiration from the success of LLMs such as GPT and LLaMA. Unlike LLMs, which learn from vast amounts of linguistic data, LVMs are designed to learn directly from pixels. This opens up a range of computer vision applications that require no language-based inputs at all. The core idea behind LVMs is to capitalize on the enormous quantities of labeled and unlabeled visual data available, training models that can understand and generate visual content at a high level of abstraction.
The Essence of Visual Sentences
A cornerstone of the LVM approach is the concept of "visual sentences." Raw images, videos, and various forms of annotated data are converted into a common sequence format that the model can consume without any meta-knowledge beyond the pixels themselves. This format allows diverse image types and annotations to be integrated seamlessly into a single training stream. The paper introduces a model trained on a dataset of 1.64 billion images/frames, a scale that demonstrates the versatility of this representation.
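To make the idea concrete, here is a minimal sketch of how a visual sentence might be assembled. It assumes a discrete image tokenizer with a fixed codebook and a fixed per-image token budget; the tokenize_image function, CODEBOOK_SIZE, and TOKENS_PER_IMAGE values below are illustrative placeholders, not the paper's actual tokenizer.

```python
# Sketch: assembling "visual sentences" by mapping each image (raw frame or
# rendered annotation) to a fixed-length sequence of discrete tokens and
# concatenating the sequences. The tokenizer is a stand-in; any image
# tokenizer with a discrete codebook could fill the same role.
import torch
import torch.nn.functional as F

CODEBOOK_SIZE = 8192      # assumed vocabulary of visual tokens
TOKENS_PER_IMAGE = 256    # assumed fixed token budget per image

def tokenize_image(image: torch.Tensor) -> torch.Tensor:
    """Placeholder tokenizer: maps a (3, H, W) image to TOKENS_PER_IMAGE discrete ids.

    In practice this would be a learned encoder with a discrete codebook; here we
    average-pool to a 16x16 grid and bucket the intensities so the example runs.
    """
    grid = F.adaptive_avg_pool2d(image.unsqueeze(0), (16, 16))          # (1, 3, 16, 16)
    ids = (grid.mean(dim=1).flatten() * (CODEBOOK_SIZE - 1)).long()
    return ids.clamp(0, CODEBOOK_SIZE - 1)                              # shape: (TOKENS_PER_IMAGE,)

def visual_sentence(images: list[torch.Tensor]) -> torch.Tensor:
    """Concatenate per-image token sequences into one flat 'visual sentence'."""
    return torch.cat([tokenize_image(img) for img in images])

# Example: a raw frame followed by its rendered annotation becomes one sentence.
frame = torch.rand(3, 224, 224)
annotation = torch.rand(3, 224, 224)   # e.g. a segmentation map drawn as an image
sentence = visual_sentence([frame, annotation])
print(sentence.shape)                  # torch.Size([512])
```

Because every data source is reduced to the same flat stream of token ids, the same sequence model can consume raw frames, video clips, and rendered annotations alike, which is what makes the common format useful.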
Training Large Vision Models
The architecture of the LVM is built upon a large transformer that treats images as sequences of tokens. Just as LLMs learn to predict the next word in a sentence, the LVM is trained to predict the next visual token, with cross-entropy loss as its training objective. By adopting strategies from natural language processing, such as autoregressive prediction, the model can be trained across diverse data and handle a wide variety of visual tasks. At test time, a task is specified by a visual prompt that the model completes, reflecting its understanding of the visual content.
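Below is a minimal PyTorch sketch of this training objective, assuming visual sentences have already been tokenized as above. The TinyLVM class, its hyperparameters, and the random batch are illustrative stand-ins; the point is only the shifted next-token prediction with cross-entropy loss, not the paper's actual architecture or scale.

```python
# Sketch: autoregressive next-token prediction over visual tokens with a
# causal transformer and cross-entropy loss. Sizes are toy values for clarity.
import torch
import torch.nn as nn

class TinyLVM(nn.Module):
    def __init__(self, vocab_size: int = 8192, d_model: int = 256, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Causal mask so position t only attends to positions <= t.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        x = self.embed(tokens)
        x = self.blocks(x, mask=mask)
        return self.head(x)            # (batch, seq_len, vocab_size)

model = TinyLVM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# One training step on a batch of visual sentences (random ids stand in for real data).
batch = torch.randint(0, 8192, (2, 512))
optimizer.zero_grad()
logits = model(batch[:, :-1])          # predict the token at each next position
loss = loss_fn(logits.reshape(-1, 8192), batch[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))
```

The same mechanism serves at test time: a visual prompt is tokenized, fed to the model, and the task output is read off from the tokens the model generates to continue the sequence.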
Empirical Findings and Potential
The experiments reveal several notable behaviors: as model size and data quantity increase, performance on numerous standard vision tasks improves. The model also shows promise in handling out-of-distribution data and performing novel tasks, although these areas require further examination. The studies further indicate that the model's success stems not only from the volume of data but also from the diversity of the training set, underscoring the importance of a rich and varied corpus. These results mark a significant step toward scalable and versatile vision models and suggest a promising direction for future research in artificial visual perception.